How is the reference data being generated?

awatson1978 commented 7 years ago

Hi again. So, everything is going swimmingly, and I think I've got this all sorted out so a user can upload and display their 23andMe data.

Question: the reference data. How is it being generated? I'm currently taking an Intro to Genomics class, and we've been covering .sam, .bam, .vcf, .bed, and .fastq files. But I'm having a bit of trouble following what's going on in the /data directory.

/data/annotations/10000_virtual_snvs.json... that's from the 10,000 Genomes Project, right? Is that basically a JSON version of a .vcf file?

eweitz commented 7 years ago

Question: the reference data. How is it being generated?

In brief, the reference data ultimately comes from the National Center for Biotechnology Information (NCBI). Ideogram's dataflow scripts simply download and transform public data from NCBI.

Further details are below, if you're interested.

The reference chromosomes fall into three categories of decreasing resolution: those with cytogenetic band data, those with centromere data, and those without centromere data.

Also, although it is generally better to use accessioned upstream data, it is possible to see chromosomes with any of those resolutions by using draft or custom data.

Chromosomes with cytogenetic band data

Cytogenetic band data for human, mouse and rat comes from the "Ideogram Data" tab atop the Genome Decoration Page (GDP) at NCBI, i.e. the FTP directory ftp://ftp.ncbi.nlm.nih.gov/pub/gdp/. The Ideogram examples for those organisms use a naive JSON transformation of those TSV files from GDP. For example, the cytogenetic band data for human in /data/bands/native is a JSON conversion of a TSV file in /data/bands/ncbi, which I copied from that upstream GDP FTP page. The TSV-to-JSON conversion is done by convert_band_data.py.
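The TSV-to-JSON step described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of convert_band_data.py: the column names, the sample rows, and the `bands_tsv_to_json` function are assumptions made for the example.

```python
import csv
import io
import json

# Assumed header and illustrative rows, modeled on a GDP-style ideogram TSV.
SAMPLE_TSV = (
    "#chromosome\tarm\tband\tiscn_start\tiscn_stop\tbp_start\tbp_stop\tstain\tdensity\n"
    "1\tp\t36.33\t0\t100\t1\t2300000\tgneg\t\n"
    "1\tp\t36.32\t100\t244\t2300001\t5300000\tgpos\t25\n"
)

# Columns that hold coordinates and should become integers in the JSON.
INT_FIELDS = ("iscn_start", "iscn_stop", "bp_start", "bp_stop")


def bands_tsv_to_json(tsv_text):
    """Naively convert band rows from TSV text into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    bands = []
    for row in reader:
        # Drop the leading "#" that marks the header line in the TSV.
        band = {key.lstrip("#"): value for key, value in row.items()}
        for field in INT_FIELDS:
            band[field] = int(band[field])
        bands.append(band)
    return json.dumps(bands, indent=2)


print(bands_tsv_to_json(SAMPLE_TSV))
```

Each TSV row becomes one JSON object keyed by column name, which is about as naive as a transformation can be.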

Cytogenetic band data exists only for a few high-value organisms. The remaining reference data falls into two categories: organisms that have a genome assembly with centromere data in AGP files, and organisms that do not.

Chromosomes with centromere data

For example, chimpanzee (Pan troglodytes) has centromere data. More precisely, the AGP files for the chimpanzee assembly Pan_tro_3.0 (GCF_000001515.7) contain a row whose gap type column reads centromere, e.g. in chr1.agp.gz here. You can also get to those files by going to that genome's NCBI Assembly page and clicking on the "Download the RefSeq assembly" link on the right, then following the links indicated by the previous ("here") URL.

Roughly 15 organisms have such data. get_chromosomes.py searches for, downloads and formats genome assemblies whose AGP files contain centromere data.
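As a rough sketch of what detecting centromere data in an AGP file involves (this is not the code in get_chromosomes.py), the example below scans AGP 2.0 rows for gap lines whose gap type is centromere. The sample rows, the component names and the `find_centromeres` function are made up for illustration.

```python
# Illustrative AGP 2.0 content; component IDs and coordinates are invented.
SAMPLE_AGP = """\
##agp-version\t2.0
chr1\t1\t1000\t1\tW\tEXAMPLE01.1\t1\t1000\t+
chr1\t1001\t4000\t2\tN\t3000\tcentromere\tno\tna
chr1\t4001\t5000\t3\tW\tEXAMPLE02.1\t1\t1000\t+
"""


def find_centromeres(agp_lines):
    """Return (object, start, stop) for each gap row of type 'centromere'."""
    centromeres = []
    for line in agp_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        component_type = cols[4]
        # In gap rows (component type N or U), column 7 holds the gap type.
        if component_type in ("N", "U") and cols[6] == "centromere":
            centromeres.append((cols[0], int(cols[1]), int(cols[2])))
    return centromeres


print(find_centromeres(SAMPLE_AGP.splitlines()))  # [('chr1', 1001, 4000)]
```

For a downloaded file such as chr1.agp.gz, the lines could be read with `gzip.open(path, "rt")` and passed to the same function.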

Chromosomes without centromere data

Mosquito (Anopheles gambiae) and several hundred other organisms lack assemblies with centromere data, but do have sufficient data to be automatically supported by Ideogram. Here again, the data comes from NCBI Assembly.

In particular, this class of supported organism is defined by having genomes in NCBI Assembly that have an assembly level of "Complete genome" or "Chromosome". Ideogram uses NCBI's EUtils web API to get that data. See the getAssemblyAndChromosomesFromEutils method for implementation details, and Issue #45 for background discussion.
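For illustration only, a query of this kind against the EUtils `esearch` endpoint might be assembled as below. Ideogram's own implementation is the JavaScript method named above; this Python sketch only builds a URL and makes no request. The search term syntax, in particular the `[filter]` values, and the `build_assembly_search_url` function are assumptions and should be checked against the NCBI documentation.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_assembly_search_url(organism):
    """Build an esearch URL for chromosome-level or complete assemblies."""
    # Assumed term syntax: restrict by organism and by assembly level.
    term = (
        f'"{organism}"[Organism] AND '
        '("chromosome level"[filter] OR "complete genome"[filter])'
    )
    params = {"db": "assembly", "term": term, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


print(build_assembly_search_url("Anopheles gambiae"))
```

The returned IDs would then typically be passed to a follow-up EUtils call such as `esummary` to obtain assembly and chromosome details.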

Chromosomes with draft data

Finally, it is also possible to load custom chromosome reference data. I don't think this has been tried, but I imagine it could be useful as a quality assurance step while submitting to an upstream assembly repository like NCBI Assembly. This approach could be used to generate custom ideograms that show cytogenetic bands, centromeres, or simply chromosome lengths.

One would create a TSV file like banana.tsv from one's own not-yet-accessioned data, then run convert_band_data.py to produce a file like banana.js, then include that in a <script> element as done in the Ploidy, rearrangement (source) example.

eweitz commented 7 years ago

/data/annotations/10000_virtual_snvs.json... that's from the 10,000 Genomes Project, right? Is that basically a JSON version of a .vcf file?

That file is actually just a set of virtual, i.e. simulated, single nucleotide variations (SNVs) that are randomly distributed throughout the genome. It's basically test data, generated by create_annots.py and not pulled from any upstream data source like the 1000 Genomes Project or dbSNP.
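To make "randomly distributed" concrete, here is a minimal sketch of generating such simulated SNVs. It is not the code in create_annots.py. The output field names and the `make_virtual_snvs` function are hypothetical, and only two chromosomes are listed to keep the example short.

```python
import json
import random

# GRCh38 lengths for two chromosomes; a full script would list all of them.
CHROMOSOME_LENGTHS = {"1": 248956422, "2": 242193529}


def make_virtual_snvs(count, lengths, seed=0):
    """Place `count` simulated SNVs uniformly at random across the chromosomes."""
    rng = random.Random(seed)  # seeded so that the test data is reproducible
    names = list(lengths)
    weights = [lengths[name] for name in names]
    snvs = []
    for i in range(count):
        # Weight by length so that longer chromosomes receive more SNVs.
        chrom = rng.choices(names, weights=weights, k=1)[0]
        start = rng.randint(1, lengths[chrom])
        snvs.append({"name": f"snv_{i}", "chr": chrom, "start": start, "stop": start})
    return snvs


print(json.dumps(make_virtual_snvs(3, CHROMOSOME_LENGTHS), indent=2))
```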

I think I've got this all sorted out so a user can upload and display their 23andMe data.

Very cool! Integrating 23andMe into Ideogram's dataflow scripts would be a great addition, if you have a PR in mind.

I've done something somewhat similar for AncestryDNA raw data in analyze_ancestrydna.py. (Though please note that the ClinVar integration is not reliable or official; it has basically been for my personal experimentation.)

awatson1978 commented 7 years ago

Hmmm.... going to have to think through this, and consult with my professor. I may have overstated having gotten the 23andMe data completely working.

I definitely got a .bed file loaded into it, and have effectively defined a genomics test panel:

[Screenshot: 2017-03-06 at 2:27:07 AM]

Now I need to use this to scan the 23andMe results. Exciting stuff!

eweitz commented 7 years ago

Awesome! It sounds like you are well on your way.

One thing to note: 23andMe uses GRCh37, the previous major version of the human reference genome assembly. However, Ideogram defaults to GRCh38, the current major version of the human reference genome. GRCh37 and GRCh38 need to be distinguished because the chromosomes are of different lengths between the two versions; genomic coordinates in GRCh37 are not equivalent to those in GRCh38.

So, if you're not already doing so, you'll likely want to specify assembly: "GRCh37" in the Ideogram configuration, as done in e.g. ancestry_tracks.html.
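As a small illustration of why the two assemblies must be kept apart, the sketch below compares the length of chromosome 1 in each. The lengths are the published totals for chr1; the `is_valid_position` helper is hypothetical.

```python
# Total length of human chromosome 1 in each assembly, in base pairs.
CHR1_LENGTH = {"GRCh37": 249250621, "GRCh38": 248956422}


def is_valid_position(assembly, position):
    """Check whether a 1-based chr1 coordinate exists in the given assembly."""
    return 1 <= position <= CHR1_LENGTH[assembly]


position = 249000000
print(is_valid_position("GRCh37", position))  # True
print(is_valid_position("GRCh38", position))  # False
```

A coordinate that is valid in GRCh37 can fall off the end of the same chromosome in GRCh38. Positions that are in range in both assemblies still do not generally refer to the same base.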

eweitz / ideogram