Closed awatson1978 closed 7 years ago
Question: the reference data. How is it being generated?
In brief, the reference data ultimately comes from the National Center for Biotechnology Information (NCBI). Ideogram's dataflow scripts simply download and transform public data from NCBI.
Further details are below, if you're interested.
The reference chromosomes fall into three categories of decreasing resolution: those with cytogenetic band data, those with centromere data, and those without centromere data.
Also, although it is generally better to use accessioned upstream data, it is possible to see chromosomes with any of those resolutions by using draft or custom data.
Cytogenetic band data for human, mouse and rat comes from the "Ideogram Data" tab atop the Genome Decoration Page (GDP) at NCBI, i.e. the FTP directory ftp://ftp.ncbi.nlm.nih.gov/pub/gdp/. The Ideogram examples for those organisms use a naive JSON transformation of those TSV files from GDP. For example, the cytogenetic band data for human in /data/bands/native is a JSON conversion of a TSV file in /data/bands/ncbi, which I copied from that upstream GDP FTP page. The TSV-to-JSON conversion is done by convert_band_data.py.
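As a rough illustration of what a naive TSV-to-JSON conversion like this can look like, here is a minimal Python sketch. It is not the actual convert_band_data.py: the function name, the column names in the sample header, and the sample rows are assumptions made for the example, and real GDP files may use different columns.

```python
import csv
import io
import json


def bands_tsv_to_json(tsv_text):
    """Convert tab-separated band rows into a JSON array of objects.

    The header row supplies the keys; a leading '#' on the header is dropped.
    """
    lines = tsv_text.strip().splitlines()
    header = lines[0].lstrip("#").split("\t")
    reader = csv.DictReader(
        io.StringIO("\n".join(lines[1:])), fieldnames=header, delimiter="\t"
    )
    return json.dumps([dict(row) for row in reader], indent=2)


# Hypothetical two-row sample, for illustration only.
sample = (
    "#chromosome\tarm\tband\tbp_start\tbp_stop\tstain\n"
    "1\tp\t36.33\t1\t2300000\tgneg\n"
    "1\tp\t36.32\t2300001\t5300000\tgpos\n"
)
print(bands_tsv_to_json(sample))
```

Every value stays a string in this sketch; a real conversion would likely cast coordinates to integers and might emit a more compact structure.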
Cytogenetic band data only exists for a few high-value organisms. The remaining reference data falls into two categories: organisms that have a genome assembly with centromere data in AGP files, and organisms that do not.
For example, chimpanzee (Pan troglodytes) has centromere data. More precisely, the chimpanzee assembly Pan_tro_3.0 (GCF_000001515.7) has a row with a column labeled centromere in its AGP files, e.g. in chr1.agp.gz here. You can also get to those files by going to that genome's NCBI Assembly page and clicking on the "Download the RefSeq assembly" link at right, then following the links indicated by the previous ("here") URL.
Roughly 15 organisms have such data. The search for, download and formatting of genome assemblies with AGPs that contain centromere data is done by get_chromosomes.py.
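To make the centromere lookup concrete, here is a small Python sketch of scanning AGP text for centromere gap rows. It is not how get_chromosomes.py is written; the function name and the sample rows are hypothetical. It relies only on the AGP layout, where column 5 is the component type (N or U for gaps) and, for gap rows, column 7 is the gap type.

```python
def find_centromeres(agp_text):
    """Return (object, start, end) for each gap row whose gap type is 'centromere'."""
    centromeres = []
    for line in agp_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.split("\t")
        # Column 5 is the component type; 'N' and 'U' mark gap rows,
        # and for gap rows column 7 holds the gap type.
        if len(cols) >= 7 and cols[4] in ("N", "U") and cols[6] == "centromere":
            centromeres.append((cols[0], int(cols[1]), int(cols[2])))
    return centromeres


# Hypothetical AGP excerpt; names and coordinates are made up for illustration.
sample_agp = (
    "##agp-version\t2.0\n"
    "chr1\t1\t1000\t1\tW\tcontigA.1\t1\t1000\t+\n"
    "chr1\t1001\t1500\t2\tN\t500\tcentromere\tno\tna\n"
    "chr1\t1501\t2500\t3\tW\tcontigB.1\t1\t1000\t+\n"
)
print(find_centromeres(sample_agp))  # [('chr1', 1001, 1500)]
```

For a downloaded file such as chr1.agp.gz, the text could first be read with Python's gzip module in text mode.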
Mosquito (Anopheles gambiae) and several hundred other organisms lack assemblies with centromere data, but do have sufficient data to be automatically supported by Ideogram. Here again, the data comes from NCBI Assembly.
In particular, this class of supported organism is defined by having genomes in NCBI Assembly that have an assembly level of "Complete genome" or "Chromosome". Ideogram uses NCBI's EUtils web API to get that data. See the getAssemblyAndChromosomesFromEutils method for implementation details, and Issue #45 for background discussion.
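For a sense of what such an EUtils request might look like, here is a hedged Python sketch that only builds an ESearch URL against the assembly database. It is not a port of getAssemblyAndChromosomesFromEutils. The function name is hypothetical, and the search term syntax (the [Organism] field and the two [filter] values) is an assumption that should be checked against NCBI's Assembly search documentation.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def assembly_search_url(organism):
    """Build an ESearch URL for chromosome-level or complete assemblies."""
    # Assumed query syntax; verify against the NCBI Assembly search help.
    term = (
        f'"{organism}"[Organism] AND '
        '("chromosome level"[filter] OR "complete genome"[filter])'
    )
    params = {"db": "assembly", "term": term, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


print(assembly_search_url("Anopheles gambiae"))
```

The sketch stops at URL construction so it can run offline; fetching the URL and following up with ESummary for the returned IDs would be the next steps.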
Finally, it is also possible to load custom chromosome reference data. I don't think this has been tried, but I imagine it could be useful as a quality assurance step while submitting to an upstream assembly repository like NCBI Assembly. This approach could be used to generate custom ideograms that show cytogenetic bands, centromeres, or simply chromosome lengths.
One would create a TSV file like banana.tsv from one's own not-yet-accessioned data, then run convert_band_data.py to produce a file like banana.js, then include that in a <script> element as done in the Ploidy, rearrangement (source) example.
/data/annotations/10000_virtual_snvs.json... that's from the 10,000 Genomes Project, right? Is that basically a JSON version of a .vcf file?
That file is actually just a set of virtual, i.e. simulated, single nucleotide variations (SNVs) that are randomly distributed throughout the genome. It's basically test data, generated by create_annots.py and not pulled from any upstream data source like the 1000 Genomes Project or dbSNP.
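To show the general idea of simulated SNVs scattered randomly through a genome, here is a minimal Python sketch. It is not create_annots.py: the function name, the output field names, and the toy chromosome lengths are all assumptions for illustration.

```python
import json
import random


def make_virtual_snvs(chrom_lengths, n, seed=0):
    """Place n simulated SNVs at uniformly random positions across chromosomes."""
    rng = random.Random(seed)  # fixed seed keeps the test data reproducible
    names = list(chrom_lengths)
    weights = [chrom_lengths[name] for name in names]
    snvs = []
    for i in range(n):
        # Pick a chromosome in proportion to its length, then a 1-based position on it.
        chrom = rng.choices(names, weights=weights, k=1)[0]
        pos = rng.randint(1, chrom_lengths[chrom])
        snvs.append({"name": f"snv{i + 1}", "chr": chrom, "start": pos})
    return snvs


# Toy chromosome lengths, not real assembly values.
toy_lengths = {"1": 1000, "2": 800, "X": 600}
print(json.dumps(make_virtual_snvs(toy_lengths, 5), indent=2))
```

Weighting the chromosome choice by length makes positions uniform over the whole genome rather than uniform per chromosome.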
I think I've got this all sorted out so a user can upload and display their 23andMe data.
Very cool! Integrating 23andMe into Ideogram's dataflow scripts would be a great addition, if you have a PR in mind.
I've done something somewhat similar for AncestryDNA raw data in analyze_ancestrydna.py. (Though please note that the ClinVar integration is neither reliable nor official; it has basically been for my personal experimentation.)
Hmmm.... going to have to think through this, and consult with my professor. I may have overstated having gotten the 23andMe data completely working.
I definitely got a .bed file loaded into it, and have effectively defined a genomics test panel:
Now I need to use this to scan the 23andMe results. Exciting stuff!
Awesome! It sounds like you are well on your way.
One thing to note: 23andMe uses GRCh37, the previous major version of the human reference genome assembly. However, Ideogram defaults to GRCh38, the current major version of the human reference genome. GRCh37 and GRCh38 need to be distinguished because the chromosomes are of different lengths between the two versions; genomic coordinates in GRCh37 are not equivalent to those in GRCh38.
So, if you're not already doing so, you'll likely want to specify assembly: "GRCh37" in the Ideogram configuration, as done in e.g. ancestry_tracks.html.
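For the scanning step itself, here is a rough Python sketch of checking 23andMe positions against a BED panel. It assumes both files use the same assembly (GRCh37 here), and it assumes the usual tab-separated raw-data layout of rsid, chromosome, position, genotype with # comment lines. The function names and sample rows are hypothetical. The main thing it illustrates is the coordinate convention: BED intervals are 0-based and half-open, while the positions are treated as 1-based.

```python
def parse_bed(bed_text):
    """Parse BED lines into (chrom, start, end); starts are 0-based, ends exclusive."""
    regions = []
    for line in bed_text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        # Drop a leading 'chr' so names match the bare chromosome labels.
        chrom = chrom[3:] if chrom.startswith("chr") else chrom
        regions.append((chrom, int(start), int(end)))
    return regions


def scan_23andme(raw_text, regions):
    """Return (rsid, chrom, pos, genotype) rows that fall inside any region."""
    hits = []
    for line in raw_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rsid, chrom, pos, genotype = line.split("\t")[:4]
        pos = int(pos)  # treated as a 1-based position
        for r_chrom, start, end in regions:
            # A 1-based pos lies in a 0-based half-open [start, end) when start < pos <= end.
            if chrom == r_chrom and start < pos <= end:
                hits.append((rsid, chrom, pos, genotype))
                break
    return hits


# Hypothetical panel and raw-data excerpts, for illustration only.
sample_bed = "chr1\t100\t200\tpanelA\n"
sample_raw = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0001\t1\t150\tAG\n"
    "rs0002\t1\t250\tCC\n"
    "rs0003\t2\t150\tTT\n"
)
print(scan_23andme(sample_raw, parse_bed(sample_bed)))  # [('rs0001', '1', 150, 'AG')]
```

The nested loop is fine for a small panel; a large panel would benefit from sorting regions per chromosome and using binary search.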
Hi again, So, everything is going swimmingly, and I think I've got this all sorted out so a user can upload and display their 23andMe data.
Question: the reference data. How is it being generated? I'm currently taking an Intro to Genomics class, and we've been covering .sam, .bam, .vcf, .bed, and .fastq files. But I'm having a bit of trouble following what's going on in the /data directory. /data/annotations/10000_virtual_snvs.json... that's from the 10,000 Genomes Project, right? Is that basically a JSON version of a .vcf file?