Development of Yeast Ideogram

eweitz / ideogram

Chromosome visualization for the web

https://eweitz.github.io/ideogram

Other

294 stars 71 forks source link

Development of Yeast Ideogram #239

Closed vsoch closed 3 years ago

vsoch commented 4 years ago

hey @eweitz! As promised, I had a chance to start working on deriving data to generate the yeast ideaogram with expression data. I have some questions that I'm hoping you might be able to help with, or point me to who can. First, what I've done.

I found an export file for SGD so I could get a complete list of features, including names, start/stop, chromosome, etc.
I read in the human equivalent so that I know to generate annotations in that structure. It appears to be a top level dictionary with keys and annots, and then each entry in annots is a dictionary with a chromosome (chr) label and a list of annots, each with data corresponding to what is stated in the keys.
The keys for the chromosomes need to correspond with the ones in the data file you produced (e.g., roman numerals and not numbers).
I reproduced this file for yeast, and then was able to generate https://vsoch.github.io/yeast-ideogram/

I ran into some questions along the way - and I'm hoping you might have comment!

The human-derived data has "gene-type" as an integer, but it's not clear what this corresponds to. I put "ORF" but it's probably wrong. What lookup can I use, and what is best practice for matching to genes?
When I do a type histogram even with expression-level 0, the ideogram is still generated showing expression values (I assume this is what it's showing). Why is that?
Why do some genes, when parsing the chromosome start and end coordinates, return a negative value (meaning end is less than start)? For the time being, I didn't include these, because it didn't make sense. Possibly the coordinates are in different units?
Does 2-micro correspond to the chromosome called MT?

I've provided the code for the interface in the repo, and please take note of the data folder for data and this script for how I generated the annotations file. My goal is to have a visualization that makes sense, which right now probably means figuring out what the histogram is actually showing. Then I would want an interface that a user can quickly search for a gene in context of a dataset, and then see it be highlighted in the plot. This would be easy to do with standard javascript if the genes had a unique id that could quickly be grabbed. Thank you for your wisdom!

vsoch commented 4 years ago

hey @eweitz did you have any thoughts about this? I can do any development work needed if you point me to the right places to look.

eweitz commented 4 years ago

Sorry to delay! Your draft looks great so far.

The human-derived data has "gene-type" as an integer, but it's not clear what this corresponds to. I put "ORF" but it's probably wrong. What lookup can I use, and what is best practice for matching to genes?

The gene-type integer in the annotations JSON should align with the filter map value specified in the client HTML. See e.g JSON in SRR562646.json and filterMap in annotations-histogram.html#L131. Those files power the "Annotations, histogram" example.

To the best of my recollection that JSON was generated via formatter.py. Incidentally, I see what looks like a mismatch between the gene-type labels and integers there and in annotations-histogram.html. I'll aim to amend the labels, but the gist is the same. If you want to enable interactive filtering in your histogram, things like "ORF" would need to be an integer with an appropriate fitlerMap, etc., as done in annotations-histogram.html.

I hope those pointers help.

When I do a type histogram even with expression-level 0, the ideogram is still generated showing expression values (I assume this is what it's showing). Why is that?

By default, the bars in Ideogram histograms show the number of annotations in that range (details). If you do client-side filters to exclude annotations with no expression, then the bars would represent the number of annotations with some expression -- but by the default the histogram heights correspond to the number of raw features in the annotations JSON.

Why do some genes, when parsing the chromosome start and end coordinates, return a negative value (meaning end is less than start)? For the time being, I didn't include these, because it didn't make sense. Possibly the coordinates are in different units?

This might be due to the gene being encoded on the antisense DNA strand, as in e.g. human MAPT-AS1.

Does 2-micro correspond to the chromosome called MT?

I don't believe so. From a glance at literature, "2-micro" seems to be another type of non-nuclear DNA: a plasmid. Could you link me to where I can find your reference genome information -- i.e. a URL to where you see this?

vsoch commented 4 years ago

I can't recall where I downloaded the data file from, but it's a link somewhere on SGD, and it provided an SGD_features.tab and SGD_features.README. Here is the README

SGD_features.tab

The latest version of the SGD_features.tab file is based on Genome Version R64-2-1.

The SGD_features.tab file is updated weekly (Saturday).

NOTE: On 4 September 2004, the SGD_features.tab file replaced the previously
used chromosomal_feature.tab file.

File contents:

1. Information on current chromosomal features in SGD, including Dubious ORFs. 
Also contains coordinates of intron, exons, and other subfeatures that are located
within a chromosomal feature.

2. The relationship between subfeatures and the feature in which they
are located is identified by the feature name in column #7 (parent
feature). For example, the parent feature of the intron found in
ACT1/YFL039C will be YFL039C. The parent feature of YFL039C is
chromosome 6.

3. The coordinates of all features are in chromosomal coordinates.

Columns within SGD_features.tab:

1.   Primary SGDID (mandatory)
2.   Feature type (mandatory)
3.   Feature qualifier (optional)
4.   Feature name (optional)
5.   Standard gene name (optional)
6.   Alias (optional, multiples separated by |)
7.   Parent feature name (optional)
8.   Secondary SGDID (optional, multiples separated by |)
9.   Chromosome (optional)
10.  Start_coordinate (optional)
11.  Stop_coordinate (optional)
12.  Strand (optional)
13.  Genetic position (optional)
14.  Coordinate version (optional)
15.  Sequence version (optional)
16.  Description (optional)

Note that "chromosome 17" is the mitochondrial chromosome.

The SGD_features.tab file is complemented by GFF3 file saccharomyces_cerevisiae.gff

And then, for example, a row in the .tab file might look like

['S000002143',
 'ORF',
 'Dubious',
 'YAL069W',
 '',
 '',
 'chromosome 1',
 '',
 '1',
 '335',
 '649',
 'W',
 '',
 '1996-07-31',
 '1996-07-31',
 'Dubious open reading frame; unlikely to encode a functional protein, based on available experimental and comparative sequence data']

And then I map the fields as follows to generate the data for the visualization:

    # Indices of data we need are:
    # chromosome (line[8])
    # name (line[3])
    # start (line[9])
    # length (stop-start) or (line[10]-line[9])
    # expression-level: 0
    # gene-type

And for annotations, since it seems to be in the format of a roman numeral, I have a function to do the conversion of the chromosome name. But when I found that 2=micron I wasn't sure how to convert and called it MT :) You can see the full generation here https://github.com/vsoch/yeast-ideogram/blob/master/data/generate_yeast_data.py. So I understand how I'd derive an expression level for some dataset (normalize the expression and then bin it) but I think I need to figure out what gene-type should be. Maybe feature type, number 2? Here is the full set of unique values for the .tab file:

{'ARS',
 'ARS_consensus_sequence',
 'CDS',
 'LTR_retrotransposon',
 'ORF',
 'W_region',
 'X_element',
 'X_element_combinatorial_repeat',
 'X_region',
 'Y_prime_element',
 'Y_region',
 'Z1_region',
 'Z2_region',
 'blocked_reading_frame',
 'centromere',
 'centromere_DNA_Element_I',
 'centromere_DNA_Element_II',
 'centromere_DNA_Element_III',
 'external_transcribed_spacer_region',
 'five_prime_UTR_intron',
 'gene_group',
 'intein_encoding_region',
 'internal_transcribed_spacer_region',
 'intron',
 'long_terminal_repeat',
 'mating_type_region',
 'matrix_attachment_site',
 'ncRNA_gene',
 'non_transcribed_region',
 'noncoding_exon',
 'not in systematic sequence of S288C',
 'not physically mapped',
 'origin_of_replication',
 'plus_1_translational_frameshift',
 'pseudogene',
 'rRNA_gene',
 'silent_mating_type_cassette_array',
 'snRNA_gene',
 'snoRNA_gene',
 'tRNA_gene',
 'telomerase_RNA_gene',
 'telomere',
 'telomeric_repeat',
 'transposable_element_gene'}

eweitz commented 4 years ago

Thanks for the overview -- the extra context helps.

what gene-type should be. Maybe feature type, number 2?

That makes sense. I might even use a subset of those, only including features that have a feature type that contains the string "gene" ("gene" in line[2], or some such). That's because features like "non_transcribed_region" and "telomere" would not be genes, i.e. not expressed.

The approach might need refinement -- I would expect to see something like "mRNA_gene" or "protein_coding_gene" -- but might be a good start. Printing to stdout the number of features by feature type would give a sense of their distribution, and thus maybe reveal large missed classes of genes.

vsoch commented 4 years ago

A quick update! I updated the script to use a subset of the features (ORF + those with gene in the name) and still grabbed a set of counts across the entire data:

{
    "ORF": 6604,
    "CDS": 7074,
    "ARS": 352,
    "telomere": 32,
    "telomeric_repeat": 31,
    "X_element": 32,
    "X_element_combinatorial_repeat": 28,
    "long_terminal_repeat": 383,
    "ARS_consensus_sequence": 196,
    "intron": 377,
    "ncRNA_gene": 18,
    "noncoding_exon": 484,
    "tRNA_gene": 299,
    "snoRNA_gene": 77,
    "centromere": 16,
    "centromere_DNA_Element_I": 16,
    "centromere_DNA_Element_II": 16,
    "centromere_DNA_Element_III": 16,
    "transposable_element_gene": 91,
    "LTR_retrotransposon": 50,
    "pseudogene": 12,
    "Y_prime_element": 19,
    "five_prime_UTR_intron": 24,
    "plus_1_translational_frameshift": 47,
    "matrix_attachment_site": 8,
    "gene_group": 3,
    "snRNA_gene": 6,
    "rRNA_gene": 27,
    "external_transcribed_spacer_region": 8,
    "internal_transcribed_spacer_region": 8,
    "non_transcribed_region": 3,
    "blocked_reading_frame": 6,
    "origin_of_replication": 8,
    "telomerase_RNA_gene": 1,
    "silent_mating_type_cassette_array": 2,
    "W_region": 2,
    "X_region": 3,
    "Y_region": 3,
    "Z1_region": 3,
    "Z2_region": 2,
    "mating_type_region": 1,
    "intein_encoding_region": 1,
    "not physically mapped": 10,
    "not in systematic sequence of S288C": 55
}

And here are the counts, most of the gene names are rather tiny, the only one I'm including (which I'm not sure if I should) is ORF.

Screenshot_2020-10-22 feature_counts - Jupyter Notebook

Okay! Here is the updated interface! https://vsoch.github.io/yeast-ideogram/. There are two warnings:

Chromosome "MT" undefined in ideogram; 2 annotations not shown
Chromosome "XVII" undefined in ideogram; 54 annotations not shown

I'm also wondering if we need to better group or select from the features (e.g., what we are currently using for gene-types). Once this looks good to you, what I'll do is modify it to be able to read in actual expression values (right now they are randomly assigned to levels).

vsoch commented 4 years ago

okay, went ahead and tried creating the same for a dataset anyway! I was able to simply read in a data file and then generate named bins that match to the expression levels. I also updated the README with some of our notes. Here is the one for the dataset, doesn't look remarkably different! https://vsoch.github.io/yeast-ideogram/dataset.html

eweitz commented 4 years ago

The graph and new examples are illuminating!

CDS (CoDing Sequence) was clearly the big class of missing genes.

I see no other feature types that are genes but wouldn't be covered by something like "gene" in line[2]. ORF (Open Reading Frame), ARS (Autonomously Replicating Sequence), long_terminal_repeat (LTRs), etc. are all not typically classified as genes. A simple parsing rule like line[2] === "CDS" or "gene" in line[2] might precisely extract feature types that are genes.

I see in your latest example that ORF, ARS, long terminal repeat, and gene group are all filters in the gene type facet. It might be worth separating those into a different facet, e.g. "other feature types" -- or at least worth verifying with a biologist at SGD that those types are indeed considered gene types by folks interested in yeast gene expression.

SGD seems to use its own genome annotation format. You might also be interested in NCBI documentation on feature types and annotations. Ensembl is the other big provider of genome annotations, and they use basically the same format as NCBI.

Okay! Here is the updated interface! https://vsoch.github.io/yeast-ideogram/. There are two warnings: Chromosome "MT" undefined in ideogram; 2 annotations not shown Chromosome "XVII" undefined in ideogram; 54 annotations not shown

The MT warning can be fixed by adding showNonNuclearChromosomes: true to your Ideogram configuration object.

The chromosome XVII warning is intriguing! Researching, I find:

Mortimer and Schild published the first comprehensive genetic map of S. cerevisiae in 1980, which included 17 chromosomes and discussed the possible existence of an eighteenth (Mortimer and Schild 1980). Klapholz and Esposito (1982) soon showed that “chromosome XVII” was actually the left arm of chromosome XIV.

Engel SR, Dietrich FS, Fisk DG, et al. The reference genome sequence of Saccharomyces cerevisiae: then and now. G3 (Bethesda). 2014;4(3):389-398. Published 2014 Mar 20. doi:10.1534/g3.113.008995

So the presence of chromosome XVII in that dataset might be an anomaly or historical artifact.

vsoch commented 4 years ago

Okay, next round of changes ready! I did the following:

only included CDS, and anything with "gene," and added a new catch all for "other types." Since I want to match integers in the interface I still based it on matching full names (and not "if gene in feature")
I probably don't have the expertise to submit one of those files, but I could help if someone else is interested.
showNonNuclearChromosomes set to true (I had it commented out)
that's super interesting! I wanted to look into this a bit. The SGD features file includes these specific entities that were given for this chromosome XVII

[['tP(UGG)Q', 731, 71, 1, 7], ['Q0010', 3952, 386, 1, 11], ['Q0017', 4254, 161, 1, 11], ['Q0020', 6546, 1648, 1, 4], ['tW(UCA)Q', 9374, 73, 1, 7], ['Q0032', 11667, 290, 1, 11], ['Q0050', 13818, 2504, 1, 11], ['Q0055', 13818, 5012, 1, 11], ['Q0060', 13818, 6178, 1, 11], ['Q0065', 13818, 8117, 1, 11], ['Q0070', 13818, 9349, 1, 11], ['Q0045', 13818, 12883, 1, 11], ['Q0075', 24156, 1099, 1, 11], ['Q0080', 27666, 146, 1, 11], ['Q0085', 28487, 779, 1, 11], ['ORI7', 30220, 374, 1, 11], ['Q0092', 30874, 140, 1, 11], ['ORI2', 32231, 270, 1, 11], ['tE(UUC)Q', 35373, 71, 1, 7], ['Q0110', 36540, 2039, 1, 11], ['Q0115', 36540, 3725, 1, 11], ['Q0120', 36540, 5711, 1, 11], ['Q0105', 36540, 7107, 1, 11], ['ORI6', 45227, 2700, 1, 11], ['Q0130', 46723, 230, 1, 11], ['tS(UGA)Q2', 48201, 89, 1, 7], ['Q0140', 48901, 1196, 1, 11], ['Q0142', 51052, 176, 1, 11], ['Q0143', 51277, 152, 1, 11], ['Q0144', 54109, 329, 1, 11], ['Q0158', 58009, 4438, 1, 4], ['Q0160', 61022, 707, 1, 11], ['tT(UGU)Q1', 63862, 75, 1, 7], ['tC(GCA)Q', 64415, 75, 1, 7], ['tH(GUG)Q', 64596, 74, 1, 7], ['Q0182', 65770, 404, 1, 11], ['tL(UAA)Q', 66095, 84, 1, 7], ['tQ(UUG)Q', 66210, 75, 1, 7], ['tK(UUU)Q', 67061, 73, 1, 7], ['tR(UCU)Q1', 67309, 72, 1, 7], ['tG(UCC)Q', 67468, 74, 1, 7], ['tD(GUC)Q', 68322, 74, 1, 7], ['tS(GCU)Q1', 69203, 85, 1, 7], ['tR(ACG)Q2', 69289, 73, 1, 7], ['tA(UGC)Q', 69846, 75, 1, 7], ['tI(GAU)Q', 70162, 75, 1, 7], ['tY(GUA)Q', 70824, 83, 1, 7], ['tN(GUU)Q', 71433, 70, 1, 7], ['tM(CAU)Q1', 72630, 75, 1, 7], ['Q0250', 73758, 755, 1, 11], ['Q0255', 74495, 1489, 1, 11], ['tF(GAA)Q', 77431, 74, 1, 7], ['tV(UAC)Q', 78533, 75, 1, 7], ['Q0275', 79213, 809, 1, 11], ['ORI5', 82329, 271, 1, 11], ['tM(CAU)Q2', 85035, 77, 1, 7], ['Q0285', 85295, 482, 1, 2], ['Q0297', 85554, 155, 1, 11]]

I found where the file came from - the last update was 2019, but indeed you can search for these in the file and find that they are flagged with chromosome 17. I was first going to remove them, but would it be more correct to change them to be on the correct chromosome? I'll try that for now, and print out a warning that it's the case.

Interesting, all of the CDS type (7 something K) don't have an assigned chromosome so the graph shows as empty. The samei is true for "gene group" so I put these into the catch all "other types" because they will likely never appear on the plot.

https://vsoch.github.io/yeast-ideogram/

A few more questions for you!

Why is the bottom half of VII empty - is this an artifact in my data (the SGD features, for example?)
If I would want to do some custom callback (e.g., when a user selects a box to get a particular set of annotations) and then possibly show that list and values in a table on the right, where would be the right entrypoint?

I think I'm almost ready to integrate it into a more proper UI, alongside a tool / database!

eweitz commented 4 years ago

I found where the file came from - the last update was 2019, but indeed you can search for these in the file and find that they are flagged with chromosome 17.

I see the problem. That link was key -- thank you. For our convenience, full links to relevant files:

SGD_features.README says:

Note that "chromosome 17" is the mitochondrial chromosome.

So chromosome 17 is MT (sometimes rendered as "chrMT" or "chrmt"), not a relic "chrXVII". This relates to a prior comment that I likely didn't make clear enough -- 2-micron is not MT. It is a plasmid, which is another non-nuclear chromosome.

So the yeast genome has two non-nuclear chromosomes: MT and 2-micron.

I recommend omitting 2-micron and its features for now. Passing over those in your script would simplify things until Ideogram.js supports 2-micron.

Why is the bottom half of VII empty - is this an artifact in my data (the SGD features, for example?)

Likely cause is the chromosome list in yeast-annots-random.json, which is missorted and has a duplicate. Via console in https://vsoch.github.io/yeast-ideogram/:

ideogram.rawAnnots.annots.map((annotsByChr) => annotsByChr.chr); (18) ["I", "II", "III", "IV", "IX", "MT", "V", "VI", "VII", "VIII", "X", "XI", "XII", "XIII", "XIV", "XIV", "XV", "XVI"]

Oh Roman numerals :). Since you're dealing only with yeast, I might just manually define the sorted chromosomes list in Python and use index, e.g.

chrs = [
  "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X", 
  "XI", "XII", "XIII", "XIV", "XV", "XVI",
  "MT"
]

print(chrs.index("IX")) # Prints: 8

If you prefer a more involved but systematic approach, Ideogram.js has a chromosome sorting algorithm that could be ported from JavaScript to Python. It handles diverse naming conventions with typical (Arabic) numerals, Roman numerals, and non-nuclear chromosomes. See isRoman, parseRoman, and sortChromosomes.

I will likely add support for 2-micron, but can't commit to a timeline due to busy Ideogram development elsewhere.

If I would want to do some custom callback (e.g., when a user selects a box to get a particular set of annotations) and then possibly show that list and values in a table on the right, where would be the right entrypoint?

Calling filterAnnots as done in the Annotations, histogram or Differential expression examples via a plain-old event handler should work. The former is what your prototype is based on, and the latter expands on the former.

See filterGenes and writeTable in the "Differential expression" example. In particular, ideogram.filteredAnnots returns the list of filtered annotations.

vsoch commented 4 years ago

A quick update! There isn't strong interest to add this to the application I'm working on, so I think we should leave the issue open and I'm happy to come back and continue development if/when someone else is interested. Would this be okay with you?

eweitz commented 4 years ago

OK, I'll keep this open for a while. Thanks for exploring -- and helping improve -- Ideogram for yeast!