AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Handle taxonomy id edge cases for strains #2123

Open kurtwheeler opened 4 years ago

kurtwheeler commented 4 years ago

Context

#1722

https://github.com/AlexsLemonade/refinebio/pull/2035 https://www.ebi.ac.uk/ena/data/view/SRS663100&display=xml

We now support organisms with strains by using a single strain's transcriptome index for the whole species. Which strain to use is chosen by an expert. However, this is complicated because NCBI treats each strain as a separate organism with its own taxonomy id.

Problem or idea

It seems like generally ENA returns the taxonomy id for the organism rather than the strain, but I found some edge cases where the only information we have on the species is:

<TAXON_ID>511145</TAXON_ID>
<SCIENTIFIC_NAME>Escherichia coli str. K-12 substr. MG1655</SCIENTIFIC_NAME>

(https://www.ebi.ac.uk/ena/data/view/SRS663100&display=xml)

This means that we can't easily tell that this sample has the organism "E. coli", so we end up thinking that we don't have a transcriptome index for it.

Solution or next step

@jaclyn-taroni has agreed that trying to use the above metadata to map to "E. coli" via some manipulation of the SCIENTIFIC_NAME would be messy and bug-prone. Therefore, the only solution we have at the moment is to create a mapping from strain-specific taxonomy ids to the taxonomy id for the organism (i.e. https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=info&id=562 rather than https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=info&id=511145).
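A minimal sketch of what such a mapping could look like, assuming it lives in a plain Python dict. The names `STRAIN_TO_ORGANISM_TAXON_ID` and `resolve_organism_taxon_id` are hypothetical and not part of the refinebio codebase; the single entry is the E. coli example from this issue.

```python
# Hypothetical curated mapping: strain-level taxonomy id -> organism-level taxonomy id.
# Only the pair discussed in this issue is listed; a real table would be curated by an expert.
STRAIN_TO_ORGANISM_TAXON_ID = {
    511145: 562,  # Escherichia coli str. K-12 substr. MG1655 -> Escherichia coli
}


def resolve_organism_taxon_id(taxon_id: int) -> int:
    """Return the organism-level id for a known strain id, else the id unchanged."""
    return STRAIN_TO_ORGANISM_TAXON_ID.get(taxon_id, taxon_id)
```

Ids that are not in the mapping pass through unchanged, so samples that already carry the organism-level id would be unaffected.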

cgreene commented 4 years ago

Should we loop Georgia from the initial ticket in? I imagine she could add that to the spreadsheet if we defined a format, or could add an additional spreadsheet with something like a primary tax id in one column and "other tax id" in another column.

georgiadoing commented 4 years ago

The only thing I can think of is to try to use the 'Lineage' visible on the NCBI Taxonomy Browser pages to traverse the tree upward, through parent nodes, until the rank of 'species' is identified, and then use that taxon id. I am not sure if the lineage tree is easily accessible like that, but I think @cgreene's suggestion of having a spreadsheet, like a severely abridged lineage, that contains the essential information (e.g. the species-level taxon id and the corresponding strain and sub-strain level or unranked taxon ids) would work just as well. Given the small number of species we are starting with, that is something I could definitely curate manually. As more and more species have enough data to be worth compiling into compendia, I'm not sure if I can promise scalability with a spreadsheet I make though :)
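The upward traversal described here can be sketched as follows, assuming the taxonomy is available locally as a table of parent links and ranks. `TAXONOMY_NODES` and `find_species_taxon_id` are hypothetical names, and the table is a hand-written illustrative subset (its parent links and rank labels are assumptions for the example, not data pulled from NCBI).

```python
from typing import Dict, Optional, Tuple

# taxon id -> (parent taxon id, rank).
# Hand-written illustrative subset; links and ranks are assumptions, not NCBI output.
TAXONOMY_NODES: Dict[int, Tuple[int, str]] = {
    511145: (83333, "no rank"),  # E. coli str. K-12 substr. MG1655
    83333: (562, "no rank"),     # E. coli K-12
    562: (561, "species"),       # Escherichia coli
    561: (543, "genus"),         # Escherichia
}


def find_species_taxon_id(
    taxon_id: int, nodes: Dict[int, Tuple[int, str]] = TAXONOMY_NODES
) -> Optional[int]:
    """Walk parent links upward until a node with rank 'species' is found."""
    seen = set()  # guards against cycles in a malformed table
    current = taxon_id
    while current in nodes and current not in seen:
        seen.add(current)
        parent, rank = nodes[current]
        if rank == "species":
            return current
        current = parent
    return None  # ran off the table without meeting a species-rank node
```

Returning `None` rather than guessing leaves it to the caller to decide what to do with ids that sit above the species level or are missing from the table.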

cgreene commented 4 years ago

@georgiadoing : let me see if I can re-describe this. You're saying go from this:

cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Escherichia; Escherichia coli; Escherichia coli K-12

and remove elements until you reach the key termed "species" (at least the mouseover tooltip says that) - in this case Escherichia coli - and use whatever the transcriptome index is for that taxonomy id. Is that right?

Would it be possible for you to curate the Pseudomonas aeruginosa strains that have RNA-seq data available to the taxonomy ID that we used to run the PAO1 samples? This way we can check to make sure that whatever solution gets developed is consistent with running those strains against that transcriptome index.

georgiadoing commented 4 years ago

@cgreene Yes to both! I think you described well what I had in mind and also I will curate the Pseudomonas aeruginosa strains. It will be good for me to familiarize myself with what Pseudomonas aeruginosa strains have data out there anyway, and what kinds of things people have named them!

georgiadoing commented 4 years ago

@cgreene @kurtwheeler I have updated the google sheet listing assemblies to include a page on Pseudomonas aeruginosa taxonomy IDs, including a handful of IDs that I would approve to be included in a Pseudomonas aeruginosa compendium and a couple that I would not. Let me know if this is helpful or if you'd like more information.

https://docs.google.com/spreadsheets/d/1Lbi68UP2dQtfp-KoxtXpE7jhCOxgP_FweGznbbOiMkw/edit?usp=sharing

cgreene commented 4 years ago

@georgiadoing : ok - I am looking at sheet 2 (direct link, I think, is this: https://docs.google.com/spreadsheets/d/1Lbi68UP2dQtfp-KoxtXpE7jhCOxgP_FweGznbbOiMkw/edit#gid=664081881 ).

Can you help me with those headers? My current read is: anything in the subtree of 287 is a yes. Anything else within that subtree is a no. Otherwise, everything within 208964;652611;208963;100974;1279007 and their subtrees are also a yes. Is that correct?

georgiadoing commented 4 years ago

@cgreene Yes, short and sweet version: subtree 287 is a 'yes'.

I included the IDs of 5 children of subtree 287 in case the tree cannot be automatically traversed, or as examples of subtrees of 287 that would be good to capture as well. The 'no' IDs are subtrees that branch above 287 but still have 'pseudomonas' in the name, which I thought of as examples of things that might reasonably but errantly be captured.
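The "subtree 287 is a yes" rule could be checked with a sketch like the one below, assuming parent links are available as a simple lookup. `PARENT` and `is_in_subtree` are hypothetical names. The two child IDs come from the list above; 286 and 294 stand in for a shared ancestor and a sibling branch, and those links are simplified assumptions for illustration.

```python
from typing import Dict

# taxon id -> parent taxon id. Hand-written, simplified table for illustration only.
PARENT: Dict[int, int] = {
    208964: 287,  # child of 287, from the list above
    208963: 287,  # child of 287, from the list above
    287: 286,     # assumed ancestor of 287
    294: 286,     # assumed sibling branch that should NOT be captured
}


def is_in_subtree(taxon_id: int, root_id: int, parents: Dict[int, int] = PARENT) -> bool:
    """Return True if root_id is taxon_id itself or one of its ancestors."""
    seen = set()  # guards against cycles in a malformed table
    current = taxon_id
    while current is not None and current not in seen:
        if current == root_id:
            return True
        seen.add(current)
        current = parents.get(current)  # None once we leave the table
    return False
```

With this check, a taxon that merely shares an ancestor with 287 is rejected, which matches the intent of the 'no' examples.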