Closed jonrkarr closed 4 years ago
Is it possible to use the REST API /taxon/is_child/?src_tax_id={}&target_tax_id={} to achieve taxonomic filtering of the results?
The issue with using this call is that it would require many synchronous API calls, which would slow down the page.
With this API, the workflow would be as follows: 1) The frontend would call the API to get the total data. 2) Afterwards, the frontend would have to parse the data to find the taxonomic IDs, and would have to call the API up to 8 times per unique taxonomic ID (to see whether the ID is a child of each of the taxonomic nodes).
Is there a way this computing can be done on the backend?
In the Metabolite category, the target organism is a constant given by the user, which is Escherichia coli E1002 in the case of the screenshot. The same applies to the other categories as far as taxonomic filtering goes. When the user moves the slider down to "Proteobacteria", the table displays only the organisms that are children of Proteobacteria. This works because the list of source organisms' NCBI Taxonomy IDs is known, so one only needs to use the API mentioned to check whether a data point should be filtered out or retained. I can improve the utility by changing the source organism to be a list, with the return value being a list of boolean values. For instance, if one wants to check whether the source organisms with IDs [0,1,2,3,4] are children of target organism 5, one can use the improved API and get a hypothetical return value of the form [True, True, False, True, False], which indicates that 0 is a child of 5, 1 is a child of 5, and so on.
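A minimal sketch of the list-based check described above, assuming each source organism's ancestor lineage is already available as a list of taxonomy IDs. The function name `are_children`, the `lineages` mapping, and the toy IDs are all hypothetical and are not the actual Datanator API.

```python
from typing import Dict, List


def are_children(src_tax_ids: List[int], target_tax_id: int,
                 lineages: Dict[int, List[int]]) -> List[bool]:
    """Return one boolean per source ID: True if the target is among its ancestors."""
    return [target_tax_id in lineages.get(src, []) for src in src_tax_ids]


# Toy lineages (hypothetical IDs): each key maps to the IDs of its ancestors.
lineages = {0: [9, 5], 1: [5], 2: [9], 3: [9, 5, 7], 4: []}

result = are_children([0, 1, 2, 3, 4], 5, lineages)
print(result)  # [True, True, False, True, False]
```

One batched call like this would replace one request per source organism, but it would still have to be repeated for every slider position.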
This might work; however, we only really need to do this calculation once (rather than every time the user moves the slider). We could pre-compute the proximity a single time on the backend and send it along with the data. Is there an advantage to computing this multiple times on the frontend?
This could be implemented in one API call that is separate from the calls to retrieve data:
Here's an example input
https://api.datanator.info/taxon/get_distance/
?ref_tax_id=562 # Escherichia coli
&alt_tax_id=1314 # Streptococcus pyogenes
&alt_tax_id=1423 # Bacillus subtilis
...
and an example output
{
1314: <distance from 562 to 1314>,
1423: <distance from 562 to 1423>,
...
}
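As a rough illustration of the proposed request, the query string with a repeated `alt_tax_id` parameter could be assembled as below. The endpoint is only the one proposed in this thread, and the helper name `build_distance_url` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint as proposed in this thread; it may not exist in this exact form.
BASE_URL = "https://api.datanator.info/taxon/get_distance/"


def build_distance_url(ref_tax_id, alt_tax_ids):
    """Build one URL that asks for the distance from ref_tax_id to every alt_tax_id."""
    # A list of pairs lets the same key (alt_tax_id) appear more than once.
    params = [("ref_tax_id", ref_tax_id)] + [("alt_tax_id", t) for t in alt_tax_ids]
    return BASE_URL + "?" + urlencode(params)


url = build_distance_url(562, [1314, 1423])
print(url)
```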
The JavaScript logic can be encapsulated into the TaxonomyFilter. This can be executed asynchronously, displaying a spinner until the taxonomic distances and filtering are available to the user.
This design has these benefits:
If we implement the filtering this way, the additional changes may be necessary
I don't think that the Javascript logic can be fully encapsulated into TaxonomyFilter.
The filter needs there to be a corresponding column in AG-Grid that has the information of the taxonomic proximity (this is how it filters by each individual row, right now it is called "taxonomic proximity"). If we would implement it with this method, it would need to update the grid as well (https://www.ag-grid.com/javascript-grid-data-update/).
I wanted to modularize so we could automatically support taxonomic filtering as more data types are incorporated. The API has been changed. Currently, it functions as such:
Including the distance information with the result would reduce the number of API calls to 1 (since we only include the canonical ranks, this particular API would otherwise be called at most 8 times). However, calculating the distance between two organisms is a much more complex procedure because it requires identifying the common ancestor first. In terms of responsiveness, I believe this is a better approach.
@yosefdroth Correct, the logic needs to be encapsulated into DataTable as a post-processing step in formatData that transforms the result of this.props["format-data"](rawData).
@lzy7071 It would be helpful to have an endpoint that returns distances rather than is_child information.
I'll write a separate API for distance information.
/taxon/canon_rank_common_distance/?org_0={}&org_1={} will return an object in the format of {'org_0': distance_0, 'org_1': distance_1}, where the distances are the distance between the organism and their canonically-ranked common ancestor.
/kinlaw_entry/?entry_id={}&target_organism={}&last_id={} returns documents with taxon distance information. The field name is "taxon_distance." The distance here is consistent with the distance used in protein and metabolite categories.
/rna/halflife/get_info_by_protein_name/?protein_name={}&_from=0&size=10&taxon_distance=true&ncbi_taxonomy_id={} returns documents with taxon distance nested in subdocuments of halflives. The field name is "taxon_distance"
Thanks! For reactions, would you be able to add the taxon distance information to /reactions/kinlaw_by_name/ as well? The frontend currently uses /reactions/kinlaw_by_name/ because it makes the calls based on substrate and product names, rather than the reaction ID.
For RNA half life, can the call accept the name of the species, rather than the NCBI ID? When the user inputs the name of an organism, the frontend does not know the NCBI ID of the organism, and will have no way to format the URL necessary for the current API call.
Done. /reactions/kinlaw_by_name/ now returns the taxon distance information as well.
Done. The RNA half-life call now accepts the species name in place of the NCBI ID.
How is the taxonomic distance being calculated? I think there may be an error
I am checking for ATP + AMP => ADP in Homo sapiens. It gives Gallus gallus a taxon_distance of 16, but gives Escherichia coli a taxon_distance of 7. E coli should be ranked as farther away.
They were calculated using the same get_common_ancestor method used in front_end_query, which is the method used by the metabolites API. Gallus gallus appears to have a longer distance because there are non-canonical ranks between Gallus gallus and the root of the tree. If you need only the canonically ranked distance, a different method should be used.
Can the backend return the number of nodes from the target organism to the lowest common ancestor (regardless of the number of nodes from the lowest common ancestor to the experimental organism)?
For example, with:
It should return 2.
If we return the total path, the frontend has no way to figure out what's actually more closely related (e.g. the frontend won't know Gallus is closer than E coli.)
Can it also only return the canonically ranked nodes? Meaning, any non-canonical node between the target and the parent should not be counted. If the lowest common node is non-canonical itself, then it should be rounded up.
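The request above could be sketched as follows. This is only an illustration under stated assumptions: the set of canonical ranks, the function name `canonical_distance_to_common`, the toy lineage, and the convention that counting starts at the target's first canonical ancestor are all hypothetical, not the backend's actual method. Walking up from the target and stopping at the first canonical node that the observed organism also has gives the "round up" behavior, because every ancestor of the lowest common ancestor is itself shared.

```python
from typing import List, Optional, Set, Tuple

# Assumption: which ranks count as canonical.
CANONICAL_RANKS = {"species", "genus", "family", "order", "class", "phylum", "kingdom"}


def canonical_distance_to_common(
    target_ancestors: List[Tuple[str, str]], observed_ancestors: Set[str]
) -> Optional[int]:
    """Count canonical nodes from the target up to the first canonical shared node.

    target_ancestors: (name, rank) pairs ordered from the target's parent to the root.
    observed_ancestors: names of all ancestors of the observed organism.
    Returns None when no canonically ranked ancestor is shared.
    """
    steps = 0
    for name, rank in target_ancestors:
        if rank not in CANONICAL_RANKS:
            continue  # non-canonical nodes are never counted
        steps += 1
        if name in observed_ancestors:
            return steps  # lowest common ancestor, or the canonical node just above it
    return None


# Toy lineage (hypothetical names). The lowest common node is the unranked "clade_X".
target = [("genus_A", "genus"), ("clade_X", "no rank"),
          ("family_F", "family"), ("root", "no rank")]
observed = {"genus_B", "clade_X", "family_F", "root"}

distance = canonical_distance_to_common(target, observed)
print(distance)  # 2: genus_A, then family_F (rounded up from clade_X)
```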
In that case, the method used by front_end_query should also be changed. It has worked so far probably because "E. coli" and "S. cere" happen to have only canonically ranked ancestors.
You are right, that one needs to be changed too. This is to allow the "taxonomic similarity" column to use names like "Genus" and "Species" rather than the actual number. At the moment, there is no way to convert the number to a rank.
Same calls, only "taxon_distance" is now an object. The information it contains should be fairly self-explanatory. The metabolite documents each now have an additional object named "canon_taxon_distance" with the same information, which can be used for taxon filters in the metabolite category.
For the metabolite calls, the canon rank does not find a common rank between species. I think it is a bug. For example:
In both escherichia coli and yeast, it returns a canon_taxon_distance of -1 because "No common ancestor"
The ancestors for homo sapiens with taxon id 9606 are: 0: "cellular organisms", 1: "Eukaryota", 2: "Opisthokonta", 3: "Metazoa", 4: "Eumetazoa", 5: "Bilateria", 6: "Deuterostomia", 7: "Chordata", 8: "Craniata", 9: "Vertebrata", 10: "Gnathostomata", 11: "Teleostomi", 12: "Euteleostomi", 13: "Sarcopterygii", 14: "Dipnotetrapodomorpha", 15: "Tetrapoda", 16: "Amniota", 17: "Mammalia", 18: "Theria", 19: "Eutheria", 20: "Boreoeutheria", 21: "Euarchontoglires", 22: "Primates", 23: "Haplorrhini", 24: "Simiiformes", 25: "Catarrhini", 26: "Hominoidea", 27: "Hominidae", 28: "Homininae", 29: "Homo"
The ancestors for Escherichia coli with taxon id 562 are: 0: "cellular organisms", 1: "Bacteria", 2: "Proteobacteria", 3: "Gammaproteobacteria", 4: "Enterobacterales", 5: "Enterobacteriaceae", 6: "Escherichia"
The ancestors for Saccharomyces cerevisiae with taxon id 4932 are: 0: "cellular organisms", 1: "Eukaryota", 2: "Opisthokonta", 3: "Fungi", 4: "Dikarya", 5: "Ascomycota", 6: "saccharomyceta", 7: "Saccharomycotina", 8: "Saccharomycetes", 9: "Saccharomycetales", 10: "Saccharomycetaceae", 11: "Saccharomyces"
As such, the closest common ancestor between homo sapiens and E. coli is cellular organisms, which has a rank of no rank, not a canonical rank. Therefore E. coli and homo sapiens have no canonically-ranked common ancestor.
The closest common ancestor between homo sapiens and Saccharomyces cerevisiae is Opisthokonta, which has a rank of no rank, also not a canonical rank. Therefore S. cerevisiae and homo sapiens have no canonically-ranked common ancestor.
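The comparison above can be reproduced with a short sketch that walks the three lineages listed in this comment from the root downward. The function name `closest_common_ancestor` is hypothetical and is not the backend's get_common_ancestor. The human lineage is cut off after "Metazoa" here because the lineages already diverge before that point.

```python
from typing import List, Optional

# Lineages copied from this comment, root first (human lineage truncated after "Metazoa").
HOMO_SAPIENS = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa"]
E_COLI = ["cellular organisms", "Bacteria", "Proteobacteria", "Gammaproteobacteria",
          "Enterobacterales", "Enterobacteriaceae", "Escherichia"]
S_CEREVISIAE = ["cellular organisms", "Eukaryota", "Opisthokonta", "Fungi", "Dikarya",
                "Ascomycota", "saccharomyceta", "Saccharomycotina", "Saccharomycetes",
                "Saccharomycetales", "Saccharomycetaceae", "Saccharomyces"]


def closest_common_ancestor(a: List[str], b: List[str]) -> Optional[str]:
    """Return the deepest node shared by two root-first lineages, or None."""
    common = None
    for x, y in zip(a, b):
        if x != y:
            break  # the lineages diverge here
        common = x
    return common


human_vs_ecoli = closest_common_ancestor(HOMO_SAPIENS, E_COLI)
human_vs_yeast = closest_common_ancestor(HOMO_SAPIENS, S_CEREVISIAE)
print(human_vs_ecoli)  # cellular organisms
print(human_vs_yeast)  # Opisthokonta
```

Both results are nodes with a rank of no rank, which is why a lookup restricted to canonical ranks reports no common ancestor for either pair.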
There is also -1 between Saccharomyces cerevisiae and homo sapiens:
canon_taxon_distance | |
---|---|
Saccharomyces cerevisiae | -1 |
homo sapiens | -1 |
reason | "No common ancestor" |
That value should be a positive number because they are both eukaryotes. It should round up to the nearest canonical rank. Otherwise there is no way to distinguish yeast from E. coli (even though yeast is more closely related to homo sapiens).
I'll include superkingdom as a canonical rank.
Can the protein taxonomic search be updated? First off, I think there is a bug with the current implementation. For example, let's say I look up phosphofructokinase for Escherichia coli using the KEGG ID (K00850).
It has Salmonella listed in the second list of documents, but the taxonomic distance should be higher (the second group should only include organisms in the genus Escherichia).
Secondly, the grouping of the proteins is done by non-canonical rankings as well. Can the groups correspond to canonical ranks only?
The distance here is also noncanonical. The first and second issues are essentially the same problem. I'll update the function.
Superkingdom is now included as a canonical rank. Check link to verify.
I clicked on the link. I'm still seeing the same -1 rank:
"canon_taxon_distance": { "Escherichia coli": -1, "homo sapiens": -1, "reason": "No common ancestor" },
Because E. coli and Homo sapiens have no canonically ranked common ancestors, not even up to the superkingdom level. Check S. cere to see the change.
I have changed the URL schema for /proteins/proximity_abundance/proximity_abundance_kegg/. The main points are:
- The depth variable is removed.
- distance starts at 1 because even if two species are the same, they still need to take at least 1 step to the closest common ancestor.

Use the link to try a few combinations and see if there is any error.
Thanks! This is great!
I think there may be an error:
This is K00850 anchored with E. coli. It has the E. coli abundance information in the third group, with a distance of 3, but it should be first. https://api.datanator.info/proteins/proximity_abundance/proximity_abundance_kegg/?kegg_id=K00850&distance=40&anchor=escherichia%20coli
When I make the anchor homo sapiens, it has the information in the second group. https://api.datanator.info/proteins/proximity_abundance/proximity_abundance_kegg/?kegg_id=K00850&distance=40&anchor=homo%20sapiens
I changed the behavior so now:
- E. coli and E coli k-12 have a distance of 2.
- homo sapiens and homo sapiens have a distance of 1.

E coli to E coli k-12 should have a distance of 1. It should be the distance from the target to the common node (as opposed to from the observed to the common).
This makes sense because if the user wants E coli, then E coli k-12 is as related as it can possibly be (because E coli includes all the substrains).
Done.
I may have found an error with the protein taxonomic distance.
With K00900, we do have abundance data: https://api.datanator.info/proteins/proximity_abundance/proximity_abundance_kegg/?kegg_id=K00900&distance=40
However, when the query is anchored with E coli, then the abundance data does not show up https://api.datanator.info/proteins/proximity_abundance/proximity_abundance_kegg/?kegg_id=K00900&distance=40&anchor=Escherichia%20coli
It's due to the lack of common ancestors between E. coli and the organisms in the database, namely two yeast organisms.
Is there a way we can include the data? We still need to display the data.
Perhaps we could add it at the end (treat cellular life as a common ancestor).
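One possible way to "add it at the end" is to sort the results so that entries without a canonically ranked common ancestor come last. This is only a sketch: the record shape and the function name `sort_by_proximity` are hypothetical, and it assumes the -1 sentinel shown earlier in this thread marks "No common ancestor".

```python
from typing import Dict, List


def sort_by_proximity(records: List[Dict]) -> List[Dict]:
    """Order records by distance, keeping those with no common ancestor (-1) last."""
    # The boolean sorts False before True, so valid distances come first.
    return sorted(records, key=lambda r: (r["distance"] < 0, r["distance"]))


# Hypothetical records; only the -1 sentinel is taken from the thread.
records = [
    {"organism": "organism A", "distance": 3},
    {"organism": "organism B", "distance": -1},
    {"organism": "organism C", "distance": 1},
]

ordered = [r["organism"] for r in sort_by_proximity(records)]
print(ordered)  # ['organism C', 'organism A', 'organism B']
```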
When was the endpoint called and what was it for?
I think it makes sense that E. coli and yeasts don't share a common ancestor, which just results in no data available when a user performs a proximity search for proteins with KEGG ID K00900 in E. coli.
This is called if a user looks up 6-phosphofructo-2-kinase -- but this problem will arise any time the anchor organism is in a different superkingdom than the observed organism.
I think we should always display all the data that we have. So if a user wants the anchor organism to be E. coli, we should still display all the data we have that comes from eukaryotes, and we can leave it to the user to decide whether it's relevant. Otherwise, we will end up displaying different data depending on whether an organism is entered or not. This might add some confusion.
Even though cellular life is a non-canonical rank, E. coli and yeast do actually share a common ancestor. It was just farther back in the evolutionary past.
I have some reservations about this somewhat ad hoc approach, where we essentially change the definition of canonical ranks by adding cellular organisms, which is defined as no rank, to the canonical ranks. Although I agree that we should display as much data as possible, I still think that if the data doesn't fit user intent, it doesn't need to be displayed.
But I did change the behavior of the endpoint so now it shows the data. Finding canonical ranks is a fairly low-level method that a lot of other methods use, so please be on the lookout for errors or mistakes and let me know.
https://api.datanator.info/reactions/kinlaw_by_name/?substrates=Glucose&products=Glucose&_from=0&size=1000&bound=tight