Add taxonomic distance information to reaction rate constants endpoint to enable taxonomic filtering

jonrkarr commented 4 years ago

https://api.datanator.info/reactions/kinlaw_by_name/?substrates=Glucose&products=Glucose&_from=0&size=1000&bound=tight

lzy7071 commented 4 years ago

Is it possible to use the REST API /taxon/is_child/?src_tax_id={}&target_tax_id={} to achieve taxonomic filtering of the results?

yosefdroth commented 4 years ago

The issue with using this call is that it would require many synchronous API calls, which would slow down the page.

With this API, the workflow would be as follows: 1) The frontend would call the API to get all of the data. 2) The frontend would then have to parse the data to find the taxonomic IDs, and would have to call the API up to 8 times per unique taxonomic ID (to see whether the ID is a child of each of the taxonomic nodes).

Is there a way this computing can be done on the backend?

lzy7071 commented 4 years ago

In the Metabolite category, the target organism is a constant given by the user, which is Escherichia coli E1002 in the case of the screenshot. The same applies to the other categories as far as taxonomic filtering goes. [Screenshot from 2020-02-25 10-49-29] When the user moves the slider down to "Proteobacteria", the table displays only the organisms that are children of Proteobacteria. This works because the list of source organisms' NCBI Taxonomy IDs is known, so one only needs to use the API mentioned above to check whether each data point should be filtered out or retained. I can improve the utility by changing the source organism to be a list, with the return value being a list of boolean values. For instance, to check whether each of the source organism IDs [0,1,2,3,4] is a child of target organism 5, one can use the improved API and get a hypothetical return value of the form [True, True, False, True, False], which indicates that 0 is a child of 5, 1 is a child of 5, 2 is not, and so on.
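A minimal sketch of how a caller might apply such a list of booleans, assuming the response preserves the order of the submitted IDs. The function name `keep_children` and the row layout are hypothetical, not part of the API:

```python
def keep_children(rows, is_child_flags):
    """Keep only the rows whose source organism is a child of the target taxon.

    rows: list of dicts, one per data point (hypothetical layout).
    is_child_flags: list of booleans, assumed to be in the same order as rows.
    """
    if len(rows) != len(is_child_flags):
        raise ValueError("expected exactly one flag per row")
    return [row for row, flag in zip(rows, is_child_flags) if flag]


# IDs 0..4 are the placeholder source organisms from the example above.
rows = [{"tax_id": i} for i in [0, 1, 2, 3, 4]]
flags = [True, True, False, True, False]
kept = keep_children(rows, flags)
print([row["tax_id"] for row in kept])  # [0, 1, 3]
```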

yosefdroth commented 4 years ago

This might work. However, we only really need to do this calculation once (rather than every time the user moves the slider). We could pre-compute the proximity a single time on the backend and send it along with the data. Is there an advantage to computing this multiple times on the frontend?

jonrkarr commented 4 years ago

This could be implemented in one API call that is separate from the calls to retrieve data:

The call needs to accept the NCBI Taxonomy id of a single reference taxon (i.e. the taxon that the user wants to find information for such as 562 for Escherichia coli)
The call needs to accept a list of the NCBI Taxonomy ids of alternative taxa (i.e. the taxa in which the experimental data that we've aggregated was observed such as 1314 for Streptococcus pyogenes)
The call needs to return a dictionary of the taxonomic distance from reference taxon to each of the alternative taxa

Here's an example input

https://api.datanator.info/taxon/get_distance/
 ?ref_tax_id=562 # Escherichia coli
 &alt_tax_id=1314 # Streptococcus pyogenes
 &alt_tax_id=1423 # Bacillus subtilis
 ...

and an example output

{
   1314: <distance from 562 to 1314>,
   1423: <distance from 562 to 1423>,
   ...
}
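As a rough sketch of how a client could build this request and use the result, assuming the endpoint accepts repeated `alt_tax_id` parameters as in the example input. The endpoint is only proposed here, and the helper names `build_distance_url` and `within_distance` are hypothetical:

```python
from urllib.parse import urlencode

# Proposed endpoint from the example above; not confirmed to exist.
BASE_URL = "https://api.datanator.info/taxon/get_distance/"


def build_distance_url(ref_tax_id, alt_tax_ids):
    """Build the query URL with one repeated alt_tax_id parameter per taxon."""
    params = [("ref_tax_id", ref_tax_id)]
    params += [("alt_tax_id", tax_id) for tax_id in alt_tax_ids]
    return BASE_URL + "?" + urlencode(params)


def within_distance(distances, max_distance):
    """Return the alternative taxon IDs whose distance is at most max_distance."""
    return sorted(t for t, d in distances.items() if d <= max_distance)


url = build_distance_url(562, [1314, 1423])
print(url)
# https://api.datanator.info/taxon/get_distance/?ref_tax_id=562&alt_tax_id=1314&alt_tax_id=1423
```

`within_distance` would then be applied to the returned dictionary each time the user moves the slider, without another API call.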

The JavaScript logic can be encapsulated into the TaxonomyFilter. It can be executed asynchronously, with a spinner displayed until the taxonomic distances and filtering are available to the user.

This design has these benefits:

Only requires 1 additional API call per page. (More specifically, one call per data table.)
The query code would be separate from the methods that retrieve experimental data. This avoids complicating the API calls that retrieve data, making it easier to continue to develop them and implement more.
There should already be code for the distance calculations since the metabolite concentration endpoint already returns this information.
The JavaScript code would be modular from the handling of individual types of data. As we add more types of data, taxonomic filtering would automatically be supported.

If we implement the filtering this way, the following additional changes may be necessary:

We can also simplify the current metabolite concentration query. This would no longer need to compute taxonomic distances.
If the data queries aren't already doing so, they need to return the NCBI Taxonomy id corresponding to each experimental observation.

yosefdroth commented 4 years ago

I don't think that the Javascript logic can be fully encapsulated into TaxonomyFilter.

The filter needs there to be a corresponding column in AG-Grid that has the information of the taxonomic proximity (this is how it filters by each individual row, right now it is called "taxonomic proximity"). If we would implement it with this method, it would need to update the grid as well (https://www.ag-grid.com/javascript-grid-data-update/).

lzy7071 commented 4 years ago

I wanted to modularize so we could automatically support taxonomic filtering as more data types are incorporated. The API has been changed. Currently, it functions as such:

https://api.datanator.info/taxon/is_child/?src_tax_ids=A&src_tax_ids=B&src_tax_ids=C&target_tax_id=D
It will return a list of booleans to indicate if A, B, C is the child of D, e.g. [True, False, True]
I imagine each time the user moves the filter, the API will be called, with src_tax_ids being the IDs of organisms in the table and target_tax_id being the target organism the user typed in the search box. The False result in the list can be filtered out.
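For illustration, the repeated `src_tax_ids` parameter can be produced with the standard library. This is only a sketch of the client side; the helper name `build_is_child_url` is hypothetical:

```python
from urllib.parse import urlencode


def build_is_child_url(src_tax_ids, target_tax_id):
    """Build an is_child URL with one src_tax_ids parameter per source organism."""
    query = urlencode(
        {"src_tax_ids": src_tax_ids, "target_tax_id": target_tax_id},
        doseq=True,  # expand the list into repeated src_tax_ids=... pairs
    )
    return "https://api.datanator.info/taxon/is_child/?" + query


# 562, 1314 and 1423 are the taxon IDs used earlier in this thread;
# 2 is the NCBI Taxonomy ID for Bacteria.
print(build_is_child_url([562, 1314, 1423], 2))
# https://api.datanator.info/taxon/is_child/?src_tax_ids=562&src_tax_ids=1314&src_tax_ids=1423&target_tax_id=2
```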

Including the distance information with the result would reduce the number of API calls to 1 (because we only include the canonical ranks, this particular API would otherwise be called at most 8 times). However, calculating the distance between two organisms is a much more complex procedure because it first requires identifying the common ancestor. In terms of responsiveness, I believe the is_child approach is better.

jonrkarr commented 4 years ago

@yosefdroth Correct, the logic needs to be encapsulated into DataTable as a post-processing step in formatData that transforms the result of this.props["format-data"](rawData).

@lzy7071 It would be helpful to have an endpoint that returns distances rather than is_child information.

lzy7071 commented 4 years ago

I'll write a separate API for distance information.

lzy7071 commented 4 years ago

/taxon/canon_rank_common_distance/?org_0={}&org_1={} will return an object in the format of {'org_0': distance_0, 'org_1': distance_1}, where the distances are the distance between the organism and their canonically-ranked common ancestor.

lzy7071 commented 4 years ago

/kinlaw_entry/?entry_id={}&target_organism={}&last_id={} returns documents with taxon distance information. The field name is "taxon_distance". The distance here is consistent with the distance used in the protein and metabolite categories.

lzy7071 commented 4 years ago

/rna/halflife/get_info_by_protein_name/?protein_name={}&_from=0&size=10&taxon_distance=true&ncbi_taxonomy_id={} returns documents with taxon distance nested in subdocuments of halflives. The field name is "taxon_distance"

yosefdroth commented 4 years ago

Thanks! For reactions, would you be able to add the taxon distance information to /reactions/kinlaw_by_name/ as well? The frontend currently uses /reactions/kinlaw_by_name/ because it makes the calls based on substrate and product names, rather than the reaction ID.

yosefdroth commented 4 years ago

For RNA half life, can the call accept the name of the species, rather than the NCBI ID? When the user inputs the name of an organism, the frontend does not know the NCBI ID of the organism, and will have no way to format the URL necessary for the current API call.

lzy7071 commented 4 years ago

Thanks! For reactions, would you be able to add the functionality to the taxon distance information to /reactions/kinlaw_by_name/ as well? The frontend currently uses /reactions/kinlaw_by_name/ because it makes the calls based on substrate and product names, rather than the reaction ID.

Done.

For RNA half life, can the call accept the name of the species, rather than the NCBI ID? When the user inputs the name of an organism, the frontend does not know the NCBI ID of the organism, and will have no way to format the URL necessary for the current API call.

Done.

yosefdroth commented 4 years ago

How is the taxonomic distance being calculated? I think there may be an error

In this example - http://api.datanator.info/reactions/kinlaw_by_name/?substrates=atp&substrates=amp&products=adp&_from=0&size=100&bound=loose&taxon_distance=true&species=homo%20sapiens

I am checking for ATP + AMP => ADP in Homo sapiens. It gives Gallus gallus a taxon_distance of 16, but gives Escherichia coli a taxon_distance of 7. E coli should be ranked as farther away.

lzy7071 commented 4 years ago

The distances were calculated using the same get_common_ancestor method used in front_end_query, which is the method used by the metabolites API. Gallus gallus appears to have a longer distance because there are non-canonical ranks between Gallus gallus and the root of the tree. If you need only the canonically ranked distance, a different method should be used.

yosefdroth commented 4 years ago

Can the backend return the number of nodes from the target organism to the lowest common ancestor (regardless of the number of nodes from the lowest common ancestor to the experimental organism)?

For example, with:

It should return 2.

If we return the total path, the frontend has no way to figure out what's actually more closely related (e.g. the frontend won't know Gallus is closer than E coli.)

Can it also only return the canonically ranked nodes? Meaning, any non-canonical node between the target and the parent should not be counted. If the lowest common node is non-canonical itself, then it should be rounded up.
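One way to read this request, sketched under stated assumptions: count only canonically ranked nodes while walking up from the target, and stop at the first canonically ranked node that the observed organism also has in its lineage (which rounds a non-canonical common ancestor up). The lineages below are abbreviated, the rank labels are supplied for illustration from NCBI Taxonomy, and the set of canonical ranks and the function name are assumptions, not the backend's actual code:

```python
# Assumed set of canonical ranks; the backend's actual set may differ.
CANONICAL_RANKS = {"species", "genus", "family", "order", "class", "phylum", "kingdom"}

# Abbreviated root-first lineages as (name, rank) pairs, ending with the organism.
HUMAN = [
    ("Eukaryota", "superkingdom"), ("Metazoa", "kingdom"), ("Chordata", "phylum"),
    ("Amniota", "no rank"), ("Mammalia", "class"), ("Primates", "order"),
    ("Hominidae", "family"), ("Homo", "genus"), ("Homo sapiens", "species"),
]
CHICKEN = [
    ("Eukaryota", "superkingdom"), ("Metazoa", "kingdom"), ("Chordata", "phylum"),
    ("Amniota", "no rank"), ("Aves", "class"), ("Galliformes", "order"),
    ("Phasianidae", "family"), ("Gallus", "genus"), ("Gallus gallus", "species"),
]


def canonical_distance(target, observed, canonical_ranks=CANONICAL_RANKS):
    """Count canonical nodes from the target's parent up to the first canonical
    node shared with the observed lineage; return None if there is none."""
    observed_names = {name for name, _ in observed}
    steps = 0
    for name, rank in reversed(target[:-1]):  # walk from the parent toward the root
        if rank not in canonical_ranks:
            continue  # non-canonical nodes (e.g. Amniota) are not counted
        steps += 1
        if name in observed_names:
            return steps
    return None


print(canonical_distance(HUMAN, CHICKEN))  # 5 (genus, family, order, class, phylum)
print(canonical_distance(HUMAN, HUMAN))    # 1
```

Matching on names keeps the sketch short; matching on NCBI Taxonomy IDs would be more robust.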

lzy7071 commented 4 years ago

In that case, the method used by front_end_query should also be changed. It has worked so far probably because "E. coli" and "S. cere" happen to have only canonically ranked ancestors.

yosefdroth commented 4 years ago

You are right, that one needs to be changed too. This is to allow the "taxonomic similarity" column to use names like "Genus" and "Species" rather than the actual number. At the moment, there is no way to convert the number to a rank.

lzy7071 commented 4 years ago

Same calls, only "taxon_distance" is now an object. The information it contains should be fairly self-explanatory. Each metabolite document now has an additional object named "canon_taxon_distance" with the same information, which can be used for taxon filters in the metabolite category.

yosefdroth commented 4 years ago

For the metabolite calls, the canon rank does not find a common rank between species. I think it is a bug. For example:

https://api.datanator.info/metabolites/concentration/?metabolite=Adenosine%20triphosphate&abstract=true&species=homo%20sapiens

In both escherichia coli and yeast, it returns a canon_taxon_distance of -1 because "No common ancestor"

lzy7071 commented 4 years ago

The ancestors for Homo sapiens with taxon id 9606 are: 0: "cellular organisms", 1: "Eukaryota", 2: "Opisthokonta", 3: "Metazoa", 4: "Eumetazoa", 5: "Bilateria", 6: "Deuterostomia", 7: "Chordata", 8: "Craniata", 9: "Vertebrata", 10: "Gnathostomata", 11: "Teleostomi", 12: "Euteleostomi", 13: "Sarcopterygii", 14: "Dipnotetrapodomorpha", 15: "Tetrapoda", 16: "Amniota", 17: "Mammalia", 18: "Theria", 19: "Eutheria", 20: "Boreoeutheria", 21: "Euarchontoglires", 22: "Primates", 23: "Haplorrhini", 24: "Simiiformes", 25: "Catarrhini", 26: "Hominoidea", 27: "Hominidae", 28: "Homininae", 29: "Homo"

The ancestors for Escherichia coli with taxon id 562 are: 0: "cellular organisms", 1: "Bacteria", 2: "Proteobacteria", 3: "Gammaproteobacteria", 4: "Enterobacterales", 5: "Enterobacteriaceae", 6: "Escherichia"

The ancestors for Saccharomyces cerevisiae with taxon id 4932 are: 0: "cellular organisms", 1: "Eukaryota", 2: "Opisthokonta", 3: "Fungi", 4: "Dikarya", 5: "Ascomycota", 6: "saccharomyceta", 7: "Saccharomycotina", 8: "Saccharomycetes", 9: "Saccharomycetales", 10: "Saccharomycetaceae", 11: "Saccharomyces"

As such, the closest common ancestor between Homo sapiens and E. coli is "cellular organisms", which has a rank of "no rank", not a canonical rank. Therefore E. coli and Homo sapiens have no canonically ranked common ancestor.

The closest common ancestor between Homo sapiens and Saccharomyces cerevisiae is Opisthokonta, which also has a rank of "no rank", not a canonical rank. Therefore S. cerevisiae and Homo sapiens have no canonically ranked common ancestor.
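The closest common ancestors named above can be reproduced by taking the longest shared prefix of the root-first lineages. A small sketch, with the lineages truncated after the point where they diverge and a hypothetical function name:

```python
# Root-first lineages, truncated to the first few ancestors listed above.
HUMAN = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa"]
ECOLI = ["cellular organisms", "Bacteria", "Proteobacteria"]
YEAST = ["cellular organisms", "Eukaryota", "Opisthokonta", "Fungi"]


def closest_common_ancestor(lineage_a, lineage_b):
    """Return the last node shared by two root-first lineages, or None."""
    common = None
    for node_a, node_b in zip(lineage_a, lineage_b):
        if node_a != node_b:
            break  # the lineages diverge here
        common = node_a
    return common


print(closest_common_ancestor(HUMAN, ECOLI))  # cellular organisms
print(closest_common_ancestor(HUMAN, YEAST))  # Opisthokonta
```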

yosefdroth commented 4 years ago

There is also -1 between Saccharomyces cerevisiae and homo sapiens:

canon_taxon_distance
Saccharomyces cerevisiae	-1
homo sapiens	-1
reason	"No common ancestor"

That value should be a positive number because they are both eukaryotes. It should round up to the nearest canonical rank. Otherwise there is no way to distinguish yeast from E. coli (even though yeast is more closely related to Homo sapiens).

lzy7071 commented 4 years ago

I'll include superkingdom as a canonical rank.
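A sketch of the effect this change should have, assuming the canonical set previously stopped at kingdom. The lineages are truncated, the rank labels are supplied for illustration from NCBI Taxonomy, and the function name is hypothetical:

```python
BASE_RANKS = {"species", "genus", "family", "order", "class", "phylum", "kingdom"}
WITH_SUPERKINGDOM = BASE_RANKS | {"superkingdom"}

# Truncated root-first lineages as (name, rank) pairs.
HUMAN = [("cellular organisms", "no rank"), ("Eukaryota", "superkingdom"),
         ("Opisthokonta", "no rank"), ("Metazoa", "kingdom")]
YEAST = [("cellular organisms", "no rank"), ("Eukaryota", "superkingdom"),
         ("Opisthokonta", "no rank"), ("Fungi", "kingdom")]
ECOLI = [("cellular organisms", "no rank"), ("Bacteria", "superkingdom"),
         ("Proteobacteria", "phylum")]


def canonical_common_ancestor(lineage_a, lineage_b, canonical_ranks):
    """Return the lowest shared node whose rank is canonical, or None."""
    result = None
    for node_a, node_b in zip(lineage_a, lineage_b):
        if node_a != node_b:
            break  # past the closest common ancestor
        name, rank = node_a
        if rank in canonical_ranks:
            result = name
    return result


print(canonical_common_ancestor(HUMAN, YEAST, BASE_RANKS))         # None
print(canonical_common_ancestor(HUMAN, YEAST, WITH_SUPERKINGDOM))  # Eukaryota
print(canonical_common_ancestor(HUMAN, ECOLI, WITH_SUPERKINGDOM))  # None
```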

yosefdroth commented 4 years ago

Can the protein taxonomic search be updated? First off, I think there is a bug with the current implementation. For example, let's say I look up phosphofructokinase for Escherichia coli using its KEGG ID (K00850).

https://api.datanator.info/proteins/proximity_abundance/proximity_abundance_kegg/?kegg_id=K00850&distance=100&depth=100&anchor=Escherichia%20coli

It has Salmonella listed in the second list of documents, but the taxonomic distance should be higher (the second group should only include organisms in the genus Escherichia).

Secondly, the grouping of the proteins is done by non-canonical ranks as well. Can the groups correspond to canonical ranks only?

lzy7071 commented 4 years ago

The distance here is also noncanonical. The first and second issues are essentially the same problem. I'll update the function.

lzy7071 commented 4 years ago

Superkingdom is now included as a canonical rank. Check link to verify.

yosefdroth commented 4 years ago

I clicked on the link. I'm still seeing the same -1 rank:

"canon_taxon_distance": { "Escherichia coli": -1, "homo sapiens": -1, "reason": "No common ancestor" },

lzy7071 commented 4 years ago

That is because E. coli and Homo sapiens have no canonically ranked common ancestor, not even at the superkingdom level. Check S. cerevisiae to see the change.
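On the consuming side, the -1 value could be mapped to "no canonical common ancestor" instead of being treated as a real distance. A minimal sketch based on the object quoted earlier in this thread; the helper name `target_distance` is hypothetical, and treating -1 as a sentinel is an assumption drawn from that response:

```python
import json


def target_distance(canon_taxon_distance, organism_name):
    """Return the organism's distance, or None when the API reports -1."""
    value = canon_taxon_distance.get(organism_name, -1)
    return None if value == -1 else value


# Object copied from the response quoted in this thread.
payload = json.loads(
    '{"Escherichia coli": -1, "homo sapiens": -1, "reason": "No common ancestor"}'
)
print(target_distance(payload, "homo sapiens"))  # None
```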

lzy7071 commented 4 years ago

I have changed the URL schema for /proteins/proximity_abundance/proximity_abundance_kegg/. The main points are:

The depth variable is removed.
distance starts at 1 because, even if two species are the same, they still need to take at least 1 step to reach the closest common ancestor.

Use the link to try a few combinations and see if there are any errors.

yosefdroth commented 4 years ago

Thanks! This is great!

I think there may be an error:

This is K00850 anchored with E. coli. It has the E. coli abundance information in the third group, with a distance of 3, but it should be first. https://api.datanator.info/proteins/proximity_abundance/proximity_abundance_kegg/?kegg_id=K00850&distance=40&anchor=escherichia%20coli

When I make the anchor homo sapiens, it has the information in the second group. https://api.datanator.info/proteins/proximity_abundance/proximity_abundance_kegg/?kegg_id=K00850&distance=40&anchor=homo%20sapiens

lzy7071 commented 4 years ago

I changed the behavior so now:

E. Coli and E coli k-12 have a distance of 2.
homo sapiens and homo sapiens have a distance of 1.

yosefdroth commented 4 years ago

E coli to E coli k-12 should have a distance of 1. It should be distance from the target to the common node (as opposed to the observed to the common).

This makes sense because if the user wants E coli, then E coli k-12 is as related as it can possibly be (because E coli includes all the substrains).
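The difference between the two measurements can be sketched as follows. This assumes a counting convention in which the starting node itself counts as 1, which is one reading that reproduces the values mentioned in this thread; it is not necessarily how the backend counts. The E. coli lineage is the one listed earlier, with K-12 added as its child:

```python
ECOLI = ["cellular organisms", "Bacteria", "Proteobacteria", "Gammaproteobacteria",
         "Enterobacterales", "Enterobacteriaceae", "Escherichia", "Escherichia coli"]
ECOLI_K12 = ECOLI + ["Escherichia coli K-12"]
HUMAN_TAIL = ["Homo", "Homo sapiens"]  # truncated lineage, enough for this example


def nodes_to_common(lineage, other_lineage):
    """Count nodes walked from the end of `lineage` up to the first node that
    also appears in `other_lineage`, counting the starting node as 1."""
    other_nodes = set(other_lineage)
    for count, node in enumerate(reversed(lineage), start=1):
        if node in other_nodes:
            return count
    return None


# Target = E. coli, observed = E. coli K-12.
print(nodes_to_common(ECOLI, ECOLI_K12))  # 1 (measured from the target)
print(nodes_to_common(ECOLI_K12, ECOLI))  # 2 (measured from the observed organism)
print(nodes_to_common(HUMAN_TAIL, HUMAN_TAIL))  # 1
```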

lzy7071 commented 4 years ago

Done.

yosefdroth commented 4 years ago

I may have found an error with the protein taxonomic distance.

With K00900, we do have abundance data: https://api.datanator.info/proteins/proximity_abundance/proximity_abundance_kegg/?kegg_id=K00900&distance=40

However, when the query is anchored with E coli, then the abundance data does not show up https://api.datanator.info/proteins/proximity_abundance/proximity_abundance_kegg/?kegg_id=K00900&distance=40&anchor=Escherichia%20coli

It's due to the lack of common ancestors between <E. Coli> and the organisms in the database, namely two yeast organisms.

yosefdroth commented 4 years ago

Is there a way we can include the data? We still need to display the data.

Perhaps we could add it at the end (treat cellular life as a common ancestor).

lzy7071 commented 4 years ago

When was the endpoint called and what was it for? I think it makes sense that E. coli and yeasts don't share a common ancestor, which just results in no data available when a user performs a proximity search for proteins with kegg ID K00900 in E. Coli

yosefdroth commented 4 years ago

This is called if a user looks up 6-phosphofructo-2-kinase -- but this problem will arise any time the anchor organism is in a different superkingdom than the observed organism.

I think we should always display all the data that we have. So if a user wants the anchor organism to be E. coli, we should still display all the data we have that comes from eukaryotes, and we can leave it to the user to decide whether it's relevant. Otherwise, we will end up displaying different data depending on whether an organism is entered or not. This might add some confusion.

Even though cellular life is a non-canonical rank, E. coli and yeast do actually share a common ancestor. It was just farther back in the evolutionary past.

lzy7071 commented 4 years ago

I have some reservations about this somewhat ad hoc approach, in which we essentially change the definition of canonical ranks by adding "cellular organisms", which is defined as "no rank", to the canonical ranks. Although I agree that we should display as much data as possible, I still think that if the data doesn't fit user intent, it doesn't need to be displayed. But I did change the behavior of the endpoint so that it now shows the data. Finding canonical ranks is a fairly low-level method that a lot of other methods use, so please be on the lookout for errors or mistakes and let me know.
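A rough sketch of this kind of fallback, under stated assumptions: when no canonically ranked common ancestor exists but both lineages share the same root, the root is treated as the common ancestor and placed one step beyond the last canonical rank, so that such data sorts last. How the endpoint actually numbers this case is not stated in this thread; the rank labels and the function name are supplied for illustration:

```python
CANONICAL_RANKS = {"species", "genus", "family", "order", "class", "phylum",
                   "kingdom", "superkingdom"}

# Root-first lineages as (name, rank) pairs; the yeast lineage is abbreviated.
ECOLI = [("cellular organisms", "no rank"), ("Bacteria", "superkingdom"),
         ("Proteobacteria", "phylum"), ("Gammaproteobacteria", "class"),
         ("Enterobacterales", "order"), ("Enterobacteriaceae", "family"),
         ("Escherichia", "genus"), ("Escherichia coli", "species")]
YEAST = [("cellular organisms", "no rank"), ("Eukaryota", "superkingdom"),
         ("Fungi", "kingdom"), ("Saccharomyces", "genus"),
         ("Saccharomyces cerevisiae", "species")]


def distance_with_root_fallback(target, observed, canonical_ranks=CANONICAL_RANKS):
    """Canonical distance from the target's parent to a shared canonical node,
    falling back to the shared root when no canonical node is shared."""
    observed_names = {name for name, _ in observed}
    steps = 0
    for name, rank in reversed(target[:-1]):
        if rank not in canonical_ranks:
            continue  # non-canonical nodes are not counted
        steps += 1
        if name in observed_names:
            return steps
    if target[0][0] == observed[0][0]:
        return steps + 1  # assumed convention: root is one step past the last rank
    return None


print(distance_with_root_fallback(ECOLI, YEAST))  # 7
```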

KarrLab / datanator_rest_api

Add taxonomic distance information to reaction rate constants endpoint to enable taxonomic filtering #37