Extend gene data endpoints to genes identified via UniProt ids rather than via KEGG orthology ids

jonrkarr commented 4 years ago

Endpoints that need to be extended:

proteins/proximity_abundance/proximity_abundance_kegg
rna/halflife/get_info_by_ko
rna/modification/get_modifications_by_ko

The first endpoint is the most important to extend. The others could be skipped for now, as all RNAs are identified by KO groups anyway (to my understanding).

From the perspective of the frontend, there are at least two ways to do this. Each would be equally easy to implement in the frontend.

Add separate endpoints for querying by UniProt id
(a) Add a UniProt id argument to each endpoint, (b) require one of the KO number or UniProt id arguments to be provided, and (c) retrieve data as appropriate based on the supplied id -- KO number or UniProt id.
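
A minimal sketch of the second option, assuming the endpoint receives both identifiers as optional arguments. The helper name `resolve_gene_id` and its return shape are hypothetical and not taken from the Datanator codebase; it only illustrates the "require exactly one of the two ids" rule in (b) and the dispatch in (c).

```python
from typing import Optional, Tuple


def resolve_gene_id(ko_number: Optional[str] = None,
                    uniprot_id: Optional[str] = None) -> Tuple[str, str]:
    """Return (id_type, id_value), requiring exactly one identifier.

    Hypothetical helper for illustration only.
    """
    # Reject both "neither supplied" and "both supplied".
    if (ko_number is None) == (uniprot_id is None):
        raise ValueError("Provide exactly one of ko_number or uniprot_id")
    if uniprot_id is not None:
        return ("uniprot_id", uniprot_id)
    return ("ko_number", ko_number)
```

The endpoint handler could then branch on the returned id type to decide which query to run.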

lzy7071 commented 4 years ago

Two endpoints already exist for retrieving information using uniprot_id as the query parameter:

http://api.datanator.info/proteins/precise_abundance/?uniprot_id=Q75QI0 This endpoint retrieves abundance information only
http://api.datanator.info/proteins/meta/meta_combo/?uniprot_id=Q75QI0 This endpoint retrieves everything in our database regarding protein with uniprot_id Q75QI0 (ancestor information is currently projected out).

Are these the functions you had in mind?

jonrkarr commented 4 years ago

This is great. I started implementing this in the frontend. Here's an example.

Metadata

http://api.datanator.info/proteins/meta/meta_combo

This is perfect. I already implemented this into the frontend.

Protein abundance

http://api.datanator.info/proteins/precise_abundance/

Why does the endpoint need the kegg_orthology argument? I think this can be removed.
For consistency with the other data endpoints, the endpoint should have an optional argument for an organism, and calculate taxonomic distances to that organism.
For consistency with the KO protein abundance endpoint, it would be helpful to include these additional keys in the output:
- protein_name
- gene_name
- species_name

RNA half-lives

Are there genes which don't belong to KO groups? If so, we also need a separate endpoint to retrieve half-lives of those genes.

Protein modifications

Same question, are there genes which don't belong to KO groups?

RNA modifications

I don't think another endpoint for individual genes is critical because all of the RNA for which we have modification data belong to KO groups.

lzy7071 commented 4 years ago

Protein abundance

http://api.datanator.info/proteins/precise_abundance/

Why does the endpoint need the kegg_orthology argument? I think this can be removed.

This was built for an either-or situation. uniprot_id takes precedence over kegg_orthology, but a user can also get all the abundance data for a KEGG group by providing a kegg_orthology id. If the user supplies values for both parameters, an error message is returned. I have now taken the option out because it confuses users.

For consistency with the other data endpoints, the endpoint should have an optional argument for an organism, and calculate taxonomic distances to that organism.

For consistency with the KO protein abundance endpoint, it would be helpful to include these additional keys in the output:

protein_name

gene_name

species_name

The endpoint https://api.datanator.info/proteins/precise_abundance/?uniprot_id=Q54JE4&taget_species=homo%20sapiens&taxon_distance=true is up now.
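
For reference, a small sketch of how a client might build this query URL. The function name is hypothetical, and the parameter names (including the `taget_species` spelling) are copied from the example URL above rather than from API documentation, so treat them as assumptions.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.datanator.info/proteins/precise_abundance/"


def precise_abundance_url(uniprot_id, target_species=None):
    """Build a query URL for the UniProt-based abundance endpoint."""
    params = {"uniprot_id": uniprot_id}
    if target_species is not None:
        # Parameter names are taken verbatim from the URL in this thread.
        params["taget_species"] = target_species
        params["taxon_distance"] = "true"
    # quote_via=quote encodes spaces as %20 rather than "+".
    return BASE_URL + "?" + urlencode(params, quote_via=quote)
```

Calling `precise_abundance_url("Q54JE4", "homo sapiens")` should reproduce the URL above; omitting the species yields a plain `uniprot_id` query.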

RNA half-lives

Are there genes which don't belong to KO groups? If so, we also need a separate endpoint to retrieve half-lives of those genes.

Protein modifications

Same question, are there genes which don't belong to KO groups?

Yes to both questions, and definitely a good point! At the moment, though, I think full text search only returns data in which ko_number is not null, but I think I can modify the aggregation step to change that. Ignore the quotes below.

Assuming documents with ko_number == null are returned by /ftx/text_search/gene_ranked_by_ko/, taking protein with uniprot_id = Q75QI0 (http://api.datanator.info/ftx/text_search/num_of_index/?query_message=Q75QI0&index=protein&from_=0&size=10&fields=uniprot_id ) as an example:

User searches for the specific uniprot_id, the document of which has a ko_number field of the value null. We can still use /proteins/precise_abundance/uniprot_id since

User searches for a name, say dehydrogenase, ignoring the results with ko_number. All results with ko_number == null will be aggregated together by full text search (https://api.datanator.info/ftx/text_search/gene_ranked_by_ko/?query_message=dehydrogenase&from_=0&size=10&fields=ko_name&fields=ko_number&fields=gene_name&fields=gene_name_alt&fields=gene_name_orf&fields=gene_name_oln&fields=entrez_id&fields=protein_name&fields=entry_name&fields=uniprot_id&fields=ec_number).

RNA modifications

I don't think another endpoint for individual genes is critical because all of the RNA for which we have modification data belong to KO groups.

Agreed.

jonrkarr commented 4 years ago

The protein abundance endpoint is perfect. I integrated this into the frontend.

lzy7071 commented 4 years ago

RNA half-lives

Are there genes which don't belong to KO groups? If so, we also need a separate endpoint to retrieve half-lives of those genes.

Protein modifications

Same question, are there genes which don't belong to KO groups?

Yes to both questions, and definitely a good point! At the moment, though, I think full text search only returns data in which ko_number is not null, but I think I can modify the aggregation step to change that. Ignore the quotes below.

/ftx/text_search/gene_ranked_by_ko/? now returns genes with no kegg orthology id. Two scenarios:

User searches for a protein using a specific uniprot_id or sequence, e.g. https://testapi.datanator.info/ftx/text_search/gene_ranked_by_ko/?query_message=Q1HN34&from_=0&size=10&fields=uniprot_id&fields=protein_name
- No confusion here as the doc_count in the bucket is 1. One can just use /proteins/precise_abundance/ to proceed.
Scenario 1, where doc_count is 1, is rare. A more likely scenario is when a user searches with an ambiguous string, such as a protein name: http://testapi.datanator.info/ftx/text_search/gene_ranked_by_ko/?query_message=Brain-derived%20neurotrophic%20factor%20%28BDNF%29%20%28Fragment%29&from_=0&size=10&fields=uniprot_id&fields=protein_name
- In this scenario, doc_count is 14901 in the top-ranked bucket (the bucket is top ranked because the document with the highest ES _score is the highest among all the buckets).
- One can safely assume that the majority of the documents in the 14901 documents are irrelevant to user intentions. However, just like the object in hits array, which is the highest ranked document in the bucket, a few documents in the 14901 documents are extremely relevant. For instance, the protein with uniprot_id Q1HN34 also has the name "Brain-derived neurotrophic factor", which is the same as the protein in the highest-ranked document, Q1X703
- I think in such a scenario, to proceed from the intermediate page Genes to a gene-specific page with a group of genes that are relevant to user intention, we will still need to use full-text search, but this time on a limited number of indices. An endpoint exists for this purpose: /ftx/text_search/indices_in_page/ (the ko_name field needs to be removed; otherwise the top-ranked documents will be documents with ko_name and/or protein_name similar to user input due to tf-idf calculations). If we decide to go with this approach, I'll need to add taxon_distance information.
- Even with the compromise above, we need to further compromise by limiting the size of the documents returned, which might include irrelevant entries, or worse, exclude relevant entries.
- Another question is what we use for the URL of the gene-specific page. One idea is to use |, or some other character that is unlikely to appear in user input, as a replacement.

jonrkarr commented 4 years ago

I'm not sure I follow scenario #2. We can discuss more tomorrow (5/27/2020).

It sounds like you're suggesting that we create another intermediate layer of pages which display records from /ftx/text_search/indices_in_page/. I think this is an unnecessarily complex design.

I agree that some unspecific/ambiguous queries can return a large number of mostly uninformative results. Arguably, most search engines have difficulty with queries like this. I think users are accustomed to needing to try alternative queries when this occurs, and we can expect our users to do the same. I don't think we need to do any extra work to help disambiguate such queries.

lzy7071 commented 4 years ago

Assign frontend_group_id to documents without ko_number, the value of which will be the document's uniprot_id.
Assign frontend_group_id to documents with ko_number, the value of which will be the document's ko_number.
Group by frontend_group_id in ftx.
Include species_name in the returned documents.
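
A rough sketch of the grouping rule in the steps above, written as plain Python rather than as the actual full-text-search aggregation. It assumes each document is a dict with a `uniprot_id` key and an optional `ko_number` that is either a string or null; the function names are hypothetical.

```python
from collections import defaultdict


def assign_group_id(doc):
    """Use ko_number when present, otherwise fall back to uniprot_id."""
    ko_number = doc.get("ko_number")
    return ko_number if ko_number else doc["uniprot_id"]


def group_documents(docs):
    """Group documents by frontend_group_id, preserving input order."""
    groups = defaultdict(list)
    for doc in docs:
        group_id = assign_group_id(doc)
        # Store a copy annotated with the id used for grouping.
        groups[group_id].append({**doc, "frontend_group_id": group_id})
    return dict(groups)
```

With this rule, documents sharing a ko_number land in one group, while each document without a ko_number forms its own single-member group keyed by its uniprot_id.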

lzy7071 commented 4 years ago

/ftx/text_search/gene_ranked_by_ko/ is working as we discussed in the meeting: genes with ko_number are grouped (dehydrogenase), and genes without ko_number are "grouped" by uniprot_id (Brain derived neurotropic factor BDNF Fragment).

jonrkarr commented 4 years ago

Replaced with separate issues: #105, #106.

KarrLab / datanator_rest_api