Closed jonrkarr closed 4 years ago
There are two existing endpoints for retrieving information using uniprot_id as the query parameter:
- http://api.datanator.info/proteins/precise_abundance/?uniprot_id=Q75QI0 This endpoint retrieves abundance information only.
- http://api.datanator.info/proteins/meta/meta_combo/?uniprot_id=Q75QI0 This endpoint retrieves everything in our database regarding the protein with uniprot_id Q75QI0 (ancestor information is currently projected out).
Are these the functions you had in mind?
This is great. I started implementing this in the frontend. Here's an example.
http://api.datanator.info/proteins/meta/meta_combo
This is perfect. I already implemented this into the frontend.
http://api.datanator.info/proteins/precise_abundance/
Why does the endpoint need the kegg_orthology argument? I think this can be removed.
RNA half-lives
Are there genes which don't belong to KO groups? If so, we also need a separate endpoint to retrieve half-lives of those genes.
Protein modifications
Same question, are there genes which don't belong to KO groups?
RNA modifications
I don't think another endpoint for individual genes is critical because all of the RNA for which we have modification data belong to KO groups.
Protein abundance
http://api.datanator.info/proteins/precise_abundance/
- Why does the endpoint need the kegg_orthology argument? I think this can be removed.
This was built for an either-or situation. uniprot_id takes precedence over kegg_orthology, but a user can also get all the abundance data for a KEGG group by providing a kegg_orthology id. If the user gives values for both parameters, an error message is returned. I have now taken the option out because it does create confusion for users.
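For reference, the either-or behaviour described above (before the kegg_orthology option was removed) could be sketched roughly as below. This is not the actual API code: the function name resolve_abundance_query, the error messages, and the example KO id K00001 are assumptions made for illustration.

```python
from typing import Optional


def resolve_abundance_query(uniprot_id: Optional[str] = None,
                            kegg_orthology: Optional[str] = None) -> dict:
    """Return the single query parameter to use, or raise if the request is ambiguous.

    Hypothetical helper: mirrors the either-or rule described in the thread.
    """
    if uniprot_id and kegg_orthology:
        # Both parameters given: the thread says an error message is returned.
        raise ValueError("Provide either uniprot_id or kegg_orthology, not both.")
    if uniprot_id:
        # Look up abundance data for one protein.
        return {"uniprot_id": uniprot_id}
    if kegg_orthology:
        # Look up abundance data for every protein in the KO group.
        return {"kegg_orthology": kegg_orthology}
    raise ValueError("One of uniprot_id or kegg_orthology is required.")


if __name__ == "__main__":
    print(resolve_abundance_query(uniprot_id="Q75QI0"))
    print(resolve_abundance_query(kegg_orthology="K00001"))  # hypothetical KO id
```

Dropping the kegg_orthology branch leaves a single required parameter, which is the simpler interface the endpoint has now.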
- For consistency with the other data endpoints, the endpoint should have an optional argument for an organism, and calculate taxonomic distances to that organism.
For consistency with the KO protein abundance endpoint, it would be helpful to include these additional keys in the output:
- protein_name
- gene_name
- species_name
The endpoint https://api.datanator.info/proteins/precise_abundance/?uniprot_id=Q54JE4&taget_species=homo%20sapiens&taxon_distance=true is up now.
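A small sketch of how a client might build that request URL, in case it helps with the frontend integration. The helper name abundance_url is an assumption; the query parameter names (including the taget_species spelling) are copied from the URL above rather than from any API documentation, and the sketch assumes taxon_distance is only sent together with a species.

```python
from typing import Optional
from urllib.parse import quote, urlencode

BASE_URL = "https://api.datanator.info/proteins/precise_abundance/"


def abundance_url(uniprot_id: str, target_species: Optional[str] = None) -> str:
    """Build the precise_abundance URL (hypothetical helper, not part of the API)."""
    params = {"uniprot_id": uniprot_id}
    if target_species is not None:
        # Parameter name is spelled "taget_species" in the URL posted in this thread.
        params["taget_species"] = target_species
        # Assumption: taxonomic distances are requested whenever a species is given.
        params["taxon_distance"] = "true"
    # quote_via=quote encodes spaces as %20 instead of "+", matching the URL above.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    print(abundance_url("Q54JE4", "homo sapiens"))
```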
RNA half-lives
Are there genes which don't belong to KO groups? If so, we also need a separate endpoint to retrieve half-lives of those genes.
Protein modifications
Same question, are there genes which don't belong to KO groups?
Yes to both questions and definitely a good point! At the moment I think full text search only returns data in which ko_number is not null, but I think I can modify the aggregation step to change that. Ignore the quotes below.
Assuming documents with ko_number == null are returned by /ftx/text_search/gene_ranked_by_ko/, taking the protein with uniprot_id = Q75QI0 (http://api.datanator.info/ftx/text_search/num_of_index/?query_message=Q75QI0&index=protein&from_=0&size=10&fields=uniprot_id ) as an example:
- User searches for the specific uniprot_id, the document of which has a ko_number field with the value null. We can still use /proteins/precise_abundance/uniprot_id since
- User searches for a name, say dehydrogenase, ignoring the results with ko_number. All results with ko_number == null will be aggregated together by full text search (https://api.datanator.info/ftx/text_search/gene_ranked_by_ko/?query_message=dehydrogenase&from_=0&size=10&fields=ko_name&fields=ko_number&fields=gene_name&fields=gene_name_alt&fields=gene_name_orf&fields=gene_name_oln&fields=entrez_id&fields=protein_name&fields=entry_name&fields=uniprot_id&fields=ec_number).
RNA modifications
I don't think another endpoint for individual genes is critical because all of the RNA for which we have modification data belong to KO groups.
Agreed.
The protein abundance endpoint is perfect. I integrated this into the frontend.
RNA half-lives
Are there genes which don't belong to KO groups? If so, we also need a separate endpoint to retrieve half-lives of those genes.
Protein modifications
Same question, are there genes which don't belong to KO groups?
Yes to both questions and definitely a good point! At the moment I think full text search only returns data in which ko_number is not null, but I think I can modify the aggregation step to change that. Ignore the quotes below.
/ftx/text_search/gene_ranked_by_ko/? now returns genes with no KEGG orthology id. Two scenarios:
1. User searches for a protein using a specific uniprot_id or sequence, e.g. https://testapi.datanator.info/ftx/text_search/gene_ranked_by_ko/?query_message=Q1HN34&from_=0&size=10&fields=uniprot_id&fields=protein_name
doc_count in the bucket is 1. One can just use /proteins/precise_abundance/ to proceed.
2. Scenario 1, where doc_count is 1, is rare. A more likely scenario is when a user searches with an ambiguous string, such as a protein name: http://testapi.datanator.info/ftx/text_search/gene_ranked_by_ko/?query_message=Brain-derived%20neurotrophic%20factor%20%28BDNF%29%20%28Fragment%29&from_=0&size=10&fields=uniprot_id&fields=protein_name
doc_count is 14901 in the top-ranked bucket (the bucket is top ranked because its document with the highest ES _score is the highest among all the buckets).
hits array, which is the highest ranked document in the bucket, a few documents in the 14901 documents are extremely relevant. For instance, the protein with uniprot_id Q1HN34 also has the name "Brain-derived neurotrophic factor", which is the same as the protein in the highest-ranked document, Q1X703.
Genes to a gene-specific page with a group of genes that are relevant to user intention, we will still need to use full-text search, but this time on a limited number of indices. An endpoint exists for this purpose: /ftx/text_search/indices_in_page/ (the ko_name field needs to be removed, otherwise the top ranked documents will be documents with ko_name and/or protein_name similar to user input due to tf-idf calculations). If we decide to go with this approach, I'll need to add taxon_distance information.
| , or some other characters that won't be likely to appear in user input, as a replacement.
I'm not sure I follow scenario #2. We can discuss more tomorrow (5/27/2020).
It sounds like you're suggesting that we create another intermediate layer of pages which display records from /ftx/text_search/indices_in_page/. I think this is an unnecessarily complex design.
I agree that some unspecific/ambiguous queries can return a large number of mostly uninformative results. Arguably, most search engines have difficulty with queries like this. I think users are accustomed to needing to try alternative queries when this occurs, and we can expect our users to do the same. I don't think we need to do any extra work to help disambiguate such queries.
- Assign frontend_group_id to documents without ko_number, the value of which will be the document's uniprot_id.
- Assign frontend_group_id to documents with ko_number, the value of which will be the document's ko_number.
- Group by frontend_group_id in ftx.
- Include species_name in the returned documents.
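To make the grouping rule concrete, here is a rough plain-Python sketch of the logic. The real grouping would happen in the full-text search aggregation, not in client code; the function names, the sample documents, and the ids P00001, P00002 and K00001 are made up for illustration.

```python
from collections import defaultdict


def frontend_group_id(doc: dict) -> str:
    """Use ko_number when present, otherwise fall back to the document's uniprot_id."""
    return doc.get("ko_number") or doc["uniprot_id"]


def group_documents(docs: list) -> dict:
    """Group documents by frontend_group_id (illustrative stand-in for the ftx aggregation)."""
    groups = defaultdict(list)
    for doc in docs:
        groups[frontend_group_id(doc)].append(doc)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical sample documents; only the field names come from the thread.
    docs = [
        {"uniprot_id": "P00001", "ko_number": "K00001", "species_name": "Escherichia coli"},
        {"uniprot_id": "P00002", "ko_number": "K00001", "species_name": "Homo sapiens"},
        {"uniprot_id": "Q1HN34", "ko_number": None, "species_name": "unknown"},
    ]
    for group_id, members in group_documents(docs).items():
        print(group_id, [d["uniprot_id"] for d in members])
```

With this rule, genes that share a KO number land in one group, while a gene without a KO number forms its own group keyed by uniprot_id.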
/ftx/text_search/gene_ranked_by_ko/ is working as we discussed in the meeting: genes with ko_number are grouped (dehydrogenase), and genes without ko_number are "grouped" by uniprot_id (Brain derived neurotrophic factor BDNF Fragment).
Replaced with separate issues: #105, #106.
Endpoints that need to be extended:
The first endpoint is the most important to extend. The others could be skipped until the future as all RNAs are identified by KO groups anyway (to my understanding).
From the perspective of the frontend, there are at least two ways to do this. Each would be equally easy to implement into the frontend.