KarrLab / datanator_rest_api

An OAS3-compliant REST API for the Datanator integrated database
MIT License

Non-coding RNA search results should be aggregated #130

Closed. jonrkarr closed 4 years ago

jonrkarr commented 4 years ago

In the example below, tRNA-Ala appears as two separate hits in the search results. These hits should be aggregated so that users don't see the same result twice. https://testapi.datanator.info/ftx/text_search/gene_ranked_by_ko/?query_message=trna%20ala&from_=0&size=10&fields=definition

Here's the full endpoint that the frontend is currently using. https://testapi.datanator.info/ftx/text_search/gene_ranked_by_ko/?query_message=trna%20ala&from_=0&size=10&fields=orthodb_id&fields=orthodb_name&fields=gene_name&fields=gene_name_alt&fields=gene_name_orf&fields=gene_name_oln&fields=entrez_id&fields=protein_name&fields=entry_name&fields=uniprot_id&fields=definition&fields=ec_number
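For reference, the two URLs above differ only in their repeated `fields` parameters. A minimal sketch of how such a URL could be assembled with the Python standard library follows; the helper name `build_search_url` is hypothetical and is not part of the Datanator codebase, while the base URL and parameter names are taken from the URLs quoted above.

```python
from urllib.parse import quote, urlencode

# Base URL copied from the issue; the helper below is illustrative only.
BASE_URL = "https://testapi.datanator.info/ftx/text_search/gene_ranked_by_ko/"


def build_search_url(query, fields, from_=0, size=10):
    """Build a search URL with one `fields=` pair per requested field."""
    params = [("query_message", query), ("from_", from_), ("size", size)]
    params += [("fields", field) for field in fields]
    # quote_via=quote encodes spaces as %20 (the default would use "+").
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    print(build_search_url("trna ala", ["definition"]))
    print(build_search_url("trna ala", ["orthodb_id", "orthodb_name", "gene_name"]))
```

Passing the parameters as a list of pairs, rather than a dict, keeps the repeated `fields` keys and their order.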

jonrkarr commented 4 years ago

(Note for me) Here's the frontend URL that is affected by this issue: http://localhost:3000/search/trna%20ala/

lzy7071 commented 4 years ago

Yes. I need to fix the tokenization https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html. This is related to https://github.com/KarrLab/datanator_rest_api/issues/127#issuecomment-706275800

lzy7071 commented 4 years ago

Note to self: tRNAs are now aggregated using orthodb_id.keyword (https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html), but this is not deployed yet. I still need to find out why some rRNAs are not searchable. Tokenization options are the most likely cause, because RNA28SN, which is also an rRNA entry, can be found using https://testapi.datanator.info/ftx/text_search/gene_ranked_by_ko/?query_message=RNA28SN&from_=0&size=10&fields=orthodb_id.
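The exact mapping and query used by the API are not shown in this thread. As a rough sketch of the approach described above, the following builds an Elasticsearch mapping with a `keyword` sub-field on `orthodb_id` and a request body that groups hits with a `terms` aggregation on `orthodb_id.keyword`. The function names, the aggregation names (`by_orthodb_id`, `top_hit`), and the choice of `query_string` are assumptions made for illustration.

```python
import json


def multi_field_mapping():
    """Mapping sketch: `orthodb_id` is analyzed as text for search, and
    also indexed verbatim as `orthodb_id.keyword` for aggregations."""
    return {
        "mappings": {
            "properties": {
                "orthodb_id": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword"}},
                }
            }
        }
    }


def aggregated_search_body(query, size=10):
    """Request body sketch: one bucket per distinct orthodb_id, with the
    best-scoring document of each bucket as its representative."""
    return {
        "size": 0,  # return only the aggregation buckets, not raw hits
        "query": {"query_string": {"query": query}},  # assumed query type
        "aggs": {
            "by_orthodb_id": {
                "terms": {"field": "orthodb_id.keyword", "size": size},
                "aggs": {"top_hit": {"top_hits": {"size": 1}}},
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(multi_field_mapping(), indent=2))
    print(json.dumps(aggregated_search_body("trna ala"), indent=2))
```

Aggregating on the `keyword` sub-field rather than the analyzed `text` field means each bucket corresponds to a whole `orthodb_id` value instead of to individual tokens.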

lzy7071 commented 4 years ago

I thought the issue stemmed from Elasticsearch's standard tokenizer handling . in strings differently from the tokenizer for the text field type. After a few hours of tinkering with analyzers and tokenizers, I realized that the record with orthodb_id somehow just wasn't transferred to Elasticsearch, because https://testapi.datanator.info/ftx/text_search/gene_ranked_by_ko/?query_message=LSU5.8S&from_=0&size=10&fields=orthodb_id&fields=definition returns the proper result. https://testapi.datanator.info/ftx/text_search/gene_ranked_by_ko/?query_message=LSU5.8S&from_=0&size=10&fields=orthodb_id&fields=_id now works.
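One way to rule tokenization in or out for a string such as LSU5.8S is to ask Elasticsearch's `_analyze` API how a given tokenizer splits it. The sketch below only builds the request bodies; sending them to a cluster is left out, and the helper name `analyze_request` is hypothetical. The tokenizer names used are built-in Elasticsearch tokenizers.

```python
import json


def analyze_request(text, tokenizer="standard"):
    """Body for POST /_analyze that tokenizes `text` with `tokenizer`."""
    return {"tokenizer": tokenizer, "text": text}


if __name__ == "__main__":
    # Compare how several built-in tokenizers would treat the same input.
    for tokenizer in ("standard", "whitespace", "keyword"):
        body = analyze_request("LSU5.8S", tokenizer)
        print("POST /_analyze", json.dumps(body))
```

If the tokens returned for the indexed value and for the query text match, the missing results are more likely a data problem than an analysis problem, which is consistent with what was found here.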

jonrkarr commented 4 years ago

Looks good. Thanks for persisting!