cancervariants / gene-normalization

Services and guidelines for normalizing genes
https://gene-normalizer.readthedocs.io/latest/
MIT License

Provide `starts_with` endpoint/database method #348

Open jsstevenson opened 5 months ago

jsstevenson commented 5 months ago

@katiestahl has been working on setting up autocomplete for VarCat for all of the normalizers. Broadly, we'd like something where the input is whatever the user has typed so far, and the output is a list of objects, each containing:

  1. the completed term
  2. what kind of entity the term is (e.g. symbol or alias)
  3. the concept ID for the normalized concept that the completed term maps to
  4. (optionally) some sort of human-readable name for the normalized concept, e.g. the gene symbol
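
As a rough sketch of the object described in the list above, something like the following dataclass would cover the four fields. The class name `Completion` and its field names are hypothetical (nothing like this exists in the normalizer yet), and the example values are only illustrative.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Completion:
    """One autocomplete suggestion (hypothetical shape, not an existing normalizer class)."""

    term: str  # 1. the completed term
    item_type: str  # 2. kind of entity, e.g. "symbol" or "alias"
    concept_id: str  # 3. ID of the normalized concept the term maps to
    label: Optional[str] = None  # 4. optional human-readable name, e.g. the gene symbol


# Illustrative values only: an alias that completes to a gene concept.
example = Completion(term="BRAF1", item_type="alias", concept_id="hgnc:1097", label="BRAF")

# asdict() gives a plain dict that is easy to serialize in an API response.
payload = asdict(example)
```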

We could set this up pretty easily for the PostgreSQL backend with something like an `ILIKE 'TERM%'` condition (wildcard after the typed text, so it matches prefixes), but we aren't running any PostgreSQL in production, so that doesn't help our immediate problem.
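
For the PostgreSQL case, a prefix match puts the wildcard after the typed text, and any `%` or `_` the user types should be escaped so it isn't treated as a wildcard. A minimal sketch, assuming a hypothetical `gene_terms` table with `term`, `item_type`, and `concept_id` columns, and psycopg-style `%s` placeholders:

```python
from typing import Tuple


def prefix_pattern(term: str) -> str:
    """Turn raw user input into a LIKE/ILIKE pattern that matches it as a prefix."""
    # Backslash is PostgreSQL's default LIKE escape character, so escape it first,
    # then escape the two LIKE wildcards.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return escaped + "%"


# Table and column names are assumptions made for illustration.
COMPLETION_SQL = (
    "SELECT term, item_type, concept_id "
    "FROM gene_terms "
    "WHERE term ILIKE %s "
    "ORDER BY term "
    "LIMIT %s;"
)


def build_completion_query(term: str, limit: int = 10) -> Tuple[str, Tuple[str, int]]:
    """Return (sql, params), ready to hand to a DB-API cursor's execute()."""
    return COMPLETION_SQL, (prefix_pattern(term), limit)
```

With a psycopg cursor this would be used as `cursor.execute(*build_completion_query("braf"))`.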

For DynamoDB, it's a little more complicated, and involves some combination of indexes/superfluous columns/a reworked schema. The best that I've come up with so far is a Global Secondary Index where the hash key is the `item_type` column and the sort key is the `label_and_type` column. This lets you run queries like `diseases.query(IndexName="CompletionIndex", KeyConditionExpression=Key("item_type").eq("alias") & Key("label_and_type").begins_with("braf"))`.
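
To make the index idea concrete, here is a sketch of the request parameters involved, written as plain dicts so it runs without AWS access. The index and attribute names come from the description above. The `KEYS_ONLY` projection, the string attribute types, and the lowercasing of the prefix are assumptions, as is the idea that the table's own keys carry enough information (e.g. the concept ID) for a completion result.

```python
from typing import Any, Dict

INDEX_NAME = "CompletionIndex"


def completion_index_update(table_name: str) -> Dict[str, Any]:
    """Build kwargs for a low-level update_table call that adds the index."""
    return {
        "TableName": table_name,
        # Every attribute used in an index key schema must be declared here.
        "AttributeDefinitions": [
            {"AttributeName": "item_type", "AttributeType": "S"},
            {"AttributeName": "label_and_type", "AttributeType": "S"},
        ],
        "GlobalSecondaryIndexUpdates": [
            {
                "Create": {
                    "IndexName": INDEX_NAME,
                    "KeySchema": [
                        {"AttributeName": "item_type", "KeyType": "HASH"},
                        {"AttributeName": "label_and_type", "KeyType": "RANGE"},
                    ],
                    # Assumption: index keys plus the table's primary key suffice.
                    # Provisioned-capacity tables also need ProvisionedThroughput.
                    "Projection": {"ProjectionType": "KEYS_ONLY"},
                }
            }
        ],
    }


def completion_query_kwargs(
    table_name: str, item_type: str, prefix: str, limit: int = 10
) -> Dict[str, Any]:
    """Build kwargs for a low-level query call equivalent to the example above."""
    return {
        "TableName": table_name,
        "IndexName": INDEX_NAME,
        "KeyConditionExpression": (
            "item_type = :item_type AND begins_with(label_and_type, :prefix)"
        ),
        "ExpressionAttributeValues": {
            ":item_type": {"S": item_type},
            # Assumption: stored labels are lowercase, so lowercase the input too.
            ":prefix": {"S": prefix.lower()},
        },
        "Limit": limit,
    }
```

Each dict would be unpacked into the matching call on a `boto3.client("dynamodb")`, i.e. `client.update_table(**...)` and `client.query(**...)`.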

^^ Note that this forces you to commit to a specific item type. If you wanted to get completions for ALL item types in one query, you'd need to create another index where the hash key is some sort of dummy column with the same value every time, i.e. `diseases.query(IndexName="ItemNeutralCompletionIndex", KeyConditionExpression=Key("dummy").eq("dummy") & Key("label_and_type").begins_with("braf"))`.
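
One alternative to a second, dummy-keyed index (which would put every item under a single partition key value) is to run the per-type query once for each item type and merge the results client-side. This is only a sketch of that idea, not something proposed in the thread: `query_fn` is a hypothetical callable wrapping the `CompletionIndex` query, and the lowercasing of the prefix is an assumption about how labels are stored.

```python
from typing import Callable, Dict, Iterable, List

# (item_type, prefix) -> matching items, e.g. a thin wrapper around the index query.
QueryFn = Callable[[str, str], List[Dict[str, str]]]


def complete_all_types(
    query_fn: QueryFn, item_types: Iterable[str], prefix: str, limit: int = 10
) -> List[Dict[str, str]]:
    """Run one prefix query per item type and merge the results."""
    prefix = prefix.lower()  # assumption: stored labels are lowercase
    merged: List[Dict[str, str]] = []
    for item_type in item_types:
        merged.extend(query_fn(item_type, prefix))
    # Sort on the index's sort key so output order doesn't depend on query order.
    merged.sort(key=lambda item: item["label_and_type"])
    return merged[:limit]
```

The trade-off is one request per item type instead of one request total, in exchange for not maintaining the extra index.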

katiestahl commented 4 months ago

If there's any way to make this work like a "contains" match, that would be ideal, since in most use cases the text the user is searching for could be in the middle of the term.

jsstevenson commented 4 months ago

Yeah, if you need contains or fzf-style fuzzy matching, you need an indexer service like Elasticsearch (or a different DB backend). In that event, it might not be necessary or possible to implement this within the normalizer codebases themselves. Maybe it could live in the API infrastructure repos or in a standalone repo.

In light of that, maybe this should be closed.