cancervariants / metakb

Central repository for the VICC metakb web application
MIT License

feat: add batch search by variations #361

Closed jsstevenson closed 2 weeks ago

jsstevenson commented 3 weeks ago

close #290

The current use case is variations only (e.g. in the context of a VCF), but the objects/API are structured so that they can (relatively) smoothly accommodate other kinds of terms*.

This is a very naive implementation. Tune-ups in #311 should include fetching everything in a single query, rather than iteratively fetching each study once the study IDs are acquired. I also wonder whether there's anything else we can do cache-wise to optimize for cases where a user performs repeated lookups on things that are probably located close together in the genome, such as the variants in a VCF. (seqrepo already uses a big LRU cache for sequence lookups, so that might be all that's necessary.)
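To make the caching idea concrete, here is a minimal sketch of memoizing per-term normalization with the standard library's `functools.lru_cache`. It is not how metakb does this: `_normalize_uncached`, `normalize`, `batch_normalize`, and the `CALLS` list are hypothetical names, and the lookup table stands in for a real (expensive) normalizer call.

```python
from functools import lru_cache
from typing import List, Optional, Tuple

# Records which terms actually reached the stand-in normalizer (illustration only).
CALLS: List[str] = []


def _normalize_uncached(term: str) -> Optional[str]:
    """Hypothetical stand-in for an expensive normalization call."""
    CALLS.append(term)
    known = {"EGFR L858R": "ga4gh:VA.S41CcMJT2bcd8R4-qXZWH1PoHWNtG2PZ"}
    return known.get(term)  # None means the term failed to normalize


@lru_cache(maxsize=1024)
def normalize(term: str) -> Optional[str]:
    """Cached wrapper: repeated terms are served from memory, including failures."""
    return _normalize_uncached(term)


def batch_normalize(terms: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Normalize each term in order, pairing it with its ID (or None)."""
    return [(term, normalize(term)) for term in terms]
```

A cache like this only helps with exact repeats of the same term. Lookups that are merely nearby in the genome would still depend on lower-level caching such as the one seqrepo provides.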

The ability to submit multiple terms for the same kind of entity raises questions about transparent management of redundancy, failed lookups, etc. I added an extra field to supply the normalized ID, if available, for each term, so that the client can tell if terms are normalizing to the same ID or if they fail to normalize at all:

  "query": {
    "variations": [
      {
        "term": "EGFR L858R",
        "normalized_id": "ga4gh:VA.S41CcMJT2bcd8R4-qXZWH1PoHWNtG2PZ"
      },
      {
        "term": "matthew cannon 2"
      },
      {
        "term": "ga4gh:VA.S41CcMJT2bcd8R4-qXZWH1PoHWNtG2PZ",
        "normalized_id": "ga4gh:VA.S41CcMJT2bcd8R4-qXZWH1PoHWNtG2PZ"
      }
    ]
  }

This also makes it possible to understand which studies correspond to each search term.
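As a rough client-side illustration, the sketch below groups the submitted terms by `normalized_id` and collects the ones that failed to normalize. It assumes only the `query` shape shown above; the function name `group_terms` is hypothetical and not part of the metakb API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_terms(query: dict) -> Tuple[Dict[str, List[str]], List[str]]:
    """Split query terms into groups sharing a normalized ID, plus failures."""
    by_id: Dict[str, List[str]] = defaultdict(list)
    failed: List[str] = []
    for entry in query.get("variations", []):
        normalized_id = entry.get("normalized_id")
        if normalized_id is None:
            # No normalized ID was supplied, so the term did not normalize.
            failed.append(entry["term"])
        else:
            by_id[normalized_id].append(entry["term"])
    return dict(by_id), failed


# The example query from the response above.
query = {
    "variations": [
        {
            "term": "EGFR L858R",
            "normalized_id": "ga4gh:VA.S41CcMJT2bcd8R4-qXZWH1PoHWNtG2PZ",
        },
        {"term": "matthew cannon 2"},
        {
            "term": "ga4gh:VA.S41CcMJT2bcd8R4-qXZWH1PoHWNtG2PZ",
            "normalized_id": "ga4gh:VA.S41CcMJT2bcd8R4-qXZWH1PoHWNtG2PZ",
        },
    ]
}

groups, failures = group_terms(query)
```

With the example query, `groups` has a single key holding both the free-text term and the ID-style term, which shows they are redundant, and `failures` holds the one term that did not normalize.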