The current use case is variations only (e.g. in the context of a VCF), but the objects/API are structured in such a way as to (relatively) smoothly accommodate other kinds of terms*.
Very naively implemented. Tune-ups in #311 should include fetching everything in a single query rather than iteratively fetching each study once study IDs are acquired. I also wonder whether there's anything else we can do cache-wise to optimize for cases where a user is performing repeated lookups on things that are probably located close together in the genome, as with VCFs (seqrepo already uses a big LRU cache for sequence lookups, so that might be all that's necessary).
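For what it's worth, the caching idea is cheap to try at the normalization layer. A minimal sketch with `functools.lru_cache` (the normalizer call here is a made-up stand-in; the point is the pattern, not the lookup):

```python
from functools import lru_cache

# Hypothetical stand-in for the real normalizer call -- the fake ID it
# returns is for illustration only.
def _normalize_uncached(term: str):
    return f"ga4gh:VA.fake-{term}" if term else None

@lru_cache(maxsize=4096)
def normalize_term(term: str):
    # Repeated lookups over nearby VCF records tend to repeat the same
    # terms, so an LRU cache skips redundant normalization round trips.
    return _normalize_uncached(term)
```

If seqrepo's own cache already absorbs the expensive part, this may be redundant, but `normalize_term.cache_info()` makes it easy to measure.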
The ability to submit multiple terms for the same kind of entity raises questions about transparent management of redundancy, failed lookups, etc. I added an extra field to supply the normalized ID, if available, for each term, so that the client can tell if terms are normalizing to the same ID or if they fail to normalize at all:
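Roughly like this (shown as a Python dict; all IDs are made up and any field name beyond the normalized-ID idea is a sketch, not the final schema):

```python
# Illustrative response fragment: each submitted term is echoed back with
# its normalized ID, or None if normalization failed. Here two terms
# normalize to the same (fake) ID and one fails to normalize.
response = {
    "queries": [
        {"term": "BRAF V600E", "normalized_id": "ga4gh:VA.example1"},
        {"term": "7-140453136-A-T", "normalized_id": "ga4gh:VA.example1"},
        {"term": "not-a-variant", "normalized_id": None},
    ],
    "studies": [],
}

# Client side: spot redundant terms (duplicate IDs) and failed lookups.
resolved = [q["normalized_id"] for q in response["queries"] if q["normalized_id"]]
failed = [q["term"] for q in response["queries"] if q["normalized_id"] is None]
```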
This also makes it possible to understand which studies correspond to each search term.
Caveat about including other kinds of entities: I can understand why we'd want it, but it potentially makes search semantics very weird. Say you search for drug A, drug B, variation A, and variation B. Do you include any study that matches any of those terms, i.e. OR across everything? (I don't know what the use case for that is, but I think it's the most intuitive result.) Or do you only include studies that pair a variation with a drug, i.e. (variation A OR variation B) AND (drug A OR drug B)? (Actually a pretty reasonable thing to search for, but a bit counterintuitive to frame.)
A note about response structure. As a first pass, I made it consistent with `get_search_studies` -- `response.studies` is just a list of studies. If you wanted to figure out which ones came from a given search term, you could take the normalized ID from `response.queries` and then filter through `response.studies` based on the value in `study.variant.definingContext.id` (for ProteinSequenceConsequences, or something else for CategoricalVariations). Alternatively, you could group `response.studies` into a dict where the key is the normalized ID and the value is the list of studies for that ID. However, that becomes VERY messy if you do want to support searching by multiple entity types (what would the key be, a concatenated string or something?).
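Both client-side options can be sketched like this (assuming study dicts expose `variant.definingContext.id` as described for ProteinSequenceConsequences; names are illustrative):

```python
from collections import defaultdict

def studies_for_term(response, normalized_id):
    """Option 1: filter response['studies'] down to those whose variant's
    definingContext ID matches a query's normalized ID."""
    return [
        s for s in response["studies"]
        if s["variant"]["definingContext"]["id"] == normalized_id
    ]

def group_by_normalized_id(response):
    """Option 2: group studies into {normalized ID: [studies]}."""
    grouped = defaultdict(list)
    for s in response["studies"]:
        grouped[s["variant"]["definingContext"]["id"]].append(s)
    return dict(grouped)
```

Option 2 is where the multiple-entity-type question bites: once the grouping key can be a drug ID or a variation ID (or some combination), a flat `{id: [studies]}` dict stops being a natural shape.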
close #290