ga4gh-discovery / ga4gh-case-discovery

A framework for searching genomic data sharing services
Apache License 2.0

which operators will be supported #36

Open colinveal opened 6 years ago

colinveal commented 6 years ago

Sorry if I've missed it somewhere in the issues, but I couldn't find any discussion of which operators the API will support and where they will be included. In the examples I created for the logic models I included Cafe Variome operators within the components themselves. I don't really mind how we include or format the operators.

The types of operators in Cafe Variome include (taken from the genotype-to-phenotype query API):

Text comparators: "IS", "ISNOT", "ISLIKE[CONTAINS]", "ISLIKE[BEGIN]", "ISLIKE[END]", "ISNOTLIKE"
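As a rough illustration of how the text comparators above might behave on plain text, here is a minimal Python sketch. The function name `text_match` is hypothetical, and two behaviours are assumptions rather than anything stated in this thread: matching is case-insensitive, and "ISNOTLIKE" is treated as the negation of "ISLIKE[CONTAINS]".

```python
def text_match(operator: str, value: str, pattern: str) -> bool:
    """Evaluate one text comparator against a stored value.

    Assumptions (not from the thread): case-insensitive matching, and
    ISNOTLIKE meaning "does not contain the pattern".
    """
    v, p = value.lower(), pattern.lower()
    if operator == "IS":
        return v == p
    if operator == "ISNOT":
        return v != p
    if operator == "ISLIKE[CONTAINS]":
        return p in v
    if operator == "ISLIKE[BEGIN]":
        return v.startswith(p)
    if operator == "ISLIKE[END]":
        return v.endswith(p)
    if operator == "ISNOTLIKE":
        return p not in v
    raise ValueError(f"unknown text comparator: {operator}")
```

For example, `text_match("ISLIKE[CONTAINS]", "vascular dementia", "dementia")` evaluates to true under these assumptions.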

Numerical operators: =, ≠, <, >, <=, >=

Coordinate range operators:

"EXACT" - Data feature must have the same start and stop values as the query
"BEGIN_AT_START" - Data feature must have the same start as the query
"BEGIN_BETWEEN" - Data feature must have a start value that is greater than or equal to the start in the query and less than or equal to the stop in the query
"ONLY_BEGIN_BETWEEN" - Data feature must have a start value that is greater than or equal to the start in the query and less than or equal to the stop in the query, and a stop value that is greater than the stop in the query
"END_AT_STOP" - Data feature must have the same stop as the query
"END_BETWEEN" - Data feature must have a stop value that is greater than or equal to the start in the query and less than or equal to the stop in the query
"ONLY_END_BETWEEN" - Data feature must have a stop value that is greater than or equal to the start in the query and less than or equal to the stop in the query, and a start value that is less than the start in the query
"BEGIN_AND_END_BETWEEN" - Both the start and the stop values of the data feature must be greater than or equal to the start in the query and less than or equal to the stop in the query
"EXCEED" - Data feature must have a lower start value and a higher stop value than the query
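The range operator definitions above translate fairly directly into comparisons on start and stop values. The following Python sketch is one possible reading, assuming integer coordinates with inclusive bounds; the function name `range_match` and its parameter names are hypothetical and not part of any Cafe Variome or GA4GH API.

```python
def range_match(operator, f_start, f_stop, q_start, q_stop):
    """Test a data feature (f_start, f_stop) against a query (q_start, q_stop).

    Assumes integer coordinates and inclusive bounds.
    """
    # Shared building blocks: is the feature's start / stop inside the query?
    start_in = q_start <= f_start <= q_stop
    stop_in = q_start <= f_stop <= q_stop

    checks = {
        "EXACT": f_start == q_start and f_stop == q_stop,
        "BEGIN_AT_START": f_start == q_start,
        "BEGIN_BETWEEN": start_in,
        "ONLY_BEGIN_BETWEEN": start_in and f_stop > q_stop,
        "END_AT_STOP": f_stop == q_stop,
        "END_BETWEEN": stop_in,
        "ONLY_END_BETWEEN": stop_in and f_start < q_start,
        "BEGIN_AND_END_BETWEEN": start_in and stop_in,
        "EXCEED": f_start < q_start and f_stop > q_stop,
    }
    if operator not in checks:
        raise ValueError(f"unknown range operator: {operator}")
    return checks[operator]
```

With a query of 100 to 200, a feature at 150 to 250 satisfies "BEGIN_BETWEEN" and "ONLY_BEGIN_BETWEEN" but not "END_BETWEEN", while a feature at 50 to 300 satisfies only "EXCEED".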

Relequestual commented 6 years ago

Hi @colinveal, I think we thought about this very early on and decided it would need further thought and should wait until after the first version.

I think we will need to consider these non-Boolean logic operators as component-specific filters.

For example, the text comparators you list could be applied to a given phenotype ontology code, and have different implications for filtering on the tree.

For now, I think if such operators are needed, they may need to be applied client side, after receiving the data.

It would be great to nail down real life use cases for each of these, if possible.

I'm a little concerned about raising the barrier to entry too high with coordinate range operators; however, I do understand why we would need them given the problem domain.

I hadn't fully thought through the use of a variant component with coordinate ranges. What do you think would be the most common use case? For single-base variants, only return exact location matches, and for non-single-base variants, return any which overlap that position? Or something else?

I feel we probably need to define this for 1.0.0.

colinveal commented 6 years ago

Hi, yes, I agree this is likely for after the first version, but I was wondering whether there could be a placeholder within the components (or elsewhere) that just specifies an "exact" match for the first version.

I hadn't really considered how the text comparators would work with ontology codes, just plain text (which isn't necessary for the first version), but this is an interesting idea; perhaps there may be a need for ontology operators that relate to the tree.

An example for the text comparators: a subject that has "disease" "IS LIKE" "dementia" and "disease" "IS NOT" "Alzheimer's disease" (alternatively "disease" "IS" "dementia" and "disease" "IS" "!Alzheimer's disease", but then we'd have to reserve the characters for wildcards and negation).

Regarding the range operators, I agree these take some time to understand, and in reality the most common one is "BEGIN_AND_END_BETWEEN".

The use cases we see in our cohorting studies are:

are there > 10 genotyped variants between these coordinates
are there heterozygous alt alleles between these coordinates
are there pathogenic variants between these coordinates
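To make these cohorting questions concrete, here is a minimal Python sketch that answers all three for one region using "BEGIN_AND_END_BETWEEN" semantics. The `Variant` record, its field names (`zygosity`, `pathogenic`), and the function names are hypothetical illustrations, not an existing schema.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical record layout, for illustration only.
    start: int
    stop: int
    zygosity: str      # e.g. "heterozygous" or "homozygous"
    pathogenic: bool


def within(v, q_start, q_stop):
    """BEGIN_AND_END_BETWEEN: start and stop both inside the query range."""
    return q_start <= v.start <= q_stop and q_start <= v.stop <= q_stop


def region_answers(variants, q_start, q_stop, min_count=10):
    """Answer the three use-case questions for one coordinate range."""
    hits = [v for v in variants if within(v, q_start, q_stop)]
    return {
        "more_than_min_count": len(hits) > min_count,
        "any_heterozygous": any(v.zygosity == "heterozygous" for v in hits),
        "any_pathogenic": any(v.pathogenic for v in hits),
    }
```

Each answer is a yes/no value, which fits a discovery-style query where the service reports whether matching data exists rather than returning the records themselves.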

Relequestual commented 6 years ago

I think we generally want to avoid free text as much as possible, and make that a client side responsibility, mapping to codes.

In terms of how things should work for the first version, I feel the best approach is:

I don't want to define this in the JSON payload till we understand the issues better post 1.0.0. As such, it's not an "exact" match for the whole "search", and I don't want to include this on a per component basis for the first version.

Keywords prefixed with a dash are reserved, so we can later create a keyword, something like -comparitor or -applicator (I don't know what it should be called, maybe different things for different situations), which can be included for a component.

The value of these keys may need to be arrays, or objects, or strings, but I don't think we have the time required to analyse the issue properly. Adding it now risks forcing a breaking change in the next release, which I think we want to avoid (though that may not be possible).
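As a sketch of how a server might treat dash-prefixed reserved keywords once they exist, the Python below separates reserved keys from ordinary ones in a component. Everything here is hypothetical: the function name, the component fields, and the "-comparitor" key and its value are illustrations only, since the thread deliberately leaves the keyword undefined for the first version.

```python
def split_reserved(component):
    """Separate dash-prefixed (reserved) keys from ordinary component keys."""
    reserved = {k: v for k, v in component.items() if k.startswith("-")}
    ordinary = {k: v for k, v in component.items() if not k.startswith("-")}
    return reserved, ordinary


# Hypothetical component; field names and values are assumptions.
component = {
    "start": 100,
    "stop": 200,
    "-comparitor": "BEGIN_AND_END_BETWEEN",
}

reserved, ordinary = split_reserved(component)
```

Under this approach, a first-version server could ignore or reject anything in `reserved` without changing how the ordinary keys are handled.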

Agree / disagree?

colinveal commented 6 years ago

Sounds like a plan, agreed for the first version.