ga4gh-beacon / beacon-v2

Unified repository for the GA4GH Beacon v2 API standard
Creative Commons Zero v1.0 Universal
22 stars 19 forks source link

`filteringTerms`: Add `excluded` flag #63

Open mbaudis opened 1 year ago

mbaudis commented 1 year ago

There is currently no way to exclude specific ontologyTerms in filter queries although this would be desirable to expand query options (especially regarding the limitation of not having Booleans) and to stay in line with Phenopackets (which has an excluded flag e.g. in PhenotypicFeature.

(There is a current workaround to use some custom filter design, e.g. pre-pending an ontologyTerm in a request with ! - but this is non-standard & only would work for individual Beacons.)

Proposal:

  OntologyFilter:
    type: object
    description: Filter results to include records that contain a specific ontology
      term.
    required:
      - id
    properties:
      id:
        type: string
        description: >-
          Term ID to be queried, using CURIE syntax where possible.
        example: HP:0002664
      includeDescendantTerms:
        type: boolean
        default: true
        description: >-
          Define if the Beacon should implement the ontology hierarchy,
          thus query the descendant terms of `id`.
      similarity:
        type: string
        enum:
          - exact
          - high
          - medium
          - low
        default: exact
        description: >-
          Allow the Beacon to return results which do not match the filter
          exactly, but do match to a certain degree of similarity. The Beacon defines
          the semantic similarity model implemented and how to apply the thresholds
          of 'high', 'medium' and 'low' similarity.
      scope:
        type: string
        description: The entry type to which the filter applies
        example: biosamples
      excluded:
        description: >-
          Flag to indicate whether the filtering term was observed or not. The default is `false`,
          _i.e._ will be selected for **existence** of the term.
        type: boolean
        default: false
mbaudis commented 2 months ago

Use case examples

All breast adenocarcinoma samples by NCIt (NCIT:C5214), excluding triple-negative (NCIT:C71732)

This is a simple use case where a diagnosis is queried w/ the explicit exclusion of one specific subtype. In principle this could be achieved by an AND query for all wanted codes at the same hierarchy level as NCIT:C71732 but this requires understanding of the alternative codes and branch structures etc. Also more Beacon-like: "answer a simple query as good as possible".

filters:
  - id: "NCIT:C5214"
  - id: "NCIT:C71732"
    excluded: true

GET:

?filters=NCIT:C5214,!NCIT:C71732

COVID19 in a non-diabetic patient with cardiovascular risk factors

This is based on an example provided inn the Phenopackets documentation.

filters:
  - id: "NCIT:C2985"                # Diabetes Mellitus
    excluded: true
  - id: "NCIT:C34830"               # Cardiomyopathy"
  - id: "NCIT:C80389"               # Chronic Kidney Disease, Stage 3"
  - id: "NCIT:C3283"                # Obesity"
  - id: "MONDO:0100096"             # COVID-19"

GET:

?filters=!NCIT:C2985,NCIT:C34830,NCIT:C80389,NCIT:C3283,MONDO:0100096
redmitry commented 2 months ago

Hi all,

Just a naïve question. May the exclusion (negation) be included in filtering logic along with AND & OR?. I mean do we need this special case for negation?

Cheers,

D.

mbaudis commented 2 months ago

@redmitry These are actually 2 questions.

May the exclusion (negation) be included in filtering logic along with AND & OR?

If Boolean queries are developed than negated filters would be included in that scenario; just treat them as another filter, with a flag.

do we need this special case for negation

In the Booleans? If allowing Boolean combinations a "normal" query (some filters connected by AND) are just a special case - so yes. It is not so much about dreaming up scenarios where this needs to happen (e.g. retrieving 2 groups of results in a single query); excluded is just a potential flag of any filter (such as similarity).

In any case adding the excluded flag as an option seems straightforward (and is IMO clearer as the procedural similarity flag...[^1]).

[^1]: One actually could re-use similarity like "similarity": "NONE" or such but this would introduce a new special option & not be in line w/ e.g. Phenopackets - so please forget I wrote this.

redmitry commented 2 months ago

In any case adding the excluded flag as an option seems straightforward (and is IMO clearer as the procedural similarity flag...

I never understood the exact meaning of similarity for practical implementation.

From the usecase I understand that we want all but triple negative breast cancers. Without exclusion we would need repeat too many identifiers one by one.

?filters=NCIT:C5214,!NCIT:C71732 looks good, but for the POST I would use generic logical query...

"query": {
    "filters": [ 
         {"id": "NCIT:C5214"}, {"id": "NCIT:C71732"}
    ],
    "filters_logic": "NCIT:C5214 & !NCIT:C71732"
}

I mean that we may use "filters_logic" (similar to @anuradhawick "queryExpression": "(A|B)&(C&D|E&F)&!G"). I would prefer to repeat identifiers for robustness...

Cheers,

D.

gsfk commented 2 months ago

If I understand correctly, the proposal is to be able to make a query and get back results where a particular ontology is absent... is that correct? But this is quite a bit different from the phenopackets excluded field, where a term is present, but marked as "excluded": "true" (ie this condition was specifically ruled out). Obviously terms that are excluded in this sense will not be missing from the data.

Is "missing" or "absent" a less confusing name for this feature? Especially for those of us already using phenopackets with beacon.

anuradhawick commented 2 months ago

@redmitry happy with your implementation, but not sure if it would extend to support includeDescendentTerms, Alphanumeric filters and so on. Especially with LLMs in horizon similarity searches will take prominence searching "Melanoma or a related condition" in a vector space. In this case it would be nice to still have filters as a JSON object with other attributes such as descendent terms, similarity LOW/HIGH etc.

But keen to see where it lands :)

@gsfk one area this approach shines is when Cancer X is covered by datasets from HPO and SNOMED ontologies. Logical expressions makes querying this magically easy. Also when there are mutually exclusive conditions like "from Australia" or "from New Zealand" kind of queries are nearly impossible in one go without some logic.

If we can achieve logic, I think we cover all the bases for "missing", etc.

dbujold commented 2 months ago

I'll echo @gsfk point, it's not clear here whether we're talking about filtering on properties that are explicitly mentioned as excluded (as Phenopackets does), or if we're talking about adding a flag that checks whether a property is absent from a given record.

mbaudis commented 2 months ago

O.k., several convincing arguments although they fall into different camps: 1. the excluded flag would be ambiguous inits use since it doesn't define an explicit test so far (on could make this the default behavior, though - which in case of the diagnosis test would require to see if the excluded one is part of the existing filters etc.) but this requires to many conditions to agree on. The 2. argument is that a full conditional query model already solves such issues.

So excluded is too specific, but an expression of a negating condition (absent, missing, excluded) seems good - unless solving this through the query language (which IMO is still something "emergent").

negated: true? ... which also is a clearer expression of the filter value's intent, less tied to the implementation.

Btw. the comment by @gsfk Gordon actually points to a potential need for a specific excluded: true solution in the PXF sense...

gsfk commented 2 months ago

Surprisingly the name "negated" still seems ambiguous to me, in the same way as above (does it mean "negate this particular term in my query" or "this term is negated in the data")? "Negated" is actually the original name of the phenopackets excluded field.

mbaudis commented 2 months ago

@gsfk While I started w/ pointing to PXF it really is different when using the term negation here. So probably negated or absent is the best choice: One expresses that the term should not be found w/o the requirement that the subject has been explicitly tested.

All: Comments on absent?!

redmitry commented 2 months ago

I implemented a possible prototype and still do not understand the difference between "excluded" and "negation".

"query": {
  "filters": [
    { "id": "NCIT:C9335" }
  ],
  "filtersLogic": "NCIT:C9335 & !NCIT:C5213"

does this query do what we expect with "excluded"?

D.

anuradhawick commented 2 months ago

@redmitry i think your solution should imply exclusion. It looks really good and intuitive.

Can we use indices in logic expression? There's a slim chance same ontology term or filter appearing in different scopes. For example "NCIT:123456" filter could be present across multiple scopes. So if it does this ontology code in filter logic becomes ambiguous.

Also complex queries like

A & B | !A & C has to repeat same term multiple times.

What are your thoughts on that?

redmitry commented 2 months ago

The scopes are defined in the filters[] array. The "filtersLogic" only defines the logic. I use "ids" (or better should be "labels") in "filtersLogic" because we may reuse the same filter in a complex query. The question is should "NCIT:C5213" (a subclass of "NCIT:C9335") be in "filters" also? By default descendant terms are enabled.

mbaudis commented 2 months ago

@redmitry Some notes:

Regarding the flag names: As Gordon pointed out, excluded has a bit of a "tested for that explicitly" meaning, stronger than "absent". So e.g. in the case of testing against male sex the !male could be interpreted as

Also, I really would prefer to separate the discussion of the filter flag from "how does this get implemented" ¯\_(ツ)_/¯

redmitry commented 2 months ago

@mbaudis

filters[0].id should be NCIT:C9335, not double-prefixed

fixed (typo).

This is a question of interpretation and what actually be generated for execution.

  1. For alphanumeric filters "! Weight > 25" is interpreted as "Weight <= 25"
  2. For ontological filters "!NCIT:C5213" "disease ne NCIT:C5213". We may include "and exists"

Note that in a query disease $in: ["NCIT:C9335", "NCIT:C4671", ...] and $ne NCIT:C5213 checks the disease field existence in the first "and" operand.

mbaudis commented 2 months ago

@redmitry My take here is:

jrambla commented 1 month ago

My 2 cents:

  1. We already have the excludedflag in phenotypicFeature schema, Its meaning is stated in PXF and "inherited" by Beacon.
  2. We don't have a syntax for expressing that excludedin the query. @tb143 we could need one
  3. I understand that we are planning to add the NOTboolean operator together with ORlogic. @tb143 please confirm
  4. If we adopt the NOT operator, we could not be able to avoid users adding them to any filter (ontology or alphanum), and we shouldn't. The user should be allowed to say ! Weight > 25" as synonym for Weight <= 25. Whatever is clear to the user should be fine to us.
  5. I guess that, in any case, some sanityzation would be required in the implementations to avoid queries like Age< 250 that will return everything or playing with boolean logic to trick the safety measures, But this should go into the implementation not in the spec.
tb143 commented 1 month ago

Following a Filters Scout discussion on this issue, I will try to summarise what we are asking for votes on (👍,👎 or leave +1,-1).

Question: Do you agree that the ontology filter query options should include an excluded flag? Beacon will reuse the Phenopackets definition of excluded, essentially "tested for, but not found/observed".

Some considerations that have emerged from following the discussion:

jrambla commented 1 month ago

+1