ga4gh-beacon / beacon-v2

Unified repository for the GA4GH Beacon v2 API standard
Creative Commons Zero v1.0 Universal
27 stars 22 forks source link

Question: AND/OR combined filter logic #100

Closed reisingerf closed 8 months ago

reisingerf commented 1 year ago

Hi all, a quick question about supported/expected filtering logic, especially for ontology terms:

Reading the documentation I understand that the default filtering logic for multiple filters is AND (and that makes total sense), but there are more complex use cases.

For example: I am interested in data from male individuals that have one of 3 diseases (which are not hierarchically related).

Currently, I'd query multiple times and do the intersection/combination of results on the client side to get to my final answer.

I was wondering if there is a defined/supported way to express a logic that combines AND with OR logic to perform more complex queries on the server side. Perhaps I just overlooked it in the docs or there's a semi-official solution?

My particular view point is with regards to beacon network and UI, with the following two opposing issues:

I have seen implementations using multiple ontology terms (space separated) per filter allowing an OR logic for terms within a filter and AND logic across filters. But I don't know if that is to be supported.

As I am fairly new to the beacon world I was hoping this has come up elsewhere and I could get some guidance/advice from more experienced people. Any and all feedback is much appreciated! Thanks!

FYI: @victorskl @anuradhawick

mbaudis commented 1 year ago

@reisingerf This is IMO one of the areas where we need an addition to the current behaviour which not necessarily means that we have to change the protocol itself. I can come up easily with 4 scenarios:

  1. develop a standard for boolean expression between filtering terms
  2. document a grouping behaviour for an OR syntax
  3. document a specific use of alternative logic to AND
  4. support a global switch for logic

For me no. 2. and 3. look more attractive than 1. In fact, we already follow 3. in bycon / Progenetix: Terms which are applied to the same parameter are automatically treated as OR, which allows

one of 3 diseases (which are not hierarchically related)

This could be documented as default if we could agree in the wider Beacon community and should be rather easy to support server side.

Option 3. (e.g. doing something like NCIT:C8383|NCIT:C9003 or with something like "$or": [...] in POST) has potential but requires changes, as would do a more extensive version (1.). I'd personally need to get strong arguments since this might run into problems with adoption/implementation which are hard to foretell. Option 4. is probably easy (we actually offer this through bycon / Progenetix ..., too) but I'm myself not convinced of its need (and yes, one always can aways dream up a specific case...).

Summary: I would love to see a call for / discussion of these (and other) scenarios with a my personal support at the start going to 3.

@tb143 @redmitry @costero-e

redmitry commented 1 year ago

Hi all,

Current specification e.g. individuals endpoint uses the notation filters=A&filters=B&filters=C (spec)

"The default serialization method is style: form and explode: true"

The reference implementation supports the notation filters=A,B,C not sure about the "exploded" one.

Note that notation filters=A,B,C is out of scope of the HTTP parameters spec. Nevertheless, it is defined as the parameter pattern in the FRC6570 {?filters}. Technically, the pattern '{?filters}{&filters*}' will result in the query ?filters=A,B,C&filters=D&filters=E as [A or B or C] and D and E.

[edit]
To do not break current implementations, pipe for "OR" is a good solution IMO

Regarding the POST queries, we can extend FilteringTerm with "either": "boolean" property, where "either": true would mean comma or 'OR'.

'filters: [{"id": "A"}, {"id": "B", "either": true}, {"id": "C", "either": true}, {"id": "D"}]'

This way we won't break too much the spec...

Best,

Dmitry

mbaudis commented 1 year ago

We have documented comma-concatenation for parameters as

Attention "List Parameters in GET Requests"

Since the direct interpretation of list parameters in queries is not supported by
some server environments (e.g. PHP, GO…), list parameters such as `start` and `end`
should be provided as **comma-concatenated** strings when using them in GET requests.

There is a reason in that some implementations do not resolve multiple instances of the same parameter into a list but only keep one value; and that comma-concatenation (or ;) is a common format for list parameters in GET. We actually allow both (i.e. first concatenate all list parameter instances and then split). I don't think there is any reference for using commas as expressing OR logic. You example above

?filters=A,B,C&filters=D&filters=E

would be interpreted by bycon as

{$and: [ {par1: A}, {par2: B}, {par3: C}, {par4: D}, {par5: E} ] }

... which is IMO correct (though as mentioned above we would use OR internally for terms hitting the same par which is not yet per spec).

But I see something similar as an option to define OR using another character than , or ; (e.g. | => %7C):

?filters=A|B|C&filters=D&filters=E

or

?filters=A|B|C,D,E

... which would correspond to my number 2. above.

Still, before introducing new syntax I'd be very much in favour for solving the most common OR example by documenting the expected behaviour (i.e.

Filter values against the same parameter in the model are expected to be treated as OR

... since AND would lead to failing queries if values not in same expansion tree).

redmitry commented 1 year ago

?filters=A,B,C&filters=D&filters=E would be interpreted by bycon as '{$and: [ {par1: A}, {par2: B}, {par3: C}, {par4: D}, {par5: E} ] }'

The same for the current BSC implementation (just unroll A,B,C)

But I see something similar as an option to define OR using another character than , or ; (e.g. | => %7C):

?filters=A|B|C&filters=D&filters=E or ?filters=A|B|C,D,E

I am OK with this convention, unfortunately no way to express it in the OpenAPI ;:-(

In the OpenAPI it is either ?filters=A&filters=B or ?fileters=A,B or ?filters=A|B So we probably should not define filters as an array (it would be one "filters" parameters with a query we parse).

anuradhawick commented 1 year ago

@redmitry and @mbaudis I have been thinking about a solution with filter groups. For example

{
  ...
  "query": {
    ...
    "filters": [
      // filter group 1 (treated as OR amongst them)
      [
         {
             "id": "MONDO:0004126”
         },
         // OR
         {
             “id”: “MONDO:0004979”
         }
      ],
      // filter group 2 (treated as OR amongst them, AND between the groups)
      // AND
      [
         {
            "id": "NCIT:C20197"
         }
      ]
    ],
    ...
  }
}

This would be easier for parsing/validating. Filter groups are ANDed and filters inside them are ORed.

This, unfortunately, is not spec compliant but can be supported alongside the specification. Detection of filter groups should not be difficult. This may be translated in GETs as follows too;

?filters=MONDO:0004126|MONDO:0004979,NCIT:C20197
redmitry commented 1 year ago

This, unfortunately, is not spec compliant but can be supported alongside the specification.

My concern is how many compatible implementations will fail parsing this. I may be biased, but I would prefer just to add a logic property instead of grouping.

mbaudis commented 1 year ago

Compatibility: Yes, and therefore I go back to my suggestion to document a practical & logical use scenario as:

While the Beacon v2 framework defines the general chaining of multiple filter values with an AND logic, filters applied against the same parameter are expected to be treated as OR[^1]

This clarification then serves the discussed scenarios and could - by agreement - directly find its way into the documentation - anybody want to set up a poll? However, since we can always invent more complex scenarios I think a design for filter grouping would be an option for another level of discussion. There are other outstanding issues - e.g. a general excluded/negated flag so this is something for v2.n (but ASAP).

[^1]: AND chained filters against the same parameter will result in a null or erroneous result:

costero-e commented 1 year ago

Hi everybody, I hope I'm not too late to the conversation. I think we can make a poll on this question as @mbaudis suggested, I will start it later today. In my opinion, think I would implement the either parameter for filters request (as @redmitry suggested) more than specifying that the first filters parameter to refer to AND and the second filters parameter to an OR. I think it would be a bit confusing to have several "filters" parameters with a specific meaning. It would make things more difficult to develop and also for users to understand its application (it's less intuitive than using an "either" parameter). On the other hand, I think @anuradhawick proposal on the GET ?filters=MONDO:0004126|MONDO:0004979,NCIT:C20197 would be clear and aligned with this "either" parameter on the POST request, so I would go for this option, too.

mbaudis commented 1 year ago

I'm just not clear how the POST would look - like this?

"filters": [
  { "id": "NCIT:C20197"},
  { $or: [
    { "id": "MONDO:0004126"},
    { "id": "MONDO:0004979"}
  ]
]

... with the root boolean being the AND (which could be modified through a global directive - separate issue).

The MongoDB syntax here (used for modelling. YMMV) could also be rewritten (though slightly different meaning - i.e. you have one filter w/ an alternative id parameter instead of alternative filters):

"filters": [
  { "id": "NCIT:C20197"},
  { "id":
    {
      $in: [
        "MONDO:0004126",
        "MONDO:0004979"
      ]
    }
  }
]
anuradhawick commented 1 year ago

The approach with $in seems to carry great potential as the spec could eventually $not-in or even more complex queries ($all, $starts-with).

I wish we'll steer well away from the mongo syntax or flavour. Because beacon spec could allow custom operators to be used in this place which could cause confusion with mongo syntax.

anuradhawick commented 1 year ago

Like in it would be great to be able define operators inside the beacon. Users may be able to fetch them from an endpoint like operators.

jrambla commented 1 year ago

Answering the original question from @reisingerf: The spec only supports AND operators. If any implementation, reference one or another, does otherwise, it is just an extension ( that I would not recommend at this point of Beacon adoption, when strickly sticking to the spec would bring clarity to the community ) Having said so, during the v2 design phase we decided to leave the "OR" discussion for later (after v2.0, I mean), as we would like to gather feedback on what implementers prefer to do. I believe that "us" (in here) could not be representative of the "average implementer", but it is very good to trigger that conversation and include all the aspects: compatibility with standards (REST, HTTP...) and easy of implementation, i.e. how servers manage the suggested syntax for "OR" expressions.

My 2 cents

jrambla commented 1 year ago

About which syntax to use, I would advocate to what HTTP spec says (if anything). Alternatively, something popular like OpenAPI could be the reference. Again, remember that we tried to avoid complex queries with groupings, etc. as we could end up asking the Beacon implementers to develop a query parser and a query engine.

mbaudis commented 1 year ago

as we could end up asking the Beacon implementers to develop a query parser and a query engine

But ... that is a given? The moment you use filters you need to be able to apply filters to different entry types which basically implies aggregation of queries / map-reduce ...

reisingerf commented 1 year ago

Sorry to chime in, I haven't been working on this so please take it as an outsider's view:

I'd personally stay away from complex query/search options. The beacon was initially and still is a look-up service, a quick way to find out if there's something potentially interesting out there. It's not too complex and does not require a huge and complex infrastructure.

As soon as you expand into a complex search you'd end up with tons off issues/questions:

I don't think you could address the needs of all clients anyway, you'd always play catch-up with new requirements/use cases. It's also the clients that knows best what they want.

To me it's more important that a client can get to the data consistently across many implementations/beacons, rather than running complex queries. If I wanted the complexity I'd probably get all the data and do it locally anyway.

Mind you, there may be a middle way to cater for most (common) use cases, but...

In the end, would more complex queries really help with beacon adoption?

jrambla commented 1 year ago

Totally agree

mbaudis commented 1 year ago

After yesterday's Beacon meeting & general agreement about this I've updated the docs to emphasize the "currently always AND (but logically process OR for terms against same parameter, w/o special syntax)".

Updated documentation:

redmitry commented 1 year ago

We can implement "plain" logic following the simple rule:

  1. NO "logic" means either open or close the AND group.
  2. "and" or "or" logic continue the group.

The java implementation now have this scheme implemented:

"filters": [
     {
         "id": "NCIT:C20197"
     },
     {
         "id": "Weight",
         "operator": ">",
         "value": "80"
     },
     {
         "id": "BMI",
         "operator": ">",
         "value": "25",
         "logic": "or"
     }
]

"NCIT:C20197" "Weight>100" |"BMI>30" => "NCIT:C20197" & ("Weight>100" | "BMI>30") The GET equivalent considers comma separated filters as "OR":

[https://] beacons.bsc.es/beacon/v2.0.0/individuals?filters=NCIT:C20197&filters=Weight>100,BMI>30

Best,

Dmitry

zykonda commented 8 months ago

@redmitry The simplification offered does not solve the need of the logical condition grouping. Below are the examples of "simplified logic" applied (using your Python encoder): statement_1 = 'A & ( B | C ) | D' print(encode(statement_1)) => #output:A B |C |D statement_2 = 'A & ( B | C | D )' print(encode(statement_2)) => #output:A B |C |D flat_statement = 'A B |C |D' print(decode(flat_statement)) => #output:A & ( B | C | D ) Notice, that statements 1 and 2 parsed into the same expression (which is incorrect). And a flat statement gets an unwanted grouping.

If the flat form is critical for the logical expression, the Polish notation should be used to encode logic and decode it on the backend.

redmitry commented 8 months ago

@zykonda

statement_1 = 'A & ( B | C ) | D'

equals '(A & ( B | C )) | D' and is unresolvable under the proposed 'flat' rules. I modified the python script to do not start automatically the group on the first param. Nevertheless, 'A & ( B | C ) | D' may be rewritten as 'D | A & ( B | C )' eq. D | (A & ( B | C )) '&D |A B |C' provides desired logic.

That's true, that proposed "flat" logic may not provide complex grouping.

Kind regards,

D.

anuradhawick commented 8 months ago

@redmitry and @mbaudis we implemented something along the lines you are discussing here. This is implemented for searching variants and associated datasets. But I have a strong feeling the approach might contribute here as well. We created a mutation format that is as follows.

Example query: (!A23403G:(C150156G:T567765G):!C789T)|!C150156G Essentially, ! translates to negation, : to AND operator and | to OR operator. Brackets perform grouping.

Our original implementation could handle single queries like A23403G and whenever we have a complex query as above, we parse it (see the code below) and get an expression as follows (!0&(1&2)&!3)|!4, each number corresponds to index of the variant query. We send this over to API as an extra attribute queryCombination with the list of variant queries indicated by each index. This is our query parser from https://github.com/aehrc/PathsBeacon/blob/dev/modules/website/beaconApp/src/app/components/main/main.component.ts#L336

  parseMutationExpression(expression: string): [string, any[]] {
    const _expression = expression.replace(/\s/g, "");
    const varRegex =
      /^((?<refName>[0-9a-zA-Z]+)-)?(?<refBases>[a-z]+)(?<start>\d+)(?<altBases>[a-z]+)/i;
    const logicRegex = /[!&:()|]/;
    const queries: any[] = [];
    let combination: string = "";
    let index: number = 0;

    for (let pos = 0; pos < _expression.length; pos++) {
      const char = _expression[pos];

      if (char.match(logicRegex)) {
        // consume the groupings (parenthesis) and logic (!|)
        combination += char == ":" ? "&" : char;
      } else {
        // consume the mutation
        const remainder = _expression.slice(pos);
        const matches: any = remainder.match(varRegex);
        // must have a match starting here else skip
        if (matches) {
          queries.push(matches.groups);
          combination += index.toString();
          // jump over the match and increment index
          pos += matches[0].length - 1;
          index += 1;
        }
      }
    }

    return [combination, queries];
  }

Our approach is backwards compatible and does not break the existing workflow. Only shift gears with the extra queryCombination attribute when needed. This is actually done by our GUI hosted at https://github.com/aehrc/PathsBeacon.

For the metadata scenario, we can use a similar approach. We do not change the filters in GET nor POST requests. Requests would look like below

api/../..?filters=SNOMED:1234,HP:1234

We just have to add an extra query param as follows

api/../..?filters=SNOMED:1234,HP:1234&combination=1:2

Note we can URL encode to include these combination characters like & or |.

Similarly for POST requests we can use a queryCombination like field to specify how the filters must be combined to form the logic. This way we do not make any breaking changes to the schema of filters attribute; it remains a plain array easy to read and parse.

This is a slight change in protocol but should add immense flexibility for the community. My colleague and lead dev in this implementation @bhosking, can chime in for extended details if needed.

Very happy to hear your thoughts.

Cheers, Anuradha

redmitry commented 8 months ago

Hi all, Proposed "flat" logic has its limitations as it supports only "and" groupings and relies on standard logic operator precedence ("and" then "or") for "ors" (e.g. "A & B | C & D" = "(A & B) | (C & D)" ). It has the limitation of not permitting complex grouping with nested brackets.

Essentially, ! translates to negation, : to AND operator and | to OR operator. Brackets perform grouping.

Introducing negation would allow standard logic transformation rules and removal of nested brackets in expense of less readable logic.

@anuradhawick In my implementation, I actually transform the "flat" logic into the "standard" one before processing: https://docs.google.com/presentation/d/12cKU3eDTWq1lkkmp101mGxMon1MdRhYxzybr2wsN7FE/edit#slide=id.g2b86397cbd2_3_105 So I get something like (!0&(1&2)&!3)|!4 with only difference that I use 'F' for all parameters (since we already have an ordered list of filters).

As I mentioned in the presentation:

Technically, both forms may be considered for as the extension for the Beacon 2.0 without breaking current spec.

My point is that the "flat" logic is an additional step that could be hard to use and we can always opt for the additional property like "filtering_query".

Cheers,

Dmitry

mbaudis commented 8 months ago

@anuradhawick , all (This is a bit OT for the specific filters topic but since variants were used in the example & the general principle applies): I just want to point out that for variant queries IMO we will move to "typed" queries (genomicSequenceQuery, genomicRangeQuery ...) so each query will be evaluated of its correct composition separately & then one can chain the queries themselves.

@redmitry Per-filter negation is OMO a given & there is an issue somewhere (also aligning this w/ Phenopackets).

Also, since all variant query parameters can be provided as GET or POST, and not with any defined order, they should be evaluated separately for correct form & then checked against the (implicit or better future explicit) query type.

&variantQueryGroups=(genomicBracketQuery=(referenceName=refseq:NC_000009.12,start=21000001-21975098,end=21967753-23000000,variantType=EFO:0030067)&&(genomicSequenceQuery=(referenceName=refseq:NC_000009.12,start=21675032,alternateBases=A))

... without checking character compatibility :-) Easier to write in JSON, with a bit more complex example (here matching cases which have either a given protein sequence variant in CDKN2A OR a combination of a CNV DEL overlapping the gene together with a specific sequence alteration in the remaining Allele - I made this up, partially...):

"g_variants": {
  "variantQueryGroups": {
    "logic": "OR",
    "groups": [
      {
        "logic": "AND",
        "variantQueries": [
          {
            "genomicBracketQuery": {
              "referenceName": "refseq:NC_000009.12",
              "start": [21000001, 21975098],
              "end": [21967753, 23000000],
              "variantMinLength": 1000,
              "variantType": "EFO:0030067"
            }
          },
          {
            "genomicSequenceQuery": {
              "referenceName": "refseq:NC_000009.12",
              "start": 21675032,
              "alternateBases": "A"
            }
          }
        ]
      },
      {
        "variantQueries": [
          {
            "aminoacidAlterationQuery": {
              "geneId": "CDKN2A",
              "aminoacidAlteration": "D153Y"
            }
          }
        ]
      }
    ]
  }
}

So, using this I just want to caution against very specific implementations ... The above example is also designed to allow the legacy parameters directly in the g_variants object.

anuradhawick commented 8 months ago

@redmitry I have some concerns about the following from your slides.

The implementation first converts the “plain” form into the standard logic template: “A B |C” => “A & (B | C)” Filters names are irrelevant as we have “filters” array. Next step is collapsing brackets into sequential form: “A & (B | C)” => “A & D” In this steps a list of matching queries is generated ($match A, $or(B,C))

The conversion - “A B |C” => “A & (B | C)” seems a bit confusing. Why would it not be (A & B ) | C? , The usual order is NOT, AND then finally OR. Why not assume A & B then confusion vanishes.

This could be complex if we had A B | C | D would this be A & (B | C) | D or A & (B | C | D)?

May be what @mbaudis has proposed with query groups will solve this problem?

redmitry commented 8 months ago

@redmitry I have some concerns about the following from your slides.

Current procedure for filters is "AND". The main idea was introducing implicit grouping: "A B C D" => "(A) & (B) & (C) & (D)" => "A & B & C & D" The definition "logic" in the filter just continues the group

The conversion - “A B |C” => “A & (B | C)” seems a bit confusing. Why would it not be (A & B ) | C? ,

Because B has no "logic" it 1. default "AND"; 2. starting a group (parenthesis). C has explicit logic so it doesn't create a group.

The usual order is NOT, AND then finally OR. Why not assume A & B then confusion vanishes.

The problem is that I want to be able to provide groping like: ( A & B ) | ( C & D ) which is default logical behavior without brackets ( == A & B | C & D ). ( A | B ) & ( C | D ) needs grouping, so according proposed "rules" is A |B C |D

  1. A has no explicit logic so start grouping & ( A
  2. B has explicit logic ("OR") so continue the group (if any). & ( A | B 3, C has no logic, so 1, default "AND"; 2 start a new group (and close previous one). & ( A | B ) & ( C
  3. D has explicit logic ("OR") so continue the group. & ( A | B ) & ( C | D 5, END we still have an open group - close it! & ( A | B ) & ( C | D ) The first logic is just stripped and actually have no sense.

This could be complex if we had A B | C | D would this be A & (B | C) | D or A & (B | C | D)?

Following the "rules" A & ( B | C | D ) - as for D explicit logic is always continue the group.

As I already said this "rules" of grouping may not provide complex grouping, A & (B | C) | D may not be implemented "as is", but &D |A B |C == D | A & ( B | C )

Since I got many questions about the proposed "logic" even from technical people, I am afraid it is hardly applicable for "normal" people... I am biasing towards adding a common "filters_logic" field like F1 & ( F2 | F3) | F4 .

Cheers,

D.

zykonda commented 8 months ago

@redmitry

Since I got many questions about the proposed "logic" even from technical people, I am afraid it is hardly applicable for "normal" people...

I agree that the need to "translate" the logic statement in a specific form can potentially confuse API users and lead to unexpected results. Plus, I'm concerned about limited applicability.

@anuradhawick Your solution/proposal is more clear for users, though I would call the new parameter queryExpression (vs queryCombination). I have a question: how do you get around the use of non-standard URL encoding for GET operation?

The proposal from @mbaudis resembles DNF: (group of expressions joined by AND) OR (group of expressions joined by AND), from our proposal to use DNF or CNF.

anuradhawick commented 8 months ago

@zykonda for GET requests, we must to use URL encoding (usually handled by web frameworks to ensure safety, etc). This should not be huge problem as POSTMAN, etc support this out of box. At the moment though, we use POST requests from our GUI. For a GET request, the query expression (A&B|C)&D the URL would be.

{{API}}/?queryExpression=(A%26B|C)%26D&filters=A,B,C,D

& is the challenge, which is encoded to %26