Behavior of API queries against properties in lists of records

bussec commented 2 years ago

The schema contains several cases in which top-level objects contain an array of objects, e.g., Subject can contain several Diagnosis records. These lower-level objects then contain several properties. Assuming the following case (ignoring ontologies for the disease_* properties for simplicity):

* Subject
  * diagnosis
    (Diagnosis record 1)
      * disease_diagnosis: "rheumatoid arthritis"
      * disease_length: "20 years"
    (Diagnosis record 2)
      * disease_diagnosis: "pancreatic ductal adenocarcinoma"
      * disease_length: "6 months"

Now imagine a user who is looking for long-term survivors of pancreatic cancer (disease duration > 3 years) issues a query to /repertoire using the following payload:

{
  "filters":{
    "op":"and",
    "content": [
      {
        "op":"=",
        "content": {
          "field":"subject.diagnosis.disease_diagnosis",
          "value":"pancreatic ductal adenocarcinoma"
        }
      },{
        "op":">",
        "content": {
          "field":"subject.diagnosis.disease_length",
          "value":"3 years"
        }
      }
    ]
  }
}

Would the example Subject record above be returned for this query or not?

bussec commented 2 years ago

There is another aspect of this issue:

We can address the issue above by implementing a rule for parsing these requests which says: "If users provides filters for two properties, which are both in lower-level object, then assume that they want to test for these properties in each lower-level object separately and get a result if the whole filter matches in any low-level object".

However, if we do this, it would not be possible anymore to search across multiple lower-level objects, which in the example above would, e.g., be a query for subjects that have been diagnosed with "rheumatoid arthritis" AND "pancreatic ductal adenocarcinoma".

bcorrie commented 2 years ago

Would the example Subject record above be returned for this query or not?

I would think so - in the above case, the filter criteria are true.

I think the problem is that we don't have any mechanism to build queries that specify individual list elements. To do this you really need an identifier for each object in the list I think???

This can be solved relatively simply (logically at least) by searching for one criteria (disease_diagnosis = pancreatic ductal adenocarcinoma), getting a list of repertoire_ids, and then searching for the second criteria (disease_length > 3 years) along with the list of repertoire_ids.

This would return no results, which is correct based on what you are searching for. Yes, it is two searches, but is accurate and correct.

This is not intuitive and is currently not documented. 8-)

bussec commented 2 years ago

Notes from the Standard Call 2022-07:

Needs to be documented
"Local" behavior, i.e., awareness of nesting, is likely to be expected by the user.
Also "global" behavior is easier to implement on the client-side.

Now added a section to the ADC API Docs via #628, please review.

schristley commented 2 years ago

Mongo does support both, so if we'd like both to be an option then we can consider enhancing the ADC query language to support. Like $elemMatch for Mongo, we could add a new operator to provide the "local" behavior.

bcorrie commented 6 months ago

@bussec can you define what you mean by "local" and "global"

I am not sure I understand what you mean by this.

bcorrie commented 6 months ago

From https://github.com/airr-community/airr-standards/pull/628#discussion_r1487645152

@bcorrie Yes, it reflects your comment, which however pre-dates our discussions on this topics in call 2022-08, which is what I was referring to. My recollection of the discussion was that "local" behavior is what most users would expect, and that therefore it is the way how the API should behave. In addition, the potential problems resulting from such an implementation with searches requiring a "global" behavior, should be documented and potential work-arounds be provided. Again, we can reverse this decision, but then this needs to be done explicitly, we should not just drop it.

Like I say, I don't get what you mean by "local" and "global"...

The bottom line is there is data in the ADC, and the ADC API queries data in a certain way currently.

This issue states that the ADC API behaviour is confusing when applied to objects that have arrays in them. The changes made through #628 are documenting this behaviour so that users understand how to accomplish specific outcomes given the current implementation.

I believe that the documentation now does that - at least to some degree??? 8-)

In addition, from a logical side, I have problems understanding why the procedure described in lines 346-348 should yield the desired outcome -- i.e., simulating "local" behavior, which would not return the example record -- as the first query would clearly return the repertoire containing this record, but so would the second query, even when filtering on those repertoires which were already returned by the first query.

Hmmm, I see what you mean. I agree that my suggested solution will not give the desired result. Might have to think about that a bit more.

bcorrie commented 6 months ago

I believe that with the current ADC API query specification, combined with the current AIRR Standard, the only way to find

"a single diagnosis record that has specific values for two fields within that single diagnosis record (e.g. disease_diagnosis.id and disease_diagnosis.disease_length)"

is to search for both fields using the ADC API and then process the JSON output to ensure that the desired values are contained in the same disease_diagnosis object.

This is not ideal for sure, but is the current state given both the current AIRR Spec and the current ADC implementation.

I believe this is what we need to document. I believe this would apply to any of the "array" objects within the specification.

Note that I am not sure practically speaking this is currently a significant issue in the ADC. Currently the iReceptor Turnkey does not allow you to load such data. The iReceptor data loader forces you to only have one entity in an array object. Where it is necessary to have more than one (e.g. pcr_target) we split these into multiple repertoires where there is only one. We have no studies where we have stored multiple diagnoses at the moment.

In looking at VDJServer, this appears to also be the case, although I thought that VDJServer had some repertoires with more than one pcr_target. @schristley can you confirm.

bcorrie commented 6 months ago

Part of the problem (at least in the case of diagnosis) is that our Diagnosis object is not particularly well defined. What we do have in some studies are diagnoses + interventions (immunization) + medical history but currently all of these are contained in our Diagnosis object. So we have not found a need, in the studies that we have curated, to have more than one Diagnosis (at least that I recall).

This will be partially resolved with #749 but I would not anticipate changing the ADC API query dramatically (adding another operator) until that event model is much farther along than it will be for v2.0.

So for v2.0 i think we probably need to document the current ADC API behaviour so the user knows what to expect but I don't anticipate having a solution for this general issue for v2.0.

schristley commented 6 months ago

In looking at VDJServer, this appears to also be the case, although I thought that VDJServer had some repertoires with more than one pcr_target. @schristley can you confirm.

I'm pretty sure we have multiples for all of the arrays with the exception of data processing.

schristley commented 6 months ago

I believe that with the current ADC API query specification, combined with the current AIRR Standard, the only way to find

"a single diagnosis record that has specific values for two fields within that single diagnosis record (e.g. disease_diagnosis.id and disease_diagnosis.disease_length)"

is to search for both fields using the ADC API and then process the JSON output to ensure that the desired values are contained in the same disease_diagnosis object.

Mongo does support queries like this using the $elemMatch operator. If we wanted the ADC to support them then we could provide a similar operator.

bcorrie commented 6 months ago

I'm pretty sure we have multiples for all of the arrays with the exception of data processing.

@schristley can you provide an example. I would like to confirm what the current behaviour is for such a search. I can't do this on any of our repositories because we don't have any such data.

bcorrie commented 6 months ago

Mongo does support queries like this using the $elemMatch operator. If we wanted the ADC to support them then we could provide a similar operator.

I think we probably should, but it isn't something that I would want to have for v2.0. I would lump this change in with implementing isa operators for ontologies (#234). Based on our discussions around AKC needs, I think we would be better served to do this well with a follow on release after v2.0 than rush something for the v2.0 release.

bussec commented 6 months ago

@bcorrie Apologies for not specifying "global/local" in more detail, I was convinced we had more text on this in the minutes, but that is obviously not the case. So, given a nested record like the one above, a search that tests at least two properties of the same (nested) object class (like Diagnosis) connected by an AND operator would exhibit

Global behavior, if it ignores the grouping of properties in multiple sub-records (i.e., instances of an nested object class) and thereby treats a nested record as flat when it comes to testing AND-connected conditions, and
Local behavior, if it restricts the testing the AND-connected conditions to the properties present in the same sub-record.

Based on this definition API had currently global behavior, while the example on top would require a local behavior to produce the desired results.

schristley commented 6 months ago

I'm pretty sure we have multiples for all of the arrays with the exception of data processing.

@schristley can you provide an example. I would like to confirm what the current behaviour is for such a search. I can't do this on any of our repositories because we don't have any such data.

Any of the paired chain (10X) studies will have multiple PCR targets, here is one:

curl https://vdjserver.org/airr/v1/repertoire/3152591611150069269-242ac117-0001-012

Eugster's study is one that I remember doing multiple sequencing for coverage and will have multiple sample entries in the array, here is one of the repertoires from it:

curl https://vdjserver.org/airr/v1/repertoire/7717850763340026346-242ac113-0001-012

And for some of the Adaptive COVID studies, there was multiple rows of metadata that got turned into multiple Diagnosis entries. These are pretty customized with lots of extra fields outside of MiAIRR, but something that might be good use cases for the new event stuff. Here's a repertoire:

curl https://vdjserver.org/airr/v1/repertoire/4114260494712237590-242ac113-0001-012

bussec commented 6 months ago

@bcorrie @schristley

I believe that with the current ADC API query specification, combined with the current AIRR Standard, the only way [...]

Please correct me if I am wrong with the following assessment, but I am hesitant to subscribe to the idea that the global behavior is an inevitable result of the current specifications, as they - besides the recently added text describing the actual behavior - do not make any definitive statements on this topic. Instead, the current behavior is a result of decisions made during the implementation by iReceptor and VDJServer in the absence of a specification how these cases should be handled. Don't get me wrong, there a probably a dozen good reasons why you made the decision this way (and also strict local behavior produces some quirks) and this is not criticizing these decisions, but it is just not the case that the specs would not have allowed for an alternative implementation.

Having said this, how difficult would it be to implement local behavior and where are the potential problems? Is this a questions of how data is handled by MongoDB, how your internal schemas look like or the code handling API requests? The second question is just out of curiosity, my basic point here would be that if you don't see any realistic chance of such a change getting implemented, we just should take it off the table completely for any v2 release.

bcorrie commented 6 months ago

@bcorrie @schristley

I believe that with the current ADC API query specification, combined with the current AIRR Standard, the only way [...]

Please correct me if I am wrong with the following assessment, but I am hesitant to subscribe to the idea that the global behavior is an inevitable result of the current specifications, as they - besides the recently added text describing the actual behavior - do not make any definitive statements on this topic. Instead, the current behavior is a result of decisions made during the implementation by iReceptor and VDJServer in the absence of a specification how these cases should be handled. Don't get me wrong, there a probably a dozen good reasons why you made the decision this way (and also strict local behavior produces some quirks) and this is not criticizing these decisions, but it is just not the case that the specs would not have allowed for an alternative implementation.

As per usual, you are correct. 8-)

What I should have said was "... with the current ADC API query specification and its current implementation, combined with the current AIRR Standard the results might be confusing"

What I do know is that the iReceptor implementation on the data in the iReceptor Turnkey based repositories will return correct results no matter what the interpretation. This is because we don't allow arrays of length > 1 when data is loaded into the iReceptor Turnkey so I don't think you can generate an ambiguous query that would give confusing results in this case.

I am not sure we know which behaviour you would get when you query VDJServer, and since it isn't specified in the documentation what the behaviour should be, it is also not wrong 8-)

Having said this, how difficult would it be to implement local behavior and where are the potential problems? Is this a questions of how data is handled by MongoDB, how your internal schemas look like or the code handling API requests? The second question is just out of curiosity, my basic point here would be that if you don't see any realistic chance of such a change getting implemented, we just should take it off the table completely for any v2 release.

I didn't mean to imply there wasn't a realistic chance of such a change getting implemented.

I don't think it would be hard to specify and implement one, the other, or both behaviours. I just don't see the likelihood of this being implemented in time for a June v2.0 release. I would prefer to have a June v2.0 release than delay the v2.0 the release to add this feature.

I think this is practical for a v2.x release. I do see that there might be some challenges around this if our Event object becomes more complicated. I also see it likely that an new Event model being something that might be in a 2.1 release and that it would make sense to do these together.

So I would suggest removing this from the AIRR 2.0 milestone and adding it to 2.1. If our Event model gets more sophisticated in a 2.1 release then we should make sure that any query implementation meets the requirements of such an Event model.

I admit I am curious to try this on VDJServer and see what happens 8-)

schristley commented 6 months ago

@bcorrie @schristley

I believe that with the current ADC API query specification, combined with the current AIRR Standard, the only way [...]

Please correct me if I am wrong with the following assessment, but I am hesitant to subscribe to the idea that the global behavior is an inevitable result of the current specifications, as they - besides the recently added text describing the actual behavior - do not make any definitive statements on this topic. Instead, the current behavior is a result of decisions made during the implementation by iReceptor and VDJServer in the absence of a specification how these cases should be handled. Don't get me wrong, there a probably a dozen good reasons why you made the decision this way (and also strict local behavior produces some quirks) and this is not criticizing these decisions, but it is just not the case that the specs would not have allowed for an alternative implementation.

You are half right and half wrong. You are half right because looking at the query language in isolation, your query is ambiguous (particular if you are thinking of an SQL data model). However you are half wrong, because the query language needs to be understood in the context of JSON documents (not SQL tables), so the default interpretation is how Mongo (or Postgresql on JSON documents if you prefer) would interpret the query (all the ADC operators are direct mappings to Mongo query operators), with the implication that the "local" behavior you want is not supported by the ADC API. But that's my opinion and I've been known to be wrong, sometimes, during rare harvest moons when the planets align... ;-D

Regardless, that's all besides the point because we'd really like to support that "local" behavior and...

Having said this, how difficult would it be to implement local behavior and where are the potential problems? Is this a questions of how data is handled by MongoDB, how your internal schemas look like or the code handling API requests? The second question is just out of curiosity, my basic point here would be that if you don't see any realistic chance of such a change getting implemented, we just should take it off the table completely for any v2 release.

It is not hard. We just need to add another operator just like Mongo. Your query stays at the default "global" interpretation because that is the most direct mapping to Mongo behavior. Let's say the new operator is the same as the Mongo operator, elemMatch, then the local query might be something like this:

{
  "filters":{
    "op":"elemMatch",
    "content": [
      {
        "op":"=",
        "content": {
          "field":"subject.diagnosis.disease_diagnosis",
          "value":"pancreatic ductal adenocarcinoma"
        }
      },{
        "op":">",
        "content": {
          "field":"subject.diagnosis.disease_length",
          "value":"3 years"
        }
      }
    ]
  }
}

But we would have to experiment about the mapping to a query. If this isn't exactly right then it might need to be something closer to this where the array field is specified:

{
  "filters":{
    "op":"elemMatch",
    "context":"subject.diagnosis",
    "content": [
      {
        "op":"=",
        "content": {
          "field":"disease_diagnosis",
          "value":"pancreatic ductal adenocarcinoma"
        }
      },{
        "op":">",
        "content": {
          "field":"disease_length",
          "value":"3 years"
        }
      }
    ]
  }
}

schristley commented 6 months ago

Mongo does support queries like this using the $elemMatch operator. If we wanted the ADC to support them then we could provide a similar operator.

I think we probably should, but it isn't something that I would want to have for v2.0. I would lump this change in with implementing isa operators for ontologies (#234). Based on our discussions around AKC needs, I think we would be better served to do this well with a follow on release after v2.0 than rush something for the v2.0 release.

I agree.

bcorrie commented 6 months ago

I have created an issue (#751) for reviewing the docs once more for the v2.0 release, given this discussion. I have also moved this to be an 2.1 issue.

airr-community / airr-standards

Behavior of API queries against properties in lists of records #623