DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0
7 stars 2 forks source link

Reconsider "and" logic for range-based ES filters #1833

Open nadove-ucsc opened 4 years ago

nadove-ucsc commented 4 years ago

The contains, intersects, and within filtering operators currently combine multiple ranges using logical conjunction, whereas the is operator uses disjunction, which would appear more useful and intuitive.

This means that, for example, the filter

{
  "cellCount": {
    "within": [
      [10000,10000],
      [0,0]
    ]
  }
}

Will return nothing, since the two ranges do not overlap.

Hannes: "And I think this [use of conjunction] is a bug. For "within" this really should be "or" logic. For "intersects" and "contains" I'm not sure."

nadove-ucsc commented 4 years ago

This behavior would seem to be due to each range-based filter being split into multiple distinct filters, one for each provided range. And separate filters are (intentionally) combined using conjunction. https://github.com/DataBiosphere/azul/blob/2bc22c62f39067524870ae37a14d8f8b1891c907/src/azul/service/elasticsearch_service.py#L128

hannes-ucsc commented 4 years ago

Let D be a range in the document and R₁ and R₂ be ranges in the filter. A conjunction can expressed by combining the filter ranges:

D ⊆ R₁ ∧ D ⊆ R₂ ⟺ D ⊆ R₁ ∩ R₂

Therefore, within should be disjunctive.

I'm unsure about contains and intersects.

nadove-ucsc commented 4 years ago

Exactly the same logic applies to contains, just swapping subset for superset and intersection for union. This is most easily demonstrated visually using a number line:

image

Obviously, the document range (in red) contains all of the filter ranges (blue) iff it contains the lowest of the lower filter bounds and the highest of the upper filter bounds.

nadove-ucsc commented 4 years ago

And a similar argument applies for intersects as well: D intersects [a, b], [c, d], and [e,f] iff it intersects [b, e] (taking the lowest of the upper bounds and the highest of the lower bounds).

So I believe they should all be disjunctive.