gbif / occurrence

Occurrence store, download, search
Apache License 2.0
22 stars 15 forks source link

"NOT" predicate also excludes NULL values #244

Open ManonGros opened 3 years ago

ManonGros commented 3 years ago

This is rather counter intuitive. For example, in this download: https://www.gbif.org/occurrence/download/0233791-200613084148143, the user wanted to exclude invasive species:

{
      "type": "not",
      "predicate": {
        "type": "equals",
        "key": "ESTABLISHMENT_MEANS",
        "value": "INVASIVE",
        "matchCase": false
      }
    }

But I don't think he realises that this also excluded all the records for which the establishment mean isn't provided.

The NOT predicate should also return NULL values (https://stackoverflow.com/questions/5658457/not-equal-operator-on-null): establishmentMeans != 'INVASIVE' OR establishmentMeans IS NULL

fmendezh commented 3 years ago

Had a look into how that is translated into Elasticsearch and Hive queries and found this:

  1. Elastic returns 1,697,395,910 (include null values)

    {
    "query": {
        "bool": {
            "must_not": {
                "term": {"establishmentMeans": "INVASIVE"}
            }
        }
    }
    }
  2. Hive returns 19,821,484 (since in ANSI SQL that is translated into something with a value different to 'INVASIVE', exclude nulls )

    select count(*) from occurrence where NOT (lower(establishmentmeans) = lower('INVASIVE'));
    or
    select count(*) from occurrence where establishmentmeans != 'INVASIVE';

There's an easy but not complete fix, and it is to check if the parent predicate is a NotPredicate add the IS NOT NULL, but I guess that more complex and nested predicates need a different transformation

MattBlissett commented 1 year ago

We should probably change Hive to use the ES-like behaviour (X != 'Y' OR X IS NULL) but if there are edge cases we should consider these first.

Whatever we do, it will be important to update the documentation.