elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Geohash aggregation optimization with multifields not working as expected #28875

Closed mveitas closed 5 years ago

mveitas commented 6 years ago

Elasticsearch version (bin/elasticsearch --version): 6.0.0 (I'll check if this is an issue with 6.2.x)

JVM version (java -version): Java 8

Description of the problem including expected versus actual behavior: I spoke with @nknize at Elasticon about ways to improve the performance of the Grid Aggregation by avoiding the need to compute the geohash values from the lat, lon (https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/search/aggregations/bucket/geogrid/GeoGridAggregationBuilder.java#L246). Over large datasets, this line can consume a considerable amount of processing time: my benchmarks showed about 20ns per call, which over a dataset of 20M documents adds up to roughly 400ms of processing time.

One of the suggestions he made was to use multifields on the geo_point property and apply the geohash aggregation to gisPoint.geohash.

Mapping file:

curl -XPUT "http://localhost:9200/geohash_agg_test" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "location": {
      "dynamic": "false",
      "properties": {
        "type": {
          "type": "keyword"
        },
        "city": {
          "type": "keyword"
        },
        "gisPoint": {
          "type": "geo_point",
          "fields": {
            "geohash": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}'

Sample data to insert:

curl -XPOST "http://localhost:9200/geohash_agg_test/location/_bulk?refresh" -H 'Content-Type: application/json' -d'
{"index":{"_id":1}}
{"gisPoint": "52.374081,4.912350", "city": "NEMO Science Museum"}
{"index":{"_id":2}}
{"gisPoint": "52.369219,4.901618", "city": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"gisPoint": "52.371667,4.914722", "city": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"gisPoint": "51.222900,4.405200", "city": "Letterenhuis"}
{"index":{"_id":5}}
{"gisPoint": "48.861111,2.336389", "city": "Musée du Louvre"}
{"index":{"_id":6}}
{"gisPoint": "48.860000,2.327000", "city": "Musée d\"Orsay"}
'

Running the following query should then work:

curl -XPOST "http://localhost:9200/geohash_agg_test/_search?size=0" -H 'Content-Type: application/json' -d'
{
    "aggregations" : {
        "grid" : {
            "geohash_grid" : {
                "field" : "gisPoint.geohash",
                "precision" : 3
            }
        }
    }
}'

The resulting error is the following. @nknize mentioned that this might have been broken during the aggregation refactor:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Expected geo_point type on field [gisPoint.geohash], but got [keyword]"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "geohash_agg_test",
        "node": "5Kc3KAPOTfSZXHIMadD5cg",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Expected geo_point type on field [gisPoint.geohash], but got [keyword]"
        }
      }
    ]
  },
  "status": 400
}
javanna commented 6 years ago

cc @elastic/es-search-aggs

gmoskovicz commented 6 years ago

CC @nknize

talevy commented 6 years ago

Hi @mveitas,

Although I did not speak with you about this at Elasticon, Nick mentioned it to me afterwards, and I think I understand what you are trying to do.

Currently, any multifield of a geo_point field is populated with the geohash value of that geo_point. This means your gisPoint.geohash multifield will be of type keyword and will hold the geohash of gisPoint. The geohash is calculated at index time, so any aggregations done on it at query time do not need to calculate the geohash anew. This means the field can be treated like a regular term, and a terms aggregation on this multifield produces the same outcome as a geohash_grid aggregation on the original field.

Here is a Console script that shows what I mean:

PUT geohash_agg_test
{
  "mappings": {
    "location": {
      "properties": {
        "gisPoint": {
          "type": "geo_point",
          "fields": {
            "geohash": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

POST geohash_agg_test/location/_bulk
{"index":{"_id":1}}
{"gisPoint": "52.374081,4.912350", "city": "NEMO Science Museum"}
{"index":{"_id":2}}
{"gisPoint": "52.369219,4.901618", "city": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"gisPoint": "52.371667,4.914722", "city": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"gisPoint": "51.222900,4.405200", "city": "Letterenhuis"}
{"index":{"_id":5}}
{"gisPoint": "48.861111,2.336389", "city": "Musée du Louvre"}
{"index":{"_id":6}}
{"gisPoint": "48.860000,2.327000", "city": "Musée d\"Orsay"}

GET geohash_agg_test/location/_search
{
  "size": 0,
  "aggs": {
    "my_geohash": {
      "geohash_grid": {
        "field": "gisPoint",
        "precision": 3
      }
    }
  }
}

GET geohash_agg_test/location/_search
{
  "size": 0,
  "aggs": {
    "my_hack": {
      "terms": {
        "script": {
          "source": "doc['gisPoint.geohash'].value.substring(0, params.precision)",
          "params": {
            "precision": 3
          }
        }
      }
    }
  }
}

Both the my_hack and my_geohash aggregations will result in the same bucket values.

Let me know if this is what you were thinking!

mveitas commented 6 years ago

@talevy When you add the multifield property, the geohash value is stored as you said and you can aggregate on it, but you will get very fine-grained buckets (each bucket is the full 12-character geohash). The only way around this is to use a script and perform a substring on the value, which incurs the runtime cost of executing that substring.
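
For reference, a plain terms aggregation on the multifield (no script) just buckets on the full-precision geohash values, something like:

GET geohash_agg_test/location/_search
{
  "size": 0,
  "aggs": {
    "full_precision_buckets": {
      "terms": {
        "field": "gisPoint.geohash"
      }
    }
  }
}

which here returns essentially one bucket per document instead of the three-character grid cells.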

Maybe I misunderstood what was actually happening and thought that ES would be able to give me the geohash at the requested precision from the stored term, which might be tokenized.

With document sets of 20-30M, my goal is to avoid any computation of the geohash value at query time. I think I am going to fall back to using more storage by creating 12 properties (geohash1, geohash2, geohash3, etc.) that represent each zoom level, and then running a terms aggregation on the field for the zoom level we are looking at (see the sketch below). Search latency is critical because this is used in an interactive mapping application, not to mention the CPU required to do the computations.
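
Roughly the shape I have in mind (the index and field names below are just placeholders, and the prefix values would be computed client-side before indexing):

PUT geohash_agg_test_precomputed
{
  "mappings": {
    "location": {
      "properties": {
        "gisPoint": {
          "type": "geo_point"
        },
        "geohash3": {
          "type": "keyword"
        },
        "geohash5": {
          "type": "keyword"
        },
        "geohash7": {
          "type": "keyword"
        }
      }
    }
  }
}

GET geohash_agg_test_precomputed/location/_search
{
  "size": 0,
  "aggs": {
    "grid_zoom_3": {
      "terms": {
        "field": "geohash3"
      }
    }
  }
}

A terms aggregation on geohash3 should then return the same buckets as a geohash_grid aggregation with precision 3, without any per-document work at query time.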

talevy commented 6 years ago

@mveitas I see. That is one solution to avoid any extra cycles spent computing the bucket value.

Do you still think something is broken with the way multifield values are expected to work on geo_point fields?

mveitas commented 6 years ago

Based on the conversation with @nknize I thought this was a regression in how the multifield value works with geo_point fields. Looking back at the history, though, I haven't been able to find a point where the multifield could be used within the geohash aggregation with a precision to return the correct value.

talevy commented 6 years ago

@mveitas, you're right, Nick made it seem like that was not intentional, but it is how it works now, so it is hard to call this a bug. Do you see anything wrong with how it works now, though?

mveitas commented 6 years ago

@talevy There is nothing wrong with how it works today as documented, but there is certainly room for improvement if one is willing to make the tradeoff of using more storage space. I'll leave it up to your team to determine whether any improvements will be made. In the meantime I am going to pre-compute the geohash values for each level in separate fields and aggregate on those to avoid the CPU hit on larger data sets.

imotov commented 5 years ago

So, I guess the proposal here is to make the GeoHash Grid aggregation work with keyword fields that are treated as geohashes, or to add an option to the terms aggregation to only consider the first n characters of each term, or to add a specialized prefix aggregation like the one discussed 3+ years ago in #10408. Back then we decided not to do it and offered a solution that was basically what @mveitas is planning to do: index prefixes of different lengths several times. Now it's even easier to implement using an ingest node (see the sketch below). Should we revisit this again?
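
Something along these lines, for example (assuming the document already carries a full-precision geohash string computed client-side; the pipeline and field names are only illustrative):

PUT _ingest/pipeline/geohash_prefixes
{
  "description": "Copy prefixes of a client-supplied geohash into per-precision fields",
  "processors": [
    {
      "script": {
        "source": "for (int p = 1; p <= 12; p++) { ctx['geohash' + p] = ctx.geohash.substring(0, p); }"
      }
    }
  ]
}

Documents indexed with ?pipeline=geohash_prefixes would then get geohash1 through geohash12 fields, and a terms aggregation on the field matching the requested precision avoids any geohash work at query time.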

elasticmachine commented 5 years ago

Pinging @elastic/es-analytics-geo

iverase commented 5 years ago

We discussed this issue today, and we don't think that doing the substring on the geohash as part of the aggregation code would bring significant improvement compared to doing it with a script. If you are not willing to pay that price at query time, then you need to index your geohash at different levels of precision and pay the price at index time.