elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Geoline aggregation - add simplification option #87903

Open nickpeihl opened 2 years ago

nickpeihl commented 2 years ago

Description

Asset tracking use cases, such as GPS beacons on vehicles, can create a lot of geo_points. When constructing a geo_line from a high-cardinality set of points, it may be helpful to reduce the result set to only the points necessary to represent the line geometry. Two line simplification algorithms that accomplish this are Ramer–Douglas–Peucker and Visvalingam–Whyatt. Demonstration of both algorithms.
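To make the idea concrete, here is a minimal sketch of Ramer–Douglas–Peucker in Python. It is an illustration only, not Elasticsearch code: the function names `rdp` and `point_segment_distance` are hypothetical, points are plain `(x, y)` tuples, and distances are planar, which ignores the curvature that real geo_points would need to account for.

```python
import math


def point_segment_distance(p, a, b):
    """Planar distance from point p to the segment a-b (assumption: no geodesic math)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate segment: a and b are the same point.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def rdp(points, epsilon):
    """Return a subset of points that stays within epsilon of the original line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the interior point farthest from the chord first-last.
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        # Every interior point is close enough to the chord: drop them all.
        return [first, last]
    # Otherwise keep the farthest point and simplify both halves recursively.
    left = rdp(points[: index + 1], epsilon)
    right = rdp(points[index:], epsilon)
    return left[:-1] + right  # avoid duplicating the shared split point
```

Note that this version needs the whole line in memory at once and recurses, so it is only meant to show the technique on small inputs.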

The PR for the Geo Line aggregation suggested having a simplify option to accomplish this, but it was not implemented.

For example, let's say we are tracking a single delivery vehicle over a 2000 mile trip. The vehicle averages 45 miles per hour and submits a GPS location every 10 seconds. Over the course of the entire trip, the vehicle will send about 16000 locations as geo_points. Currently, a geo_line aggregation can only include up to the latest 10000 geo_points. But since the vehicle spends many minutes traveling in a straight line, we should be able to greatly reduce the number of vertices in the geo_line across the entire 16000 point data set using one of the aforementioned simplification algorithms.
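The arithmetic behind those numbers can be checked with a few lines of Python. The constant names are made up for this sketch; the values are the ones stated in the example.

```python
TRIP_MILES = 2000
SPEED_MPH = 45
REPORT_INTERVAL_S = 10
GEO_LINE_MAX_POINTS = 10_000  # current geo_line limit mentioned in this issue

# Trip duration in seconds, then one location per reporting interval.
trip_seconds = TRIP_MILES / SPEED_MPH * 3600
total_points = round(trip_seconds / REPORT_INTERVAL_S)

# Points that fall outside the line when only the latest 10000 are kept.
dropped_points = max(0, total_points - GEO_LINE_MAX_POINTS)

print(total_points, dropped_points)
```

So roughly 6000 of the 16000 points cannot be part of the line today, regardless of how little they contribute to its shape.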

nreese commented 2 years ago

Similar to https://github.com/elastic/elasticsearch/issues/87710

elasticmachine commented 2 years ago

Pinging @elastic/es-analytics-geo (Team:Analytics)

iverase commented 2 years ago

Just to make it clear: it is not possible to implement this feature on our current indexes in a scalable fashion.

Elasticsearch is a distributed system and points from one track can be stored in different shards that can actually be located in different cluster nodes. The only way to apply those simplification algorithms is to send all the data points of a track to the coordinator node and then apply the simplification. This approach is not scalable and it will be very easy to send a request that fills the heap of the coordinator node.

We are hoping to provide this functionality as part of the TSDB on geo_line project. Time series aggregations have the property that data is visited in chronological order for one tsid (aka track), which is exactly what we need to apply simplification while reading the points from a track. Of course, indexes will need to be created as time series indexes.
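As a rough illustration of why chronological order matters, the sketch below simplifies a track while reading it, keeping at most `max_points` in memory. It uses a Visvalingam–Whyatt style rule (drop the interior point whose triangle with its neighbours has the smallest area). This is an assumption-laden sketch, not the Elasticsearch implementation: `simplify_stream` and `triangle_area` are hypothetical names, coordinates are treated as planar, and the linear scan for the smallest triangle would normally be replaced by a priority queue.

```python
def triangle_area(a, b, c):
    """Planar area of the triangle a-b-c (assumption: no geodesic math)."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def simplify_stream(points, max_points):
    """Consume points in track order and never hold more than max_points + 1 of them."""
    if max_points < 3:
        raise ValueError("max_points must be at least 3")
    kept = []
    for p in points:  # points may be a generator; it is read exactly once
        kept.append(p)
        if len(kept) > max_points:
            # Drop the interior point that contributes least to the shape.
            # The first and the most recent point are always preserved.
            i = min(
                range(1, len(kept) - 1),
                key=lambda j: triangle_area(kept[j - 1], kept[j], kept[j + 1]),
            )
            del kept[i]
    return kept


# Hypothetical track: 49 units east, then 49 units north, one point per unit.
track = [(x, 0) for x in range(50)] + [(49, y) for y in range(1, 50)]
simplified = simplify_stream(iter(track), 3)
```

Because each point is seen once and in order, memory stays bounded by `max_points` no matter how long the track is, which is what reading points out of order from several shards cannot guarantee.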

craigtaverner commented 1 year ago

The time-series version of line-simplification in geo_line aggregations has been merged in https://github.com/elastic/elasticsearch/pull/94954. This issue can remain as a request to do line-simplification in non-time-series cases. However, as mentioned above, there are memory concerns with this approach. It could, perhaps, make sense if the data nodes' threshold was different from the coordinating node's threshold. For example, right now the data nodes truncate at a very high value of 10000, in order to reduce the risk of damaging results (truncating is damaging), and the coordinating node truncates to the same threshold, but in fact a line simplified to 1000 points or even fewer (perhaps even 100) could be sufficient for visualization. So perhaps line-simplification in the coordinating node to a shorter line is of value.