Closed. JanJakes closed this issue 8 years ago.
Can you give more information about your use-case? I would be interested to know more about the big picture to see how we can improve elasticsearch to address your needs.
Our use case might be a little bit special - we use it to store quadkeys (GeoHash is not suitable for us) and to spread items evenly on map tiles by a Prefix Aggregation (currently simulated with filters). However, aggregating items by a prefix of a certain length seems to be quite a legitimate and general use case - one might want to group items into buckets by their prefixes (see http://stackoverflow.com/questions/23067983/term-aggregation-consider-only-the-prefix-to-aggregate), which can be quite useful for any string codes or hashes that have some defined structure. A more general way to look at it is that one could use strings to encode any type of tree structure (binary 0110..., quadtree 01032..., etc.) and aggregate items in subtrees by their prefix. GeoHash Aggregation is then a special case of a more general "string prefix aggregation".
Some more details about our use case: we use custom quadkeys instead of GeoHash since it allows us to query items on a map per tile (as they are really displayed on the map). A simple query to get top items for a particular tile would be something like "get N items with quadkey prefix P sorted by rating R with limit L".
However, this could return many items in one particular region of the tile (they all have a high rating), and we would not be able to display them all anyway because they would overlap (while other regions would be empty). To spread the items over a tile we want to group them by a prefix longer than the quadkey of the tile and get some Top Hits from each of the groups (each tile is thus partitioned into 4, 16, 64, etc. subregions depending on the prefix length). This is equivalent to GeoHash Aggregation, but much more general.
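The grouping described above can be sketched in a few lines. This is a minimal client-side illustration of what such a prefix aggregation with a top-hits sub-aggregation would do; the quadkeys and ratings are made-up sample data, and Bing-style base-4 quadkeys (one digit per zoom level) are assumed:

```python
from collections import defaultdict

# Hypothetical items: (quadkey, rating). Quadkeys are base-4 strings,
# one digit per zoom level, so a longer shared prefix means a smaller
# shared map region.
items = [
    ("0231", 9.1), ("0233", 8.7), ("0230", 7.5),
    ("0310", 6.2), ("0312", 9.9), ("1200", 5.0),
]

def top_hits_per_prefix(items, prefix_len, top_n=1):
    """Group items by a quadkey prefix of the given length and keep the
    top-rated hits in each bucket -- what a prefix aggregation with a
    top_hits sub-aggregation would do server-side."""
    buckets = defaultdict(list)
    for quadkey, rating in items:
        buckets[quadkey[:prefix_len]].append((quadkey, rating))
    return {
        prefix: sorted(hits, key=lambda h: h[1], reverse=True)[:top_n]
        for prefix, hits in buckets.items()
    }

# Partition the map into subregions one level deeper than a length-1 tile
# key, keeping only the single best-rated item per subregion.
print(top_hits_per_prefix(items, prefix_len=2))
```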
I'd find this very useful too. Right now I'd like to be able to aggregate on hierarchical paths such as /1/2/3 by specifying a prefix or a depth (for example a depth of 2 would be /1/2).
Any updates/ideas on this feature?
This would be useful. +1
Badly needed. RESTful URLs always contain the resource id, so analysing a given kind of user action becomes impossible. Consider that most people follow the widely deployed URL best practices, for example:
/api/v0/user/456/setting
/user/458
/blog/7788
/commit/4578
@JanJakes @jpountz When we run a terms aggregation by path, the result should look like:
{
  "buckets": [
    { "key": "/blog", "doc_count": 1000 },
    { "key": "/user", "doc_count": 100 }
  ]
}
So far the only way we have found to do this without an additional index is a complex script. Furthermore, URLs like /album/3434/track/2324 are even harder to handle.
@sciphilo How do you specify a depth? I could not find it.
@wuranbo Depth was a theoretical suggestion; that functionality doesn't exist yet in this context, AFAIK.
@wuranbo In the end I just loaded each path segment (path to root) flat into the index, and then my aggregation counts worked nicely. So for 1/2/3 I loaded 1/, 1/2/, 1/2/3/.
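The flattening described above is mechanical to produce at index time. A sketch, following the trailing-slash convention from the comment:

```python
def path_prefixes(path):
    """Expand a hierarchical path into all of its ancestor paths
    (root first), each with a trailing slash, so that a plain terms
    aggregation on the resulting values counts every subtree."""
    segments = path.strip("/").split("/")
    return ["/".join(segments[:i + 1]) + "/" for i in range(len(segments))]

print(path_prefixes("1/2/3"))  # → ['1/', '1/2/', '1/2/3/']
```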
+1
As with most things in Elasticsearch, it is better to use an index-time solution instead of a query-time solution. The terms aggregation is more efficient because it uses global ordinals rather than the raw string term, while a runtime prefix agg would need to work with raw strings.
Preparing your data at index time can be done with the right analysis chain, e.g. for paths like /foo/bar/baz you could use the path_hierarchy tokenizer. For instance:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"paths": {
"tokenizer": "path_hierarchy"
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"path": {
"type": "string",
"analyzer": "paths",
"search_analyzer": "keyword"
},
"depth": {
"type": "integer"
}
}
}
}
}
The path field will tokenize /foo/bar/baz as [/foo, /foo/bar, /foo/bar/baz]. The depth field is easy to precalculate and can be used for various filters if needed (see below). Index some data:
POST my_index/my_type/_bulk
{"index":{"_id":1}}
{"path":"/foo","depth":1}
{"index":{"_id":2}}
{"path":"/foo/bar","depth":2}
{"index":{"_id":3}}
{"path":"/foo/baz","depth":2}
{"index":{"_id":4}}
{"path":"/foo/bar/one","depth":3}
{"index":{"_id":5}}
{"path":"/foo/bar/two","depth":3}
Get all documents anywhere in the tree beginning with /foo:
GET _search
{
"query": {
"match": {
"path": "/foo"
}
}
}
Get the direct children of /foo only:
GET _search
{
"query": {
"bool": {
"must": [
{
"match": {
"path": "/foo"
}
},
{
"term": {
"depth": 2
}
}
]
}
}
}
Return the 10 paths under /foo with the most descendants:
GET _search?size=0
{
"query": {
"match": {
"path": "/foo"
}
},
"aggs": {
"popular_descendants": {
"terms": {
"field": "path"
}
}
}
}
Return the 10 direct children of /foo with counts of all their descendants:
GET _search?size=0
{
"query": {
"match": {
"path": "/foo"
}
},
"aggs": {
"popular_children": {
"terms": {
"field": "path",
"include": "(/[^/]+){2}"
}
}
}
}
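The (/[^/]+){2} pattern in the include parameter works because Lucene regexes are implicitly anchored to the whole term, so it selects exactly the two-segment paths. A quick sanity check in Python, using re.fullmatch to emulate that anchoring (the term list mirrors the sample documents indexed above):

```python
import re

# Terms the path_hierarchy tokenizer would emit for the sample docs.
terms = ["/foo", "/foo/bar", "/foo/baz", "/foo/bar/one", "/foo/bar/two"]

# Lucene regexes must match the entire term; fullmatch emulates that.
# Each repetition of /[^/]+ consumes exactly one path segment, so {2}
# keeps only depth-2 terms.
pattern = re.compile(r"(/[^/]+){2}")
children = [t for t in terms if pattern.fullmatch(t)]
print(children)  # → ['/foo/bar', '/foo/baz']
```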
Return the 10 direct children of /foo with counts of all their descendants AND counts of their direct children (i.e. grandchildren):
GET _search?size=0
{
"query": {
"match": {
"path": "/foo"
}
},
"aggs": {
"popular_children": {
"terms": {
"field": "path",
"include": "(/[^/]+){2}"
},
"aggs": {
"grandchildren_counts": {
"filter": {
"term": {
"depth": 3
}
}
}
}
}
}
}
Note: the include parameter still requires some raw text crunching. If you know the maximum number of levels your tree can have, it would be more efficient still to index these docs as:
{
"level_1": "/foo",
"level_2": "/foo/bar",
"level_3": "/foo/bar/baz"
}
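Producing those level_N fields is also a small index-time transformation; a sketch, assuming a known maximum depth:

```python
def level_fields(path, max_levels=3):
    """Split a path like /foo/bar/baz into per-depth fields
    (level_1: /foo, level_2: /foo/bar, ...) so each depth can be
    aggregated on directly, with no regex include needed."""
    segments = path.strip("/").split("/")
    return {
        f"level_{i}": "/" + "/".join(segments[:i])
        for i in range(1, min(len(segments), max_levels) + 1)
    }

print(level_fields("/foo/bar/baz"))
```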
In the Elasticsearch API, we try to direct users towards the most efficient way to do things. I'm against adding an arbitrary prefix aggregation because it hides the complexity of the search operation; in other words, it makes the search operation seem light when really it is doing a lot of manual work behind the scenes, work that could be done more efficiently at index time.
@clintongormley How would you do this for quadtrees? And isn't it better to provide a quadtree_grid aggregation analogous to the geohash_grid aggregation?
Example use case is drawing heatmaps with Leaflet:
For each visible tile (256 px):
I also have a configuration parameter to adjust resolution, so I may decrease resolution by one level, which means I dive 7 levels instead of 8; the squares then become four times as big (2 px by 2 px instead of a single pixel). I can keep decreasing resolution one level at a time down to diving 0 levels, and the squares grow the same way each time. Geohash aggregation is not that predictable, i.e. sometimes the aggregation units are squares, sometimes rectangles.
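The predictability comes straight from the quadtree geometry; a minimal sketch of the pixel math, assuming standard 256 px web-map tiles:

```python
TILE_PX = 256  # standard web-map tile size in pixels

def cell_size_px(levels_dived):
    """Side length in pixels of one quadtree cell when diving the given
    number of levels below a tile: each level halves the side, so diving
    8 levels under a 256 px tile yields 1 px cells, 7 levels yields
    2 px cells, and so on. One fewer level quadruples the cell's area."""
    return TILE_PX // (2 ** levels_dived)

print([cell_size_px(n) for n in (8, 7, 6)])  # → [1, 2, 4]
```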
Thanks
There is a Prefix Query and a Prefix Filter, but no Prefix Aggregation. It should be fairly easy to add, since GeoHash Aggregation presumably works the same way internally.
Currently it can be sort of simulated by creating a Filters Aggregation with one Prefix Filter per desired prefix bucket, but that is not quite the same thing and there may be performance issues.
A Prefix Aggregation with a prefix length parameter would be really handy in many cases.
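The filters-aggregation workaround mentioned above can at least be generated programmatically rather than written by hand. A sketch that builds the aggregation body with one prefix filter per bucket (the field name and prefixes are illustrative):

```python
def prefix_filters_agg(field, prefixes):
    """Build the body of an Elasticsearch filters aggregation with one
    prefix filter per desired bucket -- a client-side stand-in for the
    missing prefix aggregation."""
    return {
        "filters": {
            "filters": {p: {"prefix": {field: p}} for p in prefixes}
        }
    }

# One bucket per child quadrant of the quadkey tile "02".
agg = prefix_filters_agg("quadkey", ["020", "021", "022", "023"])
print(agg)
```

Note that this enumerates every bucket up front, which is exactly why it scales poorly compared to a real server-side prefix aggregation.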