elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Add Prefix Aggregation #10408

Closed JanJakes closed 8 years ago

JanJakes commented 9 years ago

There is a Prefix Query and a Prefix Filter, but no Prefix Aggregation. It should be fairly easy to add, since the GeoHash Aggregation should internally work the same way (I guess).

Currently it can be roughly simulated by creating a Filters Aggregation and specifying one Prefix Filter per desired prefix bucket (see the sketch below), but it's not quite the same thing and there may be performance issues.
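
A minimal sketch of that workaround, assuming a not_analyzed string field named code and two example prefixes (the field name and prefixes are assumptions, not from the issue):

GET _search
{
  "aggs": {
    "by_prefix": {
      "filters": {
        "filters": {
          "00": { "prefix": { "code": "00" } },
          "01": { "prefix": { "code": "01" } }
        }
      }
    }
  }
}

Each named filter produces one bucket, so every desired prefix has to be listed explicitly.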

A Prefix Aggregation with a prefix length parameter would be really handy in many cases.

jpountz commented 9 years ago

Can you give more information about your use-case? I would be interested to know more about the big picture to see how we can improve elasticsearch to address your needs.

JanJakes commented 9 years ago

Our use case might be a little special - we use it to store quadkeys (GeoHash is not suitable for us) and to spread items evenly on map tiles via a Prefix Aggregation (currently simulated with filters). However, aggregating items by a prefix of a certain length seems to be a legitimate and general use case - one might want to group items into buckets by their prefixes (see http://stackoverflow.com/questions/23067983/term-aggregation-consider-only-the-prefix-to-aggregate), which can be useful for any string codes or hashes with a defined structure. More generally, strings can represent any kind of tree structure (binary 0110..., quadtree 01032..., etc.), and one could aggregate the items in a subtree by its prefix. GeoHash Aggregation is a special case of such a general "string prefix aggregation".

Some more details about our use case: we use custom quadkeys instead of GeoHash since it allows us to query items on a map per tile (as they are actually displayed on the map). A simple query to get the top items for a particular tile would be something like "get N items with quadkey prefix P sorted by rating R with limit L".

However, this could return many items in one particular region of the tile (they all have a high rating), and we would not be able to display them all anyway because they would overlap (while other regions would be empty). To spread the items over a tile we want to group them by a prefix longer than the tile's quadkey and get some Top Hits from each of the groups (each tile is thus partitioned into 4, 16, 64, etc. subregions depending on the prefix length), as sketched below. This is equivalent to GeoHash Aggregation, but it is much more general.
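
Extending the filters sketch above with a top_hits sub-aggregation gives a hedged approximation of this per-subregion lookup, assuming a not_analyzed quadkey field, a rating field, and a tile with quadkey 012 split into its four child prefixes (all field names and values are assumptions):

GET _search?size=0
{
  "query": {
    "prefix": { "quadkey": "012" }
  },
  "aggs": {
    "subregions": {
      "filters": {
        "filters": {
          "0120": { "prefix": { "quadkey": "0120" } },
          "0121": { "prefix": { "quadkey": "0121" } },
          "0122": { "prefix": { "quadkey": "0122" } },
          "0123": { "prefix": { "quadkey": "0123" } }
        }
      },
      "aggs": {
        "top_items": {
          "top_hits": {
            "size": 3,
            "sort": [ { "rating": "desc" } ]
          }
        }
      }
    }
  }
}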

sciphilo commented 9 years ago

I'd find this very useful too. Right now I'd like to be able to aggregate on hierarchical paths such as /1/2/3 by specifying a prefix or a depth (for example a depth of 2 would be /1/2).

JanJakes commented 9 years ago

Any updates/ideas on this feature?

costasovo commented 9 years ago

This would be useful. +1

wuranbo commented 9 years ago

Badly needed. RESTful URLs usually contain the resource id, so analysing a particular kind of user action becomes impossible.

Consider the widely adopted REST URL conventions that many projects follow, for example:

/api/v0/user/456/setting
/user/458
/blog/7788
/commit/4578

@JanJakes @jpountz When we do a terms aggregation by path, the result should look like:

{
  "buckets": [
    {
      "key": "/blog",
      "doc_count": 1000
    },
    {
      "key": "/user",
      "doc_count": 100
    }
  ]
}

So far the only way we have found to do this without additional indexing is a complex script (see the sketch below).
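
A hedged sketch of such a script-based terms aggregation, assuming a recent Elasticsearch version with Painless and a keyword sub-field named url.keyword (both assumptions); the script keeps everything up to the second slash:

GET _search?size=0
{
  "aggs": {
    "top_level_paths": {
      "terms": {
        "script": {
          "lang": "painless",
          "source": "String p = doc['url.keyword'].value; int i = p.indexOf('/', 1); return i > 0 ? p.substring(0, i) : p;"
        }
      }
    }
  }
}

This scans the raw strings of every matching document, which is exactly the cost the later comments suggest paying at index time instead.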

Furthermore, URLs like /album/3434/track/2324 are even harder to handle.

wuranbo commented 9 years ago

@sciphilo How do you specify a depth? I could not find that option.

sciphilo commented 9 years ago

@wuranbo Depth was a theoretical suggestion; that functionality doesn't exist yet in this context AFAIK.

sciphilo commented 9 years ago

@wuranbo In the end I just loaded each path segment (path to root) flat into the index, and then my aggregation counts worked nicely. So for 1/2/3 I loaded 1/, 1/2/, 1/2/3/ (see the sketch below).
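
A minimal sketch of that approach, assuming a not_analyzed (keyword) field named path_segments and a hypothetical index name (both are assumptions):

PUT paths_index/doc/1
{
  "path_segments": ["1/", "1/2/", "1/2/3/"]
}

GET paths_index/_search?size=0
{
  "aggs": {
    "path_counts": {
      "terms": {
        "field": "path_segments"
      }
    }
  }
}

Each document then contributes one count to every ancestor path it belongs to.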

whythecode commented 9 years ago

+1

clintongormley commented 8 years ago

As with most things in Elasticsearch, it is better to use an index-time solution instead of a query-time solution. The terms aggregation is more efficient because it uses global ordinals rather than raw string terms, while a runtime prefix agg would need to work with the raw strings.

Preparing your data at index time can be done with the right analysis chain, e.g. for paths like /foo/bar/baz you could use the path_hierarchy tokenizer. For instance:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "paths": {
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "path": {
          "type": "string",
          "analyzer": "paths",
          "search_analyzer": "keyword"
        },
        "depth": {
          "type": "integer"
        }
      }
    }
  }
}

The path field will tokenize /foo/bar/baz as [/foo, /foo/bar, /foo/bar/baz]. The depth field is easy to precalculate and can be used for various filters if needed (see below). Index some data:

POST my_index/my_type/_bulk
{"index":{"_id":1}}
{"path":"/foo","depth":1}
{"index":{"_id":2}}
{"path":"/foo/bar","depth":2}
{"index":{"_id":3}}
{"path":"/foo/baz","depth":2}
{"index":{"_id":4}}
{"path":"/foo/bar/one","depth":3}
{"index":{"_id":5}}
{"path":"/foo/bar/two","depth":3}

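As a quick sanity check, the analyzer output can be inspected with the _analyze API (this request shape assumes a recent Elasticsearch version):

POST my_index/_analyze
{
  "analyzer": "paths",
  "text": "/foo/bar/baz"
}
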
Get all documents anywhere in the tree beginning with /foo:

GET _search
{
  "query": {
    "match": {
      "path": "/foo"
    }
  }
}

Get direct children of /foo only:

GET _search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "path": "/foo"
          }
        },
        {
          "term": {
            "depth": 2
          }
        }
      ]
    }
  }
}

Return the 10 paths under /foo with the most descendants:

GET _search?size=0
{
  "query": {
    "match": {
      "path": "/foo"
    }
  },
  "aggs": {
    "popular_descendants": {
      "terms": {
        "field": "path"
      }
    }
  }
}

Return the 10 direct children of /foo with counts of all their descendants:

GET _search?size=0
{
  "query": {
    "match": {
      "path": "/foo"
    }
  },
  "aggs": {
    "popular_children": {
      "terms": {
        "field": "path",
        "include": "(/[^/]+){2}"
      }
    }
  }
}

Return the 10 direct children of /foo with counts of all their descendants AND counts of their direct children (i.e. grandchildren):

GET _search?size=0
{
  "query": {
    "match": {
      "path": "/foo"
    }
  },
  "aggs": {
    "popular_children": {
      "terms": {
        "field": "path",
        "include": "(/[^/]+){2}"
      },
      "aggs": {
        "grandchildren_counts": {
          "filter": {
            "term": {
              "depth": 3
            }
          }
        }
      }
    }
  }
}

Note: the include parameter still requires some raw text crunching. It would be even more efficient if you know the maximum number of levels your tree can have and index these docs as:

{
  "level_1": "/foo",
  "level_2": "/foo/bar",
  "level_3": "/foo/bar/baz"
}
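
With that document shape, aggregating the direct children of /foo becomes a plain terms aggregation; a hedged sketch, assuming the level_* fields are indexed as not_analyzed / keyword:

GET _search?size=0
{
  "query": {
    "term": {
      "level_1": "/foo"
    }
  },
  "aggs": {
    "children": {
      "terms": {
        "field": "level_2"
      }
    }
  }
}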

In the Elasticsearch API, we try to direct users towards the most efficient way to do things. I'm against adding an arbitrary prefix aggregation because it hides the complexity of the search operation: it makes the search look light when it is really doing a lot of work behind the scenes, work that could be done more efficiently at index time.

yavuzmester commented 7 years ago

@clintongormley How would you do this for quadtrees, and wouldn't it be better to provide a quadtree_grid aggregation analogous to the geohash_grid aggregation?

An example use case is drawing heatmaps with Leaflet.

Thanks