elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Add Aggregation that buckets documents based on co-occurrence of terms within a document #6688

Closed colings86 closed 8 years ago

colings86 commented 10 years ago

The aggregation would take a list of terms as input (e.g. a list of email addresses) and create buckets based upon co-occurring terms (terms which appear in the same document regardless of their relative position in the document). These buckets then represent edges between the terms and can be used to create weighted graphs (e.g. who is sending emails to whom).

This aggregation can be used in e.g. social networks, where the terms represent uniquely identifiable entities, and combined with time-based aggregations it can summarise interactions over time.
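As a rough illustration of the idea (not Elasticsearch code), the bucketing can be sketched in plain Python. The function name `cooccurrence_buckets` and the representation of each document as a set of terms are assumptions made for this sketch only.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_buckets(docs, terms):
    """Count, for each unordered pair of input terms, the docs containing both.

    `docs` is an iterable of term collections (hypothetical representation);
    position of a term inside a document is ignored.
    """
    wanted = set(terms)
    buckets = Counter()
    for doc_terms in docs:
        # Keep only the terms of interest, sorted so each pair has one key.
        present = sorted(wanted.intersection(doc_terms))
        for pair in combinations(present, 2):
            buckets[pair] += 1
    return buckets


docs = [
    {"foo@example.com", "bar@example.com"},
    {"bar@example.com", "foo@example.com", "baz@example.com"},
    {"foo@example.com", "baz@example.com"},
    {"qux@example.com"},
]
edges = cooccurrence_buckets(
    docs, ["foo@example.com", "bar@example.com", "baz@example.com"]
)
for (a, b), doc_count in sorted(edges.items()):
    print(a, b, doc_count)
```

Each key is an edge between two terms and each count is the edge weight, which is what a weighted graph of "who emails whom" would be built from.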

markharwood commented 10 years ago

We may need to consider how we model the idea of entities and roles. E.g. foo@hotmail.com is an entity, as is bank account 123535, but each entity could appear in different roles, e.g. as email sender vs recipient or payment payer vs payee. If we want to summarise how entities interact and don't have any special treatment of entities/roles, then the client has to either: a) create a role-less field for the purposes of analysis, e.g. the field transactionParticipant, or b) provide terms for each of the roles, e.g. sender:foo@hotmail.com, recipient:foo@hotmail.com etc.

Perhaps a simpler option is to assume that the user lists the entities of interest once and separately defines the list of fields which represent roles, e.g.:

{
    "entities":["foo@hotmail.com", "bar@hotmail.com"],
    "roles": ["from", "to"]
}

The entities become the nodes in our graph and the edges are the type of line that connects entities, e.g. a direction of payment. Edges would be agg buckets and could summarise interactions between a pair of entities through the use of child aggs, e.g. summing the volumes of money transferred, month by month. It may be useful to do some form of "edge bundling", e.g. rolling up the to and cc roles into a single recipient role for the purposes of bucketing. This could be defined as part of the agg settings.
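To make the entities/roles idea concrete, here is a minimal Python sketch of role-aware edges with edge bundling and a month-by-month sum as the "child agg". Everything named here is hypothetical: the helper names, the `bundles` mapping, and the `month` and `amount` document fields are assumptions for illustration, not part of any proposed API.

```python
from collections import defaultdict


def as_list(value):
    """Treat a missing field as empty and a single string as a one-item list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def role_edges(docs, entities, roles, bundles=None):
    """Sum doc["amount"] per month for each pair of (role, entity) nodes.

    `bundles` maps a role name to the role it is rolled up into,
    e.g. {"to": "recipient", "cc": "recipient"} (assumed setting).
    """
    wanted = set(entities)
    bundles = bundles or {}
    totals = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        # Nodes are (bundled role, entity) pairs found in this document.
        nodes = {
            (bundles.get(role, role), value)
            for role in roles
            for value in as_list(doc.get(role))
            if value in wanted
        }
        ordered = sorted(nodes)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                if a[1] != b[1]:  # skip an entity paired with itself
                    totals[(a, b)][doc["month"]] += doc["amount"]
    return {edge: dict(months) for edge, months in totals.items()}


docs = [
    {"from": "foo@hotmail.com", "to": ["bar@hotmail.com"], "amount": 10.0, "month": "2014-06"},
    {"from": "foo@hotmail.com", "cc": ["bar@hotmail.com"], "amount": 5.0, "month": "2014-06"},
    {"from": "bar@hotmail.com", "to": ["foo@hotmail.com"], "amount": 2.0, "month": "2014-07"},
]
edges = role_edges(
    docs,
    entities=["foo@hotmail.com", "bar@hotmail.com"],
    roles=["from", "to", "cc"],
    bundles={"to": "recipient", "cc": "recipient"},
)
for edge, months in edges.items():
    print(edge, months)
```

Because the role is part of each node, the two directions (foo to bar, bar to foo) land in separate buckets, while the to and cc documents are merged into one recipient edge.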

colings86 commented 10 years ago

In terms of the API for this aggregation, at the moment I have the following:

Request:

{
    "size" : 100,
    "fields" : ["From", "To"],
    "nodeValues" : [ "foo@example.com", "bar@example.com"]
}

Response:

[
    {
        "key_as_string": "From:foo@example.com\u0000To:bar@example.com",
        "src": {
            "field": "From",
            "value": "foo@example.com"
        },
        "dest": {
            "field": "To",
            "value": "bar@example.com"
        },
        "doc_count": 113
    },
    {
        "key_as_string": "From:bar@example.com\u0000To:foo@aol.com",
        "src": {
            "field": "From",
            "value": "bar@example.com"
        },
        "dest": {
            "field": "To",
            "value": "foo@aol.com"
        },
        "doc_count": 80
    }
]

The src and dest fields are at the moment chosen arbitrarily. An improvement might be to split the fields into sourceFields and destinationFields, but this might not fit well with use cases which don't care about directed graphs.
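For discussion, here is a small Python sketch of how the sourceFields/destinationFields variant could produce buckets in the response shape above. It is only a model of the behaviour under stated assumptions: the function `edge_buckets`, its parameter names, and the in-memory documents are hypothetical, and only the bucket layout and the `\u0000` key separator are taken from the example response.

```python
from collections import Counter


def field_values(doc, field):
    """Return a field's values as a list (a single string becomes one item)."""
    value = doc.get(field)
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def edge_buckets(docs, node_values, source_fields, destination_fields, size=100):
    """Build directed edge buckets, ordered by descending doc_count."""
    wanted = set(node_values)
    counts = Counter()
    for doc in docs:
        srcs = {(f, v) for f in source_fields
                for v in field_values(doc, f) if v in wanted}
        dsts = {(f, v) for f in destination_fields
                for v in field_values(doc, f) if v in wanted}
        for src in srcs:
            for dest in dsts:
                if src[1] != dest[1]:  # ignore self-loops
                    counts[(src, dest)] += 1
    return [
        {
            "key_as_string": f"{src[0]}:{src[1]}\u0000{dest[0]}:{dest[1]}",
            "src": {"field": src[0], "value": src[1]},
            "dest": {"field": dest[0], "value": dest[1]},
            "doc_count": doc_count,
        }
        for (src, dest), doc_count in counts.most_common(size)
    ]


docs = (
    [{"From": "foo@example.com", "To": ["bar@example.com"]}] * 3
    + [{"From": "bar@example.com", "To": ["foo@example.com"]}] * 2
)
buckets = edge_buckets(
    docs,
    node_values=["foo@example.com", "bar@example.com"],
    source_fields=["From"],
    destination_fields=["To"],
)
for bucket in buckets:
    print(bucket["src"]["value"], "->", bucket["dest"]["value"], bucket["doc_count"])
```

For an undirected use case, one option under this sketch would be to pass the same list for both field parameters and merge the two directions client-side, though that is a design question rather than something settled here.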

clintongormley commented 8 years ago

@markharwood @colings86 is this still of interest?

markharwood commented 8 years ago

This is superseded by the graph API.