Closed colings86 closed 8 years ago
We may need to consider how we model the idea of entities and roles. E.g. foo@hotmail.com is an entity, as is bank account 123535, but the entities could each appear in different roles, e.g. appearing as email sender vs recipient, or payment payer vs payee.
If we want to summarise how entities interact and don't have any special treatment of entities/roles then the client has to either:
a) Create a role-less field for the purposes of analysis e.g. the field transactionParticipant
or
b) Provide terms for each of the roles e.g. sender:foo@hotmail.com, recipient:foo@hotmail.com etc etc
Perhaps a simpler option is to assume that the user lists the entities of interest once and separately defines the list of fields which represent roles e.g:
{
  "entities": ["foo@hotmail.com", "bar@hotmail.com"],
  "roles": ["from", "to"]
}
The entities become the nodes in our graph and the edges are the type of line that connects entities, e.g. a direction of payment. Edges would be agg buckets and could summarise interactions between a pair of entities through the use of child aggs, e.g. summing the volumes of money transferred, month by month. It may be useful to do some form of "edge bundling", e.g. rolling up the to and cc roles into a single recipient role for the purposes of bucketing. This could be defined as part of the agg settings.
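To make the edge-bundling idea concrete, here is a minimal client-side sketch in Python. It is not the aggregation itself: the names ENTITIES, ROLE_BUNDLES and edge_buckets, and the sample documents, are hypothetical and exist only for illustration. It assumes each document lists its participants under from, to and cc, and it rolls to and cc up into a single recipient role before counting documents per (sender, recipient) pair.

```python
from collections import Counter

# Hypothetical settings: the entities of interest, and a bundling of raw
# role fields into the roles used for bucketing (to and cc -> recipient).
ENTITIES = {"foo@hotmail.com", "bar@hotmail.com"}
ROLE_BUNDLES = {"from": "sender", "to": "recipient", "cc": "recipient"}


def edge_buckets(docs):
    """Count documents per (sender, recipient) pair of entities of interest."""
    buckets = Counter()
    for doc in docs:
        # Collect the entities of interest seen under each bundled role.
        # Sets ensure an entity listed in both to and cc counts once per doc.
        by_role = {}
        for field, role in ROLE_BUNDLES.items():
            matches = {v for v in doc.get(field, []) if v in ENTITIES}
            by_role.setdefault(role, set()).update(matches)
        for src in by_role.get("sender", ()):
            for dest in by_role.get("recipient", ()):
                if src != dest:
                    buckets[(src, dest)] += 1
    return buckets


# Made-up sample documents, purely for illustration.
docs = [
    {"from": ["foo@hotmail.com"], "to": ["bar@hotmail.com"]},
    {"from": ["foo@hotmail.com"], "to": ["other@hotmail.com"], "cc": ["bar@hotmail.com"]},
    {"from": ["bar@hotmail.com"], "to": ["foo@hotmail.com"]},
]
buckets = edge_buckets(docs)
```

With these sample documents the foo -> bar edge is counted twice (once via to, once via cc) and the bar -> foo edge once. A child agg such as a monthly sum would hang off each of these edge buckets in the same way.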
In terms of the API for this aggregation at the moment I have the following:
Request:
{
  "size": 100,
  "fields": ["From", "To"],
  "nodeValues": ["foo@example.com", "bar@example.com"]
}
Response:
[
  {
    "key_as_string": "From:foo@example.com\u0000To:bar@example.com",
    "src": {
      "field": "From",
      "value": "foo@example.com"
    },
    "dest": {
      "field": "To",
      "value": "bar@example.com"
    },
    "doc_count": 113
  },
  {
    "key_as_string": "From:bar@example.com\u0000To:foo@aol.com",
    "src": {
      "field": "From",
      "value": "bar@example.com"
    },
    "dest": {
      "field": "To",
      "value": "foo@aol.com"
    },
    "doc_count": 80
  }
]
The src and dest fields are at the moment chosen arbitrarily. An improvement might be to split the fields into sourceFields and destinationFields, but this might not fit well with use cases which don't care about directed graphs.
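As a rough illustration of how a client might consume buckets of this shape, the Python sketch below builds both a directed and an undirected edge map. It assumes the buckets arrive as a list of objects with the src, dest, doc_count and key_as_string members shown above; the helper names directed_edges, undirected_edges and split_key are hypothetical and not part of any API.

```python
from collections import defaultdict

# Buckets in the shape shown in the response above, as a Python list.
buckets = [
    {
        "key_as_string": "From:foo@example.com\u0000To:bar@example.com",
        "src": {"field": "From", "value": "foo@example.com"},
        "dest": {"field": "To", "value": "bar@example.com"},
        "doc_count": 113,
    },
    {
        "key_as_string": "From:bar@example.com\u0000To:foo@aol.com",
        "src": {"field": "From", "value": "bar@example.com"},
        "dest": {"field": "To", "value": "foo@aol.com"},
        "doc_count": 80,
    },
]


def directed_edges(buckets):
    """Map (src value, dest value) -> doc_count, keeping direction."""
    edges = defaultdict(int)
    for b in buckets:
        edges[(b["src"]["value"], b["dest"]["value"])] += b["doc_count"]
    return dict(edges)


def undirected_edges(buckets):
    """Map an unordered pair of values -> doc_count, ignoring src/dest order."""
    edges = defaultdict(int)
    for b in buckets:
        pair = frozenset((b["src"]["value"], b["dest"]["value"]))
        edges[pair] += b["doc_count"]
    return dict(edges)


def split_key(key):
    """Split key_as_string on the NUL separator into (field, value) pairs."""
    return [tuple(part.split(":", 1)) for part in key.split("\u0000")]


directed = directed_edges(buckets)
undirected = undirected_edges(buckets)
```

Keying the undirected map on a frozenset means a client that does not care about direction is unaffected by which side happens to be reported as src and which as dest.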
@markharwood @colings86 is this still of interest?
This is superseded by the graph API.
The aggregation would take a list of terms as input (e.g. a list of email addresses) and create buckets based upon co-occurring terms (terms which appear in the same document regardless of their relative position in the document). These buckets then represent edges between the terms and can be used to create weighted graphs (e.g. who is sending emails to whom).
This aggregation can be used in e.g. social networks, where the terms represent uniquely identifiable entities, and, combined with time-based aggregations, can summarise interactions over time.
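The co-occurrence idea can be sketched outside the search engine in a few lines of Python. This is only an in-memory illustration under stated assumptions, not how the aggregation would be implemented: the function name cooccurrence_buckets, the month field and the sample documents are all hypothetical. Each bucket is keyed by a month and an unordered pair of input terms that appear in the same document, regardless of which field they appear in.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_buckets(docs, terms, fields):
    """Count documents per (month, term pair) for co-occurring input terms."""
    wanted = set(terms)
    buckets = Counter()
    for doc in docs:
        # Gather the input terms present anywhere in the listed fields.
        present = {v for f in fields for v in doc.get(f, []) if v in wanted}
        # Every unordered pair of present terms is one edge for this document.
        for pair in combinations(sorted(present), 2):
            buckets[(doc["month"], pair)] += 1
    return buckets


# Made-up sample documents, purely for illustration.
docs = [
    {"month": "2015-01", "From": ["foo@example.com"], "To": ["bar@example.com"]},
    {"month": "2015-01", "From": ["bar@example.com"],
     "To": ["foo@example.com", "baz@example.com"]},
    {"month": "2015-02", "From": ["foo@example.com"], "To": ["bar@example.com"]},
    {"month": "2015-02", "From": ["foo@example.com"], "To": ["baz@example.com"]},
]
terms = ["foo@example.com", "bar@example.com"]
result = cooccurrence_buckets(docs, terms, fields=["From", "To"])
```

Here baz@example.com is not in the input terms, so it never forms an edge, and the last document contributes nothing because only one input term appears in it.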