Graph duplicate node cleansing tool

elasticmachine commented 7 years ago

Original comment by @markharwood:

In datasets like the Panama Papers, noisy duplicate data is a recurring problem and a major pain. Consider the near-duplicate names in this real example: !LINK REDACTED

To assist end users, a simple Levenshtein edit distance computed on the labels typically used in a graph can suggest candidates for grouping. This process would run at the click of a new "link similar" button. The suggestions can be added as dotted links between related vertices, which also pulls the related vertices closer to each other in the diagram. The end user could act on these suggested links by using existing tools to select and group vertices, or hit the undo button to remove the suggestions. I had this implementation working to good effect in a demo using SwissLeaks data (a precursor to the Panama Papers).
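As a rough illustration of the "link similar" idea, here is a minimal sketch that compares every pair of vertex labels and proposes a dotted link when the edit distance is small relative to the label length. The `Vertex` and `SuggestedLink` shapes, the `suggestSimilarLinks` name, and the 0.2 ratio threshold are assumptions made for this example; they are not taken from the Graph plugin or from the original demo.

```typescript
// Hypothetical shapes for this sketch; not the Graph plugin's real types.
interface Vertex {
  id: string;
  label: string;
}

interface SuggestedLink {
  source: string;
  target: string;
  distance: number;
}

// Levenshtein edit distance using the two-row dynamic-programming variant.
function levenshtein(a: string, b: string): number {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Compare all label pairs and keep those whose distance, divided by the
// longer label's length, is at or below maxRatio (an assumed threshold).
function suggestSimilarLinks(vertices: Vertex[], maxRatio = 0.2): SuggestedLink[] {
  const links: SuggestedLink[] = [];
  for (let i = 0; i < vertices.length; i++) {
    for (let j = i + 1; j < vertices.length; j++) {
      const a = vertices[i].label.trim().toLowerCase();
      const b = vertices[j].label.trim().toLowerCase();
      const longest = Math.max(a.length, b.length);
      if (longest === 0) continue;
      const distance = levenshtein(a, b);
      if (distance / longest <= maxRatio) {
        links.push({ source: vertices[i].id, target: vertices[j].id, distance });
      }
    }
  }
  return links;
}
```

The pairwise comparison is quadratic in the number of vertices, which is probably acceptable for the handful of nodes visible in a workspace but would not scale to a whole index.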

elasticmachine commented 7 years ago

Original comment by @markharwood:

A similar requirement is to use the text labels of selected nodes as a tokenized query to match similar nodes not currently in the workspace. Using index patterns that span more than one index, I have used this feature to connect people/companies/addresses in the Panama Papers to similar entities in an OFAC sanctions list. This provides a tool for linking entities from different datasets. Ideally, any grouping actions the user takes to merge entities visually could optionally be preserved as an "alias" definition that the UI could use as a reference to benefit other users or repeat visits to the same datasets.

!LINK REDACTED

By using named "more like this" type queries for the labels of selected nodes, we can find the most similar document (using a negative boost for existing node-terms to avoid matching what we already have). The best-matching doc provides us with similar new node-terms to add to the workspace. Because the queries are named, we can see which nodes caused a match and add lines to connect the new node. Parsing the explain output also tells us the strength of the match from each of the query clauses, so these can be used to show similarity strength in the line thickness.

A dotted line is used to emphasise the difference between a hard link (panamapapers entity 1214773 is connected to panamapapers entity 10076089) and a soft link (the label of panamapapers entity 1214773 is strikingly similar to ofac entity 10725).

Of course, this technique is also useful for spotting similar nodes within one dataset, e.g. where the Panama Papers folks missed a link (there are many of these!). But clearly a big benefit of this soft-linking technique is spanning datasets produced independently with no common "hard" ids shared between them, like the OFAC and Panama Papers data.
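To make the named-query approach concrete, here is a hedged sketch of what such a search body might look like, built from standard Elasticsearch query DSL pieces: one `more_like_this` clause per selected node, each tagged with `_name`, wrapped in a `boosting` query that demotes documents containing terms already in the workspace. The `SelectedNode` shape, the function names, and the `negative_boost` value of 0.1 are assumptions for illustration; the original demo's actual query is not shown in this thread. Explain-output parsing for line thickness is left out.

```typescript
// Hypothetical shapes for this sketch; not the Graph plugin's real types.
interface SelectedNode {
  id: string; // workspace node id, reused as the named-query name
  field: string; // index field the node's term came from
  label: string; // text label used as the "like" text
}

interface SearchHit {
  _id: string;
  matched_queries?: string[]; // names of the clauses that matched this hit
}

interface SoftLink {
  source: string;
  target: string;
  dotted: boolean;
}

// Build a search body: one named more_like_this clause per selected node,
// with documents holding already-known terms demoted via a boosting query.
function buildSoftLinkQuery(
  selected: SelectedNode[],
  existingTerms: Record<string, string[]>,
  negativeBoost = 0.1 // assumed value; tune per dataset
) {
  const should = selected.map((node) => ({
    more_like_this: {
      fields: [node.field],
      like: node.label,
      min_term_freq: 1, // labels are short, so the default of 2 would drop every term
      min_doc_freq: 1,
      _name: node.id,
    },
  }));
  const negative = Object.entries(existingTerms).map(([field, terms]) => ({
    terms: { [field]: terms },
  }));
  return {
    size: 1, // only the best-matching document is needed
    query: {
      boosting: {
        positive: { bool: { should, minimum_should_match: 1 } },
        negative: { bool: { should: negative } },
        negative_boost: negativeBoost,
      },
    },
  };
}

// Turn the named queries reported on a hit into dotted "soft" links
// from each matching workspace node to the newly found document.
function softLinksFromHit(hit: SearchHit): SoftLink[] {
  return (hit.matched_queries ?? []).map((nodeId) => ({
    source: nodeId,
    target: hit._id,
    dotted: true,
  }));
}
```

Using the node id as the `_name` value means the `matched_queries` array on each hit maps straight back to workspace nodes, which is what lets the UI draw a line from every node that contributed to the match.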

elasticmachine commented 3 years ago

Pinging @elastic/kibana-data-discovery (Team:DataDiscovery)

elasticmachine commented 1 year ago

Pinging @elastic/kibana-visualizations @elastic/kibana-visualizations-external (Team:Visualizations)

timductive commented 7 months ago

Closing this because it's not planned to be resolved in the foreseeable future. It will be tracked in our Icebox and will be re-opened if our priorities change. Feel free to re-open if you think it should be melted sooner.

elastic / kibana

Graph duplicate node cleansing tool #17885