dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
https://docs.dedupe.io
MIT License

Local sparsification #1027

Open fgregg opened 2 years ago

fgregg commented 2 years ago

We are contemplating exposing an argument that should sparsify the entire pair-score edgelist (#1026).

It would be very nice if we had some principled way to do this automatically for users. @fjsj and I discussed some "global" approaches previously (https://github.com/dedupeio/dedupe/pull/834#pullrequestreview-440923077), but they didn't look too good.

Flavio also provided some links to "local" sparsification.

it would be great to see if one of these would be appropriate.

fgregg commented 2 years ago

Structure-Preserving Sparsification Methods for Social Networks is about methods that work on unweighted graphs.

They do reference Extracting the multiscale backbone of complex weighted networks approvingly as an approach for weighted networks.

fgregg commented 2 years ago

in Unsupervised Sparsification of Similarity Graphs, Gollub and Stein make the following claim (or something close to it):

Let $c(o)$ indicate the class membership of object $o$, and assume that $o_1$ and $o_2$ belong to the same class. Then

$$ \Pr(c(o_1) = c(o_2)) > \max\{\Pr(c(o_1) = c(o_{x \not\in \{1, 2\}})), \Pr(c(o_2) = c(o_{y \not\in \{1, 2\}}))\} $$

Which should imply that,

$$ s(o_1, o_2) > \max\{\mathbb{E}(s(o_1, o_{x \not\in \{1, 2\}})), \mathbb{E}(s(o_2, o_{y \not\in \{1, 2\}}))\} $$

where $s(o_i, o_j)$ is the similarity score between $o_i$ and $o_j$.

The authors then make an unexpected claim that this is equivalent to

$$ s(o_1, o_2) > \max\{s(o_1, \bar{o}),s(o_2, \bar{o})\} $$

i.e. that the similarity between two objects that belong to the same class should be larger than the similarity between either of those objects and the graph centroid.

This seems clearly untrue. Imagine a graph with only two nodes. If the last proposed rule holds, the two nodes would never be put in the same class.

The more defensible rule, that

$$ s(o_1, o_2) > \max\{\mathbb{E}(s(o_1, o_{x \neq 1})), \mathbb{E}(s(o_2, o_{y \neq 2}))\} $$

has some surprising consequences. It means that the threshold for two nodes being put in the same class should be higher for nodes near the center of the graph and lower for nodes at the periphery.
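As a concrete sketch of that rule (treating the observed pair scores as the similarity graph, so the expectation becomes the mean over each node's incident edges; the function name is mine, not dedupe code):

```python
from collections import defaultdict

def sparsify_by_node_means(edges):
    """Keep an edge (a, b, s) only if its similarity exceeds the mean
    similarity of *both* endpoints to the rest of the graph.

    `edges` is a list of (node_a, node_b, similarity) triples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a, b, s in edges:
        sums[a] += s
        counts[a] += 1
        sums[b] += s
        counts[b] += 1
    means = {n: sums[n] / counts[n] for n in sums}
    return [(a, b, s) for a, b, s in edges if s > max(means[a], means[b])]
```

This makes the center/periphery effect easy to see: an edge survives only if it stands out from both endpoints' neighborhoods, so well-connected central nodes face a higher bar than peripheral ones.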

fgregg commented 2 years ago

Improving Short Text Clustering by Similarity Matrix Sparsification basically picks up at

$$ s(o_1, o_2) > \max\{\mathbb{E}(s(o_1, o_{x \not\in \{1, 2\}})), \mathbb{E}(s(o_2, o_{y \not\in \{1, 2\}}))\} $$

but adds standard deviation.

Let $u_i$ be the mean similarity between $o_i$ and the other nodes, and let $\sigma_i$ be the standard deviation of those similarities.

Then, Rakib and his co-authors propose the rule that

$$ s(o_1, o_2) > \min\{u_1 + \alpha \sigma_1, u_2 + \alpha \sigma_2\} $$

This is basically a z-score based approach, and seems quite interesting. But, beyond setting $\alpha$ to 0, it's not clear what a principled way of setting that value would be.

Also, the shift from maximum to minimum seems hard to justify.
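A sketch of that z-score rule under the same edge-list representation (the names are mine; setting `alpha` to 0 reduces it to a min-of-means rule):

```python
import statistics
from collections import defaultdict

def sparsify_rakib(edges, alpha=1.0):
    """Keep an edge (a, b, s) only if its similarity exceeds
    min(u_a + alpha * sd_a, u_b + alpha * sd_b), where u_i and sd_i are
    the mean and standard deviation of similarities incident on node i.

    `edges` is a list of (node_a, node_b, similarity) triples."""
    incident = defaultdict(list)
    for a, b, s in edges:
        incident[a].append(s)
        incident[b].append(s)
    u = {n: statistics.mean(ss) for n, ss in incident.items()}
    sd = {n: statistics.pstdev(ss) for n, ss in incident.items()}
    return [
        (a, b, s)
        for a, b, s in edges
        if s > min(u[a] + alpha * sd[a], u[b] + alpha * sd[b])
    ]
```

Because of the `min`, an edge survives if it stands out from *either* endpoint's neighborhood, which is noticeably more permissive than the `max` rule above.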

fgregg commented 2 years ago

ultimately, we are going to be thresholding clusters, so let's think about how we can use that fact for sparsification.

let $t$ be our distance threshold. let's start with "average" clustering. let edge distances take values between 0 and 1.

Pairwise

if two nodes $i, j$ form a dense component and $w_{i,j} > t$, we can discard this edge early.

Triangle

three nodes $i, j, k$ form a dense component with weights $w_{i,j}, w_{i,k}, w_{j,k}$. What is the average value of the edges incident on $i$ such that all three records end up in one cluster? Assume $w_{j,k} = 0$ and that $j$ and $k$ are already in cluster $C_{j,k}$; $C_i$ is the singleton cluster containing only node $i$.

$$ D(C_{j,k}, C_i) = u_i = \frac{w_{i,j} + w_{i,k}}{2} < t $$

(useful ref on linkage algos)

So, then, what is the maximum value that a particular edge can have while the merge still happens? Assume there are $n$ edges between the clusters and all the others have a value of 0; then

$$ \frac{x}{n} < t $$

$$ x < nt $$
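A quick numeric check of that bound under average linkage (a sketch; the helper and the numbers are mine, not dedupe code):

```python
def average_linkage(cross_edge_weights):
    """Average-linkage distance between two clusters: the mean weight
    of the edges running between them."""
    return sum(cross_edge_weights) / len(cross_edge_weights)

# Threshold t = 0.4; singleton {i} vs. cluster {j, k}, so n = 2 cross
# edges.  One edge can weigh just under n * t = 0.8 and the merge still
# happens, provided the other edge weighs 0.
t = 0.4
assert average_linkage([0.79, 0.0]) < t   # merge happens
assert average_linkage([0.81, 0.0]) >= t  # merge blocked
```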

This does not look very promising as a form of edge sparsification, but i wonder if it could be valuable as a form of node sparsification.

If every edge incident on a node $i$ has a weight greater than $t$, then $i$ can't be a part of any cluster.
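A sketch of that node filter (the function name is mine; `edges` holds (node, node, distance) triples):

```python
from collections import defaultdict

def prunable_nodes(edges, t):
    """Nodes whose every incident edge has a distance greater than t.

    Such a node's distance to any cluster exceeds t under single,
    complete, or average linkage, so it can never merge with anything
    and all of its edges can be dropped."""
    min_incident = defaultdict(lambda: float("inf"))
    for a, b, d in edges:
        min_incident[a] = min(min_incident[a], d)
        min_incident[b] = min(min_incident[b], d)
    return {n for n, d in min_incident.items() if d > t}
```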

fgregg commented 2 years ago

Extracting the multiscale backbone of complex weighted networks is very local: dyads with very low scores would not be removed. Not quite what we want.

fgregg commented 2 years ago

> If every edge incident on a node $i$ has a weight greater than $t$, then $i$ can't be a part of any cluster.

just did a quick and dirty check on the csv example, and filtering out nodes like this removed a bit over 10% of edges

fgregg commented 2 years ago

Another thing we could sparsify

If an edge is the only path between two nodes, and the edge weight is greater than $t$ (i.e. $w_{c,d} > t$), then we can disregard that edge.

```mermaid
graph TD;
    A-->C;
    B-->C;
    C-->D;
    D-->E;
    D-->F;
```

Don't know if there's an efficient way to identify such single-path edges.
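For what it's worth, edges like $w_{c,d}$ above are graph bridges, and all bridges can be found in $O(V + E)$ with a single depth-first search (Tarjan's algorithm). A stdlib-only sketch (the adjacency-list representation is my assumption, not dedupe's):

```python
def bridges(adj):
    """All bridge edges of an undirected simple graph, via one DFS
    (Tarjan's bridge-finding algorithm, O(V + E)).

    `adj` maps each node to a list of its neighbours.  Uses recursion,
    so very deep graphs would need an iterative rewrite."""
    disc, low, found = {}, {}, []
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                # back edge: v was visited earlier in the DFS
                low[u] = min(low[u], disc[v])
            else:
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:
                    # no back edge from v's subtree climbs above u,
                    # so (u, v) is the only path between the two sides
                    found.append((u, v))

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return found
```

On the diagram above (which is a tree), every edge comes back as a bridge; on a graph with cycles, only the cut edges do.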

fgregg commented 2 years ago

Actually, if the edges between two parts of a graph all have weights greater than the threshold, we should be able to cut those edges.

```mermaid
graph TD;
    A-->C;
    B-->C;
    B-->F;
    C-->D;
    D-->E;
    D-->F;
```

if $w_{c,d}$ and $w_{b,f}$ are both greater than the threshold, we should be able to cut both of them.
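A sketch of that generalized cut (the function is mine): find the connected components of the subgraph of edges with weight at most $t$, then discard every heavier edge that crosses components. The component step is exactly single-linkage clustering:

```python
def cut_heavy_cross_edges(edges, t):
    """Drop every edge whose weight exceeds t and whose endpoints are
    not joined by any path of edges with weight <= t.

    Uses union-find over the light edges, so the components it builds
    are single-linkage clusters at threshold t."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, w in edges:
        if w <= t:
            parent[find(a)] = find(b)
    return [
        (a, b, w)
        for a, b, w in edges
        if w <= t or find(a) == find(b)
    ]
```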

fgregg commented 2 years ago

keep going down this road, and we are basically re-implementing clustering...