dedupeio / dedupe

:id: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
https://docs.dedupe.io
MIT License
4.12k stars 551 forks source link

Calculate connected components #62

Closed fgregg closed 10 years ago

fgregg commented 11 years ago

Right now we depend upon networkx to calculated the connected components of the our dupes. However we only use a single method, connected_components.

Might be better to just implement that one algorithm and take the dupes numpy array as an input. http://stackoverflow.com/questions/13191010/a-fast-way-to-find-connected-component-in-a-1-nn-graph

fgregg commented 10 years ago

See also connectivity constraints: http://scikit-learn.org/stable/modules/clustering.html#adding-connectivity-constraints

https://github.com/scikit-learn/scikit-learn/blob/e6003bc892485756215ac4966b18720bc489e5c6/sklearn/cluster/hierarchical.py#L255

fgregg commented 10 years ago

@chancyk, this is probably the memory bottleneck now

zmaril commented 10 years ago

I talked to @boblannon and @apendleton about this some. We agree with your assumption that the connected components calculation is eating memory (and thus slowing down some computations when the machine starts using swapped memory instead of main). I'll recount some of the thinking here:

dedupe will sometimes choke during the clustering stage. During this, dedupe calls out to networkx and uses it to calculate connected components https://github.com/datamade/dedupe/blob/master/dedupe/clustering.py#L70-73 This involves converting the list of edges, which is simply a numpy array, into a Graph object for networkx to work with.

networkx represents all graphs as a dict of dicts, as can be read here:

"The graph internal data structures are based on an adjacency list representation and implemented using Python dictionary datastructures. The graph adjaceny structure is implemented as a Python dictionary of dictionaries; the outer dictionary is keyed by nodes to values that are themselves dictionaries keyed by neighboring node to the edge attributes associated with that edge. This “dict-of-dicts” structure allows fast addition, deletion, and lookup of nodes and neighbors in large graphs. The underlying datastructure is accessed directly by methods (the programming interface “API”) in the class definitions. All functions, on the other hand, manipulate graph-like objects solely via those API methods and not by acting directly on the datastructure. This design allows for possible replacement of the ‘dicts-of-dicts’-based datastructure with an alternative datastructure that implements the same methods."

For calculating the connected components, this is one of the worst possible choices of representations. We don't need to know anything about the nodes themselves beyond their id's and holding onto a full object for each node is overkill. It is totally possible also some weird internal python memory representations are coming into play if the size of the list gets too large. Basically, we'd like to keep the objects as small and the lists as short as possible to avoid weird python memory things.

We recommend that networkx not be used to find connected components. While networkx is a generally good toolkit, it is particularly ill suited for this particular application. The Union-Find algorithm mentioned above should be both better suited and easier to implement.

fgregg commented 10 years ago

@zmaril thanks so much for this research!

fgregg commented 10 years ago

http://stackoverflow.com/questions/3067529/looking-for-a-set-union-find-algorithm