dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
https://docs.dedupe.io
MIT License

disk based connected components #819

Open fgregg opened 4 years ago

fgregg commented 4 years ago

The current scaling bottleneck is usually the connected components step in clustering. The current algorithm loads the entire edgelist into memory.

We need to move this to disk.

Things worth exploring

  1. connected components in sqlite (see the sketch after this list): https://stackoverflow.com/questions/33465859/a-number-of-connected-components-of-a-graph-in-sql

  2. look for or implement a specialized, lightweight graph store like https://link.springer.com/chapter/10.1007/978-3-319-01863-8_20
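For a flavor of option 1, here's a minimal sketch (not dedupe code; table and column names are made up) that labels every vertex with the smallest vertex id reachable from it, using a recursive CTE in SQLite. The recursion visits each (root, vertex) pair, so it's a starting point rather than a scalable answer:

```python
# A minimal sketch, not dedupe code: label every vertex with the smallest
# vertex id reachable from it, via a recursive CTE in SQLite.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE edges (a INTEGER, b INTEGER);
INSERT INTO edges VALUES (1, 2), (2, 3), (4, 5);
""")

rows = con.execute("""
WITH RECURSIVE
  nodes(n) AS (SELECT a FROM edges UNION SELECT b FROM edges),
  walk(root, n) AS (
    SELECT n, n FROM nodes
    UNION  -- UNION (not UNION ALL) dedupes rows, so the recursion terminates
    SELECT walk.root,
           CASE WHEN edges.a = walk.n THEN edges.b ELSE edges.a END
    FROM walk JOIN edges ON walk.n IN (edges.a, edges.b)
  )
SELECT root AS vertex, MIN(n) AS component FROM walk GROUP BY root
""").fetchall()

print(rows)  # [(1, 1), (2, 1), (3, 1), (4, 4), (5, 4)]
```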

fgregg commented 4 years ago

cc @fjsj, perhaps relevant to your interests.

fgregg commented 4 years ago

really interesting idea:

https://twitter.com/mccastelle/status/1263272272787320840

fgregg commented 4 years ago

https://arxiv.org/abs/1802.09478

fjsj commented 4 years ago

Thanks @fgregg, I'll investigate.

With the current implementation, I had memory errors at this line, where numpy needs to materialize part of the mmap-ed scored_pairs/edgelist array into main memory because of fancy indexing:

https://github.com/dedupeio/dedupe/blob/20736bb693aeffa0ac6fb1990485fb39f60c0d7b/dedupe/clustering.py#L36

(Edit: Note that fancy/advanced indexing produces a copy in main memory (instead of a view) because that's numpy's default behavior for all arrays (not only mmap-ed ones). See: https://stackoverflow.com/a/18620619/145349)

I'm not 100% sure, but I think numpy is smart enough not to load the whole scored_pairs into main memory before that line; that line is where things break for me. So: do any of those papers describe a way of running connected components without materializing a large component in main memory?
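
For anyone following along, a tiny illustration of that copy-vs-view distinction (the file name and shape here are made up):

```python
# Tiny illustration of the copy-vs-view behavior described above; the
# file name and shape are made up for the example.
import numpy as np

arr = np.memmap("scores.tmp", dtype="float64", mode="w+", shape=(1_000_000,))

view = arr[10:20]         # basic slicing: a view that stays backed by the file
copy = arr[[10, 11, 12]]  # fancy indexing: a copy materialized in main memory

view[:] = 1.0             # writes through to the file on disk
copy[:] = 2.0             # mutates only the in-memory copy; the file is untouched
```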

Also, it's worth noting that most of my problems were solved when I stopped using non-integer IDs, which dramatically reduced the size of scored_pairs. Maybe it's worth adding a warning when sniff_id_type detects a non-integer ID type. Please LMK if you agree, so I can make a small PR.

fgregg commented 4 years ago

thanks that's very helpful.

fgregg commented 4 years ago

@fjsj i think i have some ideas of what would help here. would it be possible for you to send me an edgelist for a job that ran into memory problems (scored_pairs['pairs'])?

fjsj commented 4 years ago

@fgregg Good to know! I can't share it because it's private data, but I'll try to generate one.

fgregg commented 4 years ago

don't need the underlying data, just the id pairs.

fgregg commented 4 years ago

I've addressed the memmap issue, but much of union-find is still in-memory. Let's see how much headroom the memmap fix buys us.

fgregg commented 4 years ago

superseded by #826

fgregg commented 2 years ago

this is still the bottleneck, and we haven't really made a lot of progress on #826

here's a potential SQL solution (basically merge-find): https://www.db-fiddle.com/f/2zoZgeAaKsXS9x6EyVCyVo/6, inspired by this paper: https://arxiv.org/abs/1802.09478
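
roughly, the idea (a rough sketch, not the db-fiddle verbatim; table names are illustrative): keep a label per vertex and repeatedly lower each label to the minimum among its neighbors until a pass changes nothing:

```python
# A rough sketch of the merge-find / label-propagation idea in SQLite
# (not the db-fiddle verbatim; table names are illustrative).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE edges (a INTEGER, b INTEGER);
INSERT INTO edges VALUES (1, 2), (2, 3), (4, 5);
CREATE TABLE labels (node INTEGER PRIMARY KEY, label INTEGER);
INSERT INTO labels
    SELECT n, n FROM (SELECT a AS n FROM edges UNION SELECT b FROM edges);
""")

while True:
    # lower each node's label to the minimum label among its neighbors;
    # labels only ever decrease, so this converges in at most diameter passes
    changed = con.execute("""
        UPDATE labels SET label = (
            SELECT MIN(nl.label) FROM edges e
            JOIN labels nl
              ON nl.node = CASE WHEN e.a = labels.node THEN e.b ELSE e.a END
            WHERE labels.node IN (e.a, e.b)
        )
        WHERE label > (
            SELECT MIN(nl.label) FROM edges e
            JOIN labels nl
              ON nl.node = CASE WHEN e.a = labels.node THEN e.b ELSE e.a END
            WHERE labels.node IN (e.a, e.b)
        )
    """).rowcount
    if changed == 0:
        break

print(con.execute("SELECT node, label FROM labels ORDER BY node").fetchall())
# [(1, 1), (2, 1), (3, 1), (4, 4), (5, 4)]
```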

but @mccc's idea of trying to do this in the memmapped array still makes a lot of sense.

what we do now is an iterated union-find: we build a union-find-ish data structure in memory. but it would be more efficient, in both time and memory, to do the simplest thing (sketched after this list):

  1. start at a random edge, then do DFS or BFS to find all the edges connected to it
  2. label all the visited edges with the same component label
  3. keep track of all edges visited
  4. once BFS or DFS terminates, pick an unvisited edge and repeat the procedure
  5. repeat until there are no unvisited edges
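
a minimal sketch of that procedure in plain Python (the real edgelist is a memmapped structured array; ordinary containers are used here just to show the control flow):

```python
# Minimal in-memory sketch of the BFS labeling procedure above; plain
# Python containers stand in for the memmapped edgelist.
from collections import defaultdict, deque

def label_components(edgelist):
    incident = defaultdict(list)              # vertex -> incident edge indices
    for i, (a, b) in enumerate(edgelist):
        incident[a].append(i)
        incident[b].append(i)

    labels = {}                               # edge index -> component label
    n_components = 0
    for start in range(len(edgelist)):
        if start in labels:                   # step 3: skip visited edges
            continue
        labels[start] = n_components          # steps 1-2: BFS from a fresh edge
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for v in edgelist[i]:             # both endpoints of this edge
                for j in incident[v]:         # edges sharing that endpoint
                    if j not in labels:
                        labels[j] = n_components
                        queue.append(j)
        n_components += 1                     # steps 4-5: next unvisited edge
    return labels

print(label_components([(1, 2), (2, 3), (4, 5)]))  # -> {0: 0, 1: 0, 2: 1}
```
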
fjsj commented 2 years ago

@fgregg so even after avoiding the random access at sub_graph = edgelist[component] (with sub_graph = edgelist[start:stop]), does the bottleneck still exist? Are there any offending lines, or is the whole algorithm slow?

fgregg commented 2 years ago

you know, i think i'm not sure! i'm running into a problem with this, but i think it's with a version of the library from before this fix.

fgregg commented 2 years ago

if we have a super block, then https://github.com/dedupeio/dedupe/blob/8bfcfb094e0e38bcc55c2c7e87c9161a057c6e11/dedupe/clustering.py#L145 can still be a problem

fgregg commented 2 years ago

more on mmap connected components

fgregg commented 2 years ago

right now our component dict keeps track of edgelist indices; it would probably be much more memory-efficient to track vertices
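
for a back-of-envelope sense (hypothetical structures, not the actual internals):

```python
# Back-of-envelope comparison (hypothetical structures, not dedupe's actual
# internals): a dense component on k vertices has ~k*(k-1)/2 edges, so a
# component dict of edge indices grows quadratically where a dict of
# vertices grows linearly.
edges = [(1, 2), (1, 3), (2, 3)]  # one triangle component, k = 3

by_edge_index = {0: [0, 1, 2]}    # component -> edge indices: k*(k-1)/2 entries
by_vertex = {0: {1, 2, 3}}        # component -> vertices:     k entries
```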

fjsj commented 2 years ago

but how efficient would it be to grab edges from vertices? Seems like a memory-locality trade-off

fjsj commented 2 years ago

The graphframes library (Spark-based) uses this algorithm. A related issue is here: https://github.com/graphframes/graphframes/issues/117