Open fgregg opened 4 years ago
cc @fjsj, perhaps relevant to your interests.
really interesting idea:
Thanks @fgregg, I'll investigate.
With current implementation, I had memory errors at this line, when numpy needs to materialize part of the mmap-ed scored_pairs/edgelist
array into main memory because of the fancy indexing:
(Edit: Note the fancy/advanced indexing produces a copy in main memory (instead of a view) because that's the default numpy behavior for all arrays (not only mmap-ed ones). See: https://stackoverflow.com/a/18620619/145349)
I'm not 100% sure, but I think numpy is smart enough to not load the whole scored_pairs
into main memory up to that line. That line is where things break for me. Therefore, do any of those papers describe a way of running connected components without materializing a large component in main memory?
Also, worth noting most of my problems were solved when I stopped using non-integer IDs. This dramatically reduced the size of scored_pairs
. Maybe it's worth adding a warning when sniff_id_type
detects a non-integer ID type. Please LMK if this is something you agree with so I can make a small PR.
thanks that's very helpful.
@fjsj i think i have some ideas of what would help here. would it possible for you to send me an edgelist for job that ran into memory problems (scored_pairs['pairs']
)
@fgregg Good to know! I can't share it because it's private data, but I'll try to generate one.
don't need the underlying data, just the id pairs.
I've addressed the memmap issue, but much of union-find is still in-memory. Le's see how much headroom the memmap fix buys us.
superseded by #826
this is still the bottleneck, and we haven't really made a lot of progress on #826
here's a potential SQL solution: (basically merge-find: https://www.db-fiddle.com/f/2zoZgeAaKsXS9x6EyVCyVo/6, inspired by this paper: https://arxiv.org/abs/1802.09478)
but @mccc's idea of trying to do this in the memmaped array makes a lot of sense still.
what we do now, is do an iterated union-find. this means that we create a union-find-ish data structure in memory, but it would be more efficient both in time and memory to do the simplest thing and
@fgregg so even after avoiding the random access at sub_graph = edgelist[component]
(with sub_graph = edgelist[start:stop]
) the bottleneck still exists? Are there any offending lines or the whole algorithm is slow?
you know! i think i'm not sure! i'm running into a problem with this, but i think it's with a version of the library that's before this fix.
if we have a super block, then https://github.com/dedupeio/dedupe/blob/8bfcfb094e0e38bcc55c2c7e87c9161a057c6e11/dedupe/clustering.py#L145 can still be a problem
more on mmap connected components
right now our component dict keeps track of edgelist indices, it would probably be much more memory efficient to track vertices
but how efficient would be to grab edges from vertices? Seems to be a memory locality trade-off
graphframes library (Spark-based) uses this algorithm. Related issue is here https://github.com/graphframes/graphframes/issues/117
The current scaling bottleneck is usually the connected components step in clustering. The current algorithm loads the entire edgelist into memory.
We need to move this to disk.
Things worth exploring
connected components in sqlite https://stackoverflow.com/questions/33465859/a-number-of-connected-components-of-a-graph-in-sql
look for or implement a specialized, lightweight graph store like https://link.springer.com/chapter/10.1007/978-3-319-01863-8_20