ekzhu / SetSimilaritySearch

All-pair set similarity search on millions of sets in Python and on a laptop
Apache License 2.0

Using all_pairs() on a 14k dataset causes MemoryErrors #4

Open tsela opened 4 years ago

tsela commented 4 years ago

Hi,

I'm trying to use the all_pairs() function to find all the (near-)duplicates in a set of about 14000 text documents (after first turning them into ngram shingles). However, I'm running into MemoryErrors that crash my script, despite the virtual machine I work on having 16GB of RAM. I've checked, and indeed all of the RAM and swap get maxed out before the script stops working.

Do you have any advice on how to reduce RAM usage, or any indication of how much memory the algorithm uses? I don't run into any memory issues when I use LSH with datasketch, but I'd rather have an exact list of duplicates.
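
For reference, my pipeline looks roughly like the sketch below (the shingle size, file locations and threshold here are placeholders rather than my exact settings):

```python
import glob
from SetSimilaritySearch import all_pairs

def char_shingles(text, k=5):
    """Set of character k-grams for one document (k=5 is a placeholder)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

# ~14000 plain-text documents, already converted and truncated, sitting on disk
paths = sorted(glob.glob("corpus/*.txt"))
shingled = [char_shingles(open(p, encoding="utf-8").read()) for p in paths]

# This is the call that eventually exhausts RAM and swap.
pairs = all_pairs(shingled,
                  similarity_func_name="jaccard",
                  similarity_threshold=0.9)
for x, y, sim in pairs:
    print(paths[x], paths[y], sim)
```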

tsela commented 4 years ago

Quick and wholly unscientific benchmark:

So I understand why running it on the full 14k dataset would cause MemoryErrors.

Still, I really need to be able to run all_pairs() on the full dataset, and I expect to have to run it on even bigger datasets in the future. Any idea on how to reduce memory usage?

ekzhu commented 4 years ago

Thanks for your issue! The exact all-pair search algorithm builds an in-memory data structure (posting lists), and the size can be as big as your original input. Is your input text corpus also in memory? What is the average size of the documents?
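
To give a rough picture of the memory the index needs (this is a simplification, not the actual code), the posting lists map each token to the IDs of the sets that contain it, so there is one entry per (token, set) occurrence and the total size is on the order of the input itself:

```python
from collections import defaultdict

def build_posting_lists(sets):
    """Simplified sketch: one posting-list entry per (token, set) occurrence,
    so memory grows with the total number of tokens across all input sets."""
    posting_lists = defaultdict(list)
    for set_id, s in enumerate(sets):
        for token in s:
            posting_lists[token].append(set_id)
    return posting_lists
```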

ekzhu commented 4 years ago

You can also try the Go version which is probably more efficient due to the programming language used.

ekzhu commented 4 years ago

Issue #5 is also relevant to your problem; you can take a look at my response there.

tsela commented 4 years ago

Hi, thanks for your answer!

No, my input text corpus is not in memory; it consists of text documents saved on disk (after conversion from a corpus of documents in various formats using Tika). But even if it were, that would not explain the runaway memory usage: after conversion to text, I truncate all the large documents down to about 100kB, so that's the maximum size of any document in my corpus before it reaches SetSimilaritySearch. That means the whole corpus can never be larger than about 1.4GB.

As for the average size of the documents, after truncating the larger ones, it's about 20kB, so in total the corpus shouldn't be more than about 300MB.

The only two things I keep in memory are a list of file paths for all the files in the corpus, and the files' character shingles, which I naturally need in order to run all_pairs() on them.

Unfortunately, I can't use the Go version: I'm not familiar with the language at all, and my set-up is Python-only.

If you have any idea where the runaway memory usage comes from, I would very much like to hear it.

tsela commented 4 years ago

Hi again,

Your reply made me think a little more, and I tried to assess the size of the character shingles rather than the corpus itself, since it's the character shingles that are kept in memory and used by all_pairs(). And character shingles can be larger than the document they are created from. But even then, after calculating the size of the character shingles for about half of the dataset (the only part I still have available for now), I arrive at an average size of only 372kB per document. That means the entire corpus of 14000 documents would come to around 5GB as shingles. That's still not enough to explain the memory errors: even with the entire corpus in memory, and all_pairs() building an in-memory data structure as large as the original corpus, we're still only at about 10GB, and my virtual machine has 16GB of memory and 8GB of swap.
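
For what it's worth, here is roughly how I estimated the size of the shingle sets (sys.getsizeof only counts the container, so I add the shingle strings explicitly; it's an approximation rather than an exact accounting, and shingles shared between documents get counted more than once):

```python
import sys

def shingle_set_size(shingles):
    """Approximate in-memory size of one shingle set in bytes:
    the set object itself plus each shingle string it holds."""
    return sys.getsizeof(shingles) + sum(sys.getsizeof(s) for s in shingles)

sizes = [shingle_set_size(s) for s in shingled]  # shingled: the list from my earlier sketch
print("average per document: %.0f kB" % (sum(sizes) / len(sizes) / 1e3))
print("total as shingles:    %.1f GB" % (sum(sizes) / 1e9))
```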

So something else must be going on that causes the runaway memory issue.

tsela commented 4 years ago

BTW, would it be possible to have all_pairs() accept an iterator of sets rather than requiring a list? This would allow the data to be supplied in a memory-efficient way, and would avoid having the data in memory twice, once as the input list and once as the internal data structure.
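
What I have in mind is something along these lines, so the shingle sets never all have to sit in a Python list up front (just a sketch of the idea):

```python
def iter_shingled_docs(paths, k=5):
    """Lazily yield one character-shingle set per document, reading files on demand."""
    for p in paths:
        with open(p, encoding="utf-8") as f:
            text = f.read()
        yield {text[i:i + k] for i in range(len(text) - k + 1)}
```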

ekzhu commented 4 years ago

Interesting. The likely memory bottleneck is in the function _frequency_order_transform. Could you print the length of counts and order? It is possible that your shingled documents are very dissimilar -- so the intermediate data structures counts and order blow up.
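
Conceptually the transform does something like the sketch below (simplified, not the exact code): it counts how often each token appears across all sets and then replaces every token by its global frequency rank, so counts and order both hold one entry per distinct token. That is why a corpus of very dissimilar documents, with few shared shingles, makes them blow up.

```python
from collections import Counter

def frequency_order_transform(sets):
    """Simplified sketch: counts and order each hold one entry per distinct token."""
    counts = Counter(token for s in sets for token in s)                 # first pass
    order = {token: rank                                                 # least frequent first
             for rank, (token, _) in enumerate(sorted(counts.items(),
                                                      key=lambda kv: kv[1]))}
    transformed = [sorted(order[token] for token in s) for s in sets]    # second pass
    return transformed, counts, order
```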

ekzhu commented 4 years ago

Because _frequency_order_transform must scan the input sets twice, all_pairs() can't accept an iterator. Of course, it is possible to modify it to accept a function that returns an iterator.

ekzhu commented 4 years ago

@tsela can you try the new branch callable_sets? You can pass a function that returns an iterator to the input sets. Another thing I would try is to decrease the shingle size so your documents can be more similar.
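
On that branch the idea is that, instead of a materialized list, you pass something that can be called to produce a fresh iterator for each pass, roughly along these lines (the exact interface may differ from this sketch, so check the branch):

```python
from SetSimilaritySearch import all_pairs

def make_shingle_iter():
    """Return a fresh generator over the shingle sets; called once per pass."""
    return iter_shingled_docs(paths)  # the lazy generator sketched earlier

# Hypothetical usage -- see the callable_sets branch for the exact signature.
pairs = all_pairs(make_shingle_iter,
                  similarity_func_name="jaccard",
                  similarity_threshold=0.9)
```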

tsela commented 4 years ago

Hi @ekzhu,

Thanks for all your work! I think you've diagnosed the problem right: using the dataset I have at hand (which contains about 7k of the 14k documents in the corpus), both order and counts clock in at about 23 million entries. Also, using System Monitor to check RAM usage, I noticed that creating them for my 7k documents already consumes about 3.5GB of RAM per object (the list of all shingled documents has a size of about 2.5GB, but somehow creating that list actually consumes closer to 4GB of RAM). This means that with just 7k documents my system needs more than 7GB of free RAM to work. No wonder it blows up with 14k documents (though the total number of distinct shingles probably doesn't grow linearly with the size of the corpus; I unfortunately have no time to test that).

I'll try your recommendation of decreasing the shingle size, as I agree it should have a strong effect on the size of the intermediate structures, and I'll try your new branch afterwards. Unfortunately, I have to work on other projects right now, so I won't be able to get to it immediately, but I'll get back to you as soon as I can.