appaquet / extsort-rs

External sorting (i.e. on-disk sorting) capability for arbitrarily sized iterators
Apache License 2.0

Recommendations for deduplication? #16

Closed GSGerritsen closed 3 years ago

GSGerritsen commented 3 years ago

Let's say I was doing something in-memory like this:

some_vec.par_sort_unstable_by(sort_fn);
some_vec.dedup();

I can use this crate to accomplish the sorting part, but I'd like to not defeat its purpose by then removing duplicates in memory. Do you have any recommendations for how I might be able to dedup the thing getting sorted externally in a way that fits with this crate? I'm wondering if possibly plugging in this dedup adapter might work.

appaquet commented 3 years ago

You could definitely use dedup_iter to achieve that without having to load everything into memory. According to its docs, it provides "an iterator that removes elements that are the same as previous one". So, sorting with extsort and then calling one of the deduping methods from the dedup_iter adapters on the resulting iterator should dedup in a streaming fashion, as long as you don't do any in-memory aggregation after deduping, obviously.
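For illustration, here is a minimal sketch of that pipeline (MyStruct and its Sortable impl are modeled on the crate's README example, the exact encode/decode signatures may vary between extsort versions, and itertools' dedup() is used here as a stand-in for a dedup_iter adapter, since both drop consecutive equal elements):

use std::io::{Read, Write};

use byteorder::{LittleEndian, ReadBytesExt, WriteBytesExt};
use extsort::{ExternalSorter, Sortable};
use itertools::Itertools; // provides `dedup()` on any iterator

#[derive(Debug, Ord, PartialOrd, Eq, PartialEq)]
struct MyStruct(u32);

impl Sortable for MyStruct {
    fn encode<W: Write>(&self, out: &mut W) {
        out.write_u32::<LittleEndian>(self.0).unwrap();
    }

    fn decode<R: Read>(read: &mut R) -> Option<Self> {
        read.read_u32::<LittleEndian>().ok().map(MyStruct)
    }
}

fn main() {
    let data = vec![3u32, 1, 3, 2, 1].into_iter().map(MyStruct);

    // sort externally; small inputs stay in the in-memory buffer
    let sorted = ExternalSorter::new().sort(data).expect("sort failed");

    // after sorting, duplicates are adjacent, so a streaming dedup suffices;
    // keep consuming the iterator in streaming to avoid in-memory aggregation
    for item in sorted.dedup() {
        println!("{:?}", item);
    }
}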

Hope this helps.

GSGerritsen commented 3 years ago

Awesome, thank you very much for the response. That helps clarify my original question, but I have a new one I'd love to get your insight on (let me know if you'd prefer I open a new issue for it to keep things separated).

How does extsort determine how large to make its sorting buffer? My guess would be that this is based on the size of the thing implementing Sortable? The reason I ask is that in my case, in order to produce the iterator containing the structs I want to sort, I need to do some pre-processing to actually yield those structs.

So unlike the example in the docs, where the iterator that gets fed into sort is just a simple mapping of a struct over integers, the iterator I'd be feeding into sort would do something like: read a text file, for each pair of lines do some work, finally yielding MyStructToSort. So my concern is that the memory usage of each item in the iterator is potentially greater than the size of MyStructToSort itself.

So I suppose my question is: what should I be thinking about when it comes to the memory demands of the iterator I pass into sorter.sort()?

I was thinking that if the API somehow supported streaming directly into the ExternalSorter then I could do something like:

for line in my_file {
    let my_struct = do_pre_processing_work(line);
    external_sorter.push(my_struct);
}

That would sort of let me separate the memory demands of the work needed to generate the my_structs. But instead I will probably try something like:

let data = file_contents.iter().map(|line| do_pre_processing_work(line));
sorter.sort(data)

It's the memory usage inside that map call that I'm a bit uncertain about. Sorry, that was a bit long, but if you have any ideas / warnings / advice, that would be great!

edit: I probably should have read the docs more 🤦 - I can probably just fiddle with with_segment_size until I find something that satisfies whatever memory the pre-processing work demands, right?

appaquet commented 3 years ago

How does extsort determine how large to make its sorting buffer? My guess would be that this is based on the size of the thing implementing Sortable? The reason I ask is that in my case, in order to produce the iterator containing the structs I want to sort, I need to do some pre-processing to actually yield those structs.

Unfortunately, there is no reliable way to get the size of the sortable elements. A simple struct that only contains primitive fields is easy to size, but as soon as you start using heap-allocated structures (e.g. String, Vec, etc.), there is no easy way to get the total size referenced by the struct.

Because of that, the buffer size is expressed as a number of items rather than as the total size of their allocated memory. The default is 10,000 items, but it can be changed via with_segment_size(size).
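For example (a sketch with an arbitrary number; the right value depends on how much heap data each of your items can reference):

use extsort::ExternalSorter;

// buffer at most 2,000 items in memory; once the buffer is full, it is
// sorted and spilled to disk as a segment
let sorter = ExternalSorter::new().with_segment_size(2_000);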

Another approach extsort could have taken is to serialize items as they are buffered, which would have allowed the library to control its memory usage more precisely. But extsort was designed to remain fast when the sorted items all fit in the buffer: this tradeoff lets extsort avoid hitting the disk entirely when the data fits in memory, by simply draining the sorted in-memory buffer. This greatly improves performance on smaller datasets.

So unlike the example in the docs, where the iterator that gets fed into sort is just a simple mapping of a struct over integers, the iterator I'd be feeding into sort would do something like: read a text file, for each pair of lines do some work, finally yielding MyStructToSort. So my concern is that the memory usage of each item in the iterator is potentially greater than the size of MyStructToSort itself.

Rust iterators are "pull"-based, so extsort pulls from whatever source you give it; the push pattern in your first snippet isn't possible with this design. Your second snippet would be the way to go, as long as file_contents iterates from the file system directly instead of loading everything into memory first. As for memory usage, whatever your map returns in the second snippet gets buffered in memory and dumped to disk in segments once the buffer reaches the configured segment size.

In the case of a simple line-delimited file, you could simply use BufRead's lines() method, which returns an iterator over the lines of the file. This iterator (see the simple code here) doesn't buffer anything (other than BufRead's small internal buffer) and could be passed to extsort directly.
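As a rough, self-contained sketch of that shape (the file name, segment size, do_pre_processing_work and MyStructToSort are placeholders from the discussion above, pre-processing is simplified to one line per item, and the Sortable signatures again follow the README-style API):

use std::fs::File;
use std::io::{BufRead, BufReader, Read, Write};

use byteorder::{LittleEndian, ReadBytesExt, WriteBytesExt};
use extsort::{ExternalSorter, Sortable};

#[derive(Debug, Ord, PartialOrd, Eq, PartialEq)]
struct MyStructToSort(u32);

impl Sortable for MyStructToSort {
    fn encode<W: Write>(&self, out: &mut W) {
        out.write_u32::<LittleEndian>(self.0).unwrap();
    }

    fn decode<R: Read>(read: &mut R) -> Option<Self> {
        read.read_u32::<LittleEndian>().ok().map(MyStructToSort)
    }
}

// placeholder for the pre-processing step discussed above
fn do_pre_processing_work(line: &str) -> MyStructToSort {
    MyStructToSort(line.len() as u32)
}

fn main() -> std::io::Result<()> {
    let reader = BufReader::new(File::open("input.txt")?);

    // lines() yields one line at a time, so extsort pulls items lazily;
    // only BufReader's small internal buffer plus the sorter's segment
    // buffer are in memory at any point
    let items = reader
        .lines()
        .map(|line| do_pre_processing_work(&line.expect("read error")));

    let sorted = ExternalSorter::new()
        .with_segment_size(2_000)
        .sort(items)
        .expect("sort failed");

    // consume the sorted output in streaming; collecting it into a Vec
    // would bring everything back into memory
    for item in sorted {
        println!("{:?}", item);
    }

    Ok(())
}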

edit: I probably should have read the docs more 🤦 - I can probably just fiddle with with_segment_size until I find something that satisfies whatever memory the pre-processing work demands, right?

Yes, this is the only way for now.

GSGerritsen commented 3 years ago

Thank you, that explanation makes sense, and using with_segment_size worked well for adjusting how much memory I throw at the sorting.