Closed GSGerritsen closed 3 years ago
You could definitely use dedup_iter to achieve that without having to load everything into memory. According to its docs, it provides "an iterator that removes elements that are the same as previous one". So, sorting with extsort and then calling one of the deduping methods from the dedup_iter adapters on the resulting iterator should dedup in streaming fashion, as long as you don't do any in-memory aggregation after deduping, obviously.
Hope this helps.
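The streaming pattern described above can be sketched in plain std Rust. The `DedupSorted` adapter below is a hypothetical stand-in for the dedup_iter adapters, and the hard-coded sorted vector stands in for the iterator extsort's sort would return; it only illustrates why no extra memory is needed once the input is sorted (one previous item is remembered at a time):

```rust
// Minimal sketch of streaming dedup over an already-sorted iterator.
// `DedupSorted` is a hypothetical stand-in for the dedup_iter adapters;
// in practice the sorted iterator would come from extsort's sort().
struct DedupSorted<I: Iterator> {
    inner: I,
    last: Option<I::Item>, // only the previous item is kept in memory
}

impl<I> Iterator for DedupSorted<I>
where
    I: Iterator,
    I::Item: PartialEq + Clone,
{
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        // Pull from the source until an item differs from the previous one.
        for item in self.inner.by_ref() {
            if self.last.as_ref() != Some(&item) {
                self.last = Some(item.clone());
                return Some(item);
            }
        }
        None
    }
}

fn main() {
    // Pretend this sorted stream came out of the external sort.
    let sorted = vec![1, 1, 2, 3, 3, 3, 4].into_iter();
    let deduped: Vec<i32> = DedupSorted { inner: sorted, last: None }.collect();
    assert_eq!(deduped, vec![1, 2, 3, 4]);
    println!("{:?}", deduped);
}
```

Because the adapter only compares each pulled item against the single previous one, memory stays constant regardless of input size.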
Awesome, thank you very much for the response. That helps to clarify my original question, but I have a new one that would be great to get insight on (let me know if you'd like me to open a new issue to keep things separated).
How does extsort determine how large to make its sorting buffer? My guess would be that this is based on the size of the thing implementing Sortable? The reason I ask is that in my case, in order to produce the iterator containing the structs I want to sort, I need to do some pre-processing to actually yield those structs.

So unlike the example in the docs, where the iterator that gets fed into sort is just a simple mapping of a struct over integers, the iterator I'd be feeding into sort would do something like: read a text file, do some work for each pair of lines, and finally yield a MyStructToSort. So my concern is that the memory usage of each item in the iterator is potentially greater than the size of MyStructToSort itself.

So I suppose my question is: what should I be thinking about when it comes to the memory demands of the iterator I pass into sorter.sort()?
I was thinking that if the API somehow supported streaming directly into the ExternalSorter, then I could do something like:
```rust
for line in my_file {
    let my_struct = do_pre_processing_work(line);
    external_sorter.push(my_struct);
}
```
Which would sort of allow me to separate out the memory demands of the work needed to generate the my_structs. But instead I will probably try something like:
```rust
let data = file_contents.iter().map(|line| do_pre_processing_work(line));
sorter.sort(data)
```
And it's the memory usage inside the map call that I'm a bit uncertain about. Sorry, that was a bit long, but if you have any ideas / warnings / advice, that would be great!
edit: I probably should have read the docs more 🤦 - I can probably just fiddle with with_segment_size until I find something that satisfies whatever memory the pre-processing work demands, right?
> How does extsort determine how large to make its sorting buffer? My guess would be that this is based on the size of the thing implementing Sortable? The reason I ask is that in my case, in order to produce the iterator containing the structs I want to sort, I need to do some pre-processing to actually yield those structs.
Unfortunately, there is no reliable way to get the size of the sortable elements. A simple struct that only contains primitive fields is easy to size, but as soon as you start using heap-allocated structures (ex: String, Vec, etc.), there is no easy way to get the total size referenced by the struct.
Because of that, the size of the buffer is expressed in number of items instead of the total size of their allocated memory. The default is 10000 items, but it can be changed via with_segment_size(size).
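The item-count buffering described here can be illustrated with a small self-contained sketch. This is not extsort's actual implementation, just a toy model of the mechanism: items accumulate until the configured segment size is reached, then the buffer is sorted and flushed as a segment (segments live in a Vec here, standing in for sorted runs written to disk):

```rust
// Illustrative sketch (not extsort's real code) of an item-count-based
// segment buffer, analogous to configuring with_segment_size(size).
struct SegmentBuffer {
    segment_size: usize,     // flush threshold, counted in items
    buffer: Vec<u64>,        // in-memory items not yet flushed
    segments: Vec<Vec<u64>>, // stands in for sorted runs on disk
}

impl SegmentBuffer {
    fn new(segment_size: usize) -> Self {
        SegmentBuffer { segment_size, buffer: Vec::new(), segments: Vec::new() }
    }

    fn push(&mut self, item: u64) {
        self.buffer.push(item);
        if self.buffer.len() >= self.segment_size {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if !self.buffer.is_empty() {
            // Sort the run before writing it out as a segment.
            self.buffer.sort_unstable();
            self.segments.push(std::mem::take(&mut self.buffer));
        }
    }
}

fn main() {
    let mut sorter = SegmentBuffer::new(3);
    for item in [9, 1, 5, 7, 2, 8, 4] {
        sorter.push(item);
    }
    // 7 items with segment_size 3: two full segments flushed, one item
    // still buffered in memory (a small dataset like this last run would
    // be sorted in place, never hitting disk).
    assert_eq!(sorter.segments, vec![vec![1, 5, 9], vec![2, 7, 8]]);
    assert_eq!(sorter.buffer, vec![4]);
    println!("segments: {:?}, buffered: {:?}", sorter.segments, sorter.buffer);
}
```

Counting items rather than bytes is what makes the threshold cheap to check, at the cost described above: heap-allocated fields make the real per-item memory invisible to the library.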
Another approach that extsort could have taken is to serialize sorted items as they are buffered. That would have allowed the library to control its memory usage more precisely. But extsort was created to still perform fast when the sorted items all fit in the buffer. This tradeoff allows extsort to avoid hitting the disk when the data fits in memory by simply using the sorted buffered items, which greatly improves performance on smaller datasets.
> So unlike the example in the docs, where the iterator that gets fed into sort is just a simple mapping of a struct over integers, the iterator I'd be feeding into sort would do something like: read a text file, for each pair of lines do some work, finally yielding MyStructToSort. So my concern is that the memory usage of each item in the iterator is potentially greater than the size of MyStructToSort itself.
Rust iterators are "pull", so extsort pulls from whatever source you give it. Your first code snippet is therefore not possible with this kind of pattern. Your second snippet would be the way to go, as long as file_contents iterates from the file system directly instead of loading everything into memory first. As for memory usage, whatever is returned by your map in the second snippet would be buffered in memory and dumped to disk in segments once it reaches the configured segment size.
In the case of a simple line-delimited file, you could simply use BufRead's lines() method, which returns an iterator over the lines of the file. This iterator (see the simple code here) doesn't buffer anything (other than the small BufRead buffer) and could be passed to extsort directly.
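The lazy pipeline suggested here can be sketched in std-only Rust. `parse_record` is a hypothetical stand-in for the pre-processing work, and the collect at the end stands in for handing the iterator to extsort's sort; nothing before that point holds more than one line in memory:

```rust
// Sketch of the suggested approach: iterate lines lazily with
// BufRead::lines(), map each line through the pre-processing step,
// and consume the resulting iterator in streaming fashion.
use std::fs::File;
use std::io::{BufRead, BufReader, Write};

// Hypothetical stand-in for do_pre_processing_work: parse a line as a number.
fn parse_record(line: &str) -> u64 {
    line.trim().parse().unwrap_or(0)
}

fn main() -> std::io::Result<()> {
    // Write a small sample file so the example is self-contained.
    let path = std::env::temp_dir().join("extsort_lines_demo.txt");
    let mut f = File::create(&path)?;
    writeln!(f, "30\n10\n20")?;

    // lines() pulls one line at a time; only BufReader's small internal
    // buffer sits in memory, never the whole file.
    let reader = BufReader::new(File::open(&path)?);
    let records: Vec<u64> = reader
        .lines()
        .map(|line| parse_record(&line.unwrap()))
        .collect();

    // A real program would pass the iterator to the external sorter
    // instead of collecting it.
    assert_eq!(records, vec![30, 10, 20]);
    println!("{:?}", records);
    Ok(())
}
```

Because map is also lazy, the pre-processing cost is paid one item at a time as the sorter pulls, which is exactly the property the reply relies on.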
> edit: I probably should have read the docs more 🤦 - I can probably just fiddle with with_segment_size until I find something that satisfies whatever the pre-processing work demands for memory right?
Yes, this is the only way for now.
Thank you, that explanation makes sense, and using with_segment_size worked well to adjust how much memory I could throw at the sorting.
Let's say I was doing something in-memory like this:
I can use this crate to accomplish the sorting part, but I'd like to not defeat its purpose by then removing duplicates in memory. Do you have any recommendations for how I might dedup the externally sorted output in a way that fits with this crate? I'm wondering if plugging in this dedup adapter might work.