allenai / bff

Apache License 2.0

Is the deduplication scope separate or global when deduplicating multiple files? #9

Closed RulinShao closed 5 months ago

RulinShao commented 11 months ago

Thanks for sharing this great code!! It has been very useful for me!

I'm new to Rust and bloom filters, and I have a question about the deduplication scope in your code. I saw that it runs `let bloom_filter = bloom_filter.clone();` for each input file. Does this mean the bloom filter isn't synced across threads, i.e., the deduplication scope isn't global? I'd also like to know the best practice for multi-threaded processing when I have a very large pretraining corpus to process.

Appreciate your reply. Thank you!!

dirkgr commented 10 months ago

The scope is global. That `.clone()` call only clones a pointer to the bloom filter, not the filter itself. All threads use the same filter.
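
For readers new to Rust, here is a minimal sketch of the pattern described in the reply. It assumes the pointer is a reference-counted `std::sync::Arc`; `ToyBloomFilter` and `dedup_files` are hypothetical names made up for illustration and are not bff's actual types or functions. The toy filter uses a single hash and atomic bit words, which is far simpler than a real bloom filter.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Toy stand-in for a thread-safe bloom filter (hypothetical, not bff's type).
struct ToyBloomFilter {
    bits: Vec<AtomicU64>,
}

impl ToyBloomFilter {
    fn new(num_words: usize) -> Self {
        Self {
            bits: (0..num_words).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    /// Sets the bit for `item`; returns true if that bit was already set.
    fn insert(&self, item: &str) -> bool {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let bit = (hasher.finish() as usize) % (self.bits.len() * 64);
        let mask = 1u64 << (bit % 64);
        let previous = self.bits[bit / 64].fetch_or(mask, Ordering::Relaxed);
        previous & mask != 0
    }
}

/// Processes each "file" on its own thread and returns how many lines were kept.
fn dedup_files(files: Vec<Vec<&'static str>>) -> usize {
    let bloom_filter = Arc::new(ToyBloomFilter::new(1024));
    let handles: Vec<_> = files
        .into_iter()
        .map(|lines| {
            // Clones the Arc (a pointer), not the filter's bits.
            let bloom_filter = bloom_filter.clone();
            thread::spawn(move || {
                lines
                    .into_iter()
                    .filter(|line| !bloom_filter.insert(line))
                    .count()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // Three "files" with overlapping lines share one filter.
    let files = vec![vec!["a", "b"], vec!["b", "c"], vec!["a", "c"]];
    println!("kept {} lines", dedup_files(files));
}
```

Because every thread holds a clone of the same `Arc`, a line inserted while processing one file is seen as a duplicate in every other file. If each thread had its own copy of the filter instead, all six lines in this example would be kept.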