Thanks for sharing this great code! It has been very useful for me.
I'm new to Rust and Bloom filters, and I have a question about the deduplication scope in your code. I saw that it runs
let bloom_filter = bloom_filter.clone();
for each input file. Does this mean the Bloom filter isn't synced across threads, i.e., the deduplication scope isn't global? I'd also like to know the best practice for multi-threaded processing when I have a very large pretraining corpus to process.
I'd appreciate your reply. Thank you!
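To make the question concrete, here is a minimal sketch of the two cases I have in mind. Everything in it is my own assumption rather than something taken from your code: a `HashSet` behind a `Mutex` stands in for the Bloom filter, and `DOCS`, `dedup_with_arc_clone`, and `dedup_with_deep_clone` are hypothetical names. As far as I understand, if `bloom_filter` is an `Arc`, then `clone()` only copies the pointer and all threads share one filter; if it is the filter value itself, each thread gets an independent copy.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical input: the same three "documents" are given to every thread.
const DOCS: [&str; 3] = ["a", "b", "a"];

/// Case 1: the handle is an Arc, so clone() copies the pointer only.
/// Both threads insert into the same set, so deduplication is global.
/// Returns the total number of documents kept across both threads.
fn dedup_with_arc_clone() -> usize {
    let shared = Arc::new(Mutex::new(HashSet::<String>::new()));
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let shared = shared.clone(); // same underlying set
            thread::spawn(move || {
                let mut kept = 0;
                for doc in DOCS {
                    // insert() returns true only if the value was not present yet
                    if shared.lock().unwrap().insert(doc.to_string()) {
                        kept += 1;
                    }
                }
                kept
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

/// Case 2: the handle is the set itself, so clone() makes a full copy.
/// Each thread deduplicates only against its own copy.
fn dedup_with_deep_clone() -> usize {
    let local = HashSet::<String>::new();
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let mut local = local.clone(); // independent copy per thread
            thread::spawn(move || {
                let mut kept = 0;
                for doc in DOCS {
                    if local.insert(doc.to_string()) {
                        kept += 1;
                    }
                }
                kept
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

In this sketch the `Arc` version keeps 2 documents in total across both threads, while the deep-clone version keeps 4, because each thread sees "a" and "b" as new. Which of the two cases matches your code is what I'm trying to confirm.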