Zyphra / Zyda_processing

Apache License 2.0

Distributing job in LSH indexing #3

Open simplew2011 opened 2 months ago

simplew2011 commented 2 months ago
yury-tokpanov commented 2 months ago

Thank you for the suggestion.

You don't need constant communication between nodes: building the LSH index is parallelized by bands, and every band is processed completely independently. The index of every band is written to a separate file. We kept the files separate on each node and only combined them into a single folder when copying to our backup storage.
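To illustrate why no cross-node communication is needed, here is a minimal sketch of per-band indexing. It is not the code from this repository: the function name `build_band_index`, the `band_<i>.pkl` file naming, and the in-memory `signatures` dict (document id to MinHash signature) are all assumptions made for the example.

```python
import pickle
from collections import defaultdict
from pathlib import Path


def build_band_index(signatures, band_idx, rows_per_band, out_dir):
    """Index a single band and write it to its own file.

    signatures: dict mapping doc_id -> sequence of MinHash values (assumed layout).
    Only the slice of each signature belonging to `band_idx` is read, so
    different bands can be processed on different nodes with no coordination.
    """
    start = band_idx * rows_per_band
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        # Documents whose signatures agree on every row of this band
        # land in the same bucket and become duplicate candidates.
        key = tuple(sig[start:start + rows_per_band])
        buckets[key].append(doc_id)

    # Hypothetical naming scheme: one file per band.
    out_path = Path(out_dir) / f"band_{band_idx}.pkl"
    with open(out_path, "wb") as f:
        pickle.dump(dict(buckets), f)
    return out_path
```

In a distributed run, each node would call this for its own subset of band indices and write to local disk; the resulting files only need to be gathered in one place for the later combining step.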

If you want to inspect all the band indices on a single machine, you'll have to move them to storage accessible from that machine.

After we build the LSH index, we perform a separate step that combines duplicate candidates from the separate bands into a single set of duplicates. This set is then used to build a graph and find its connected components.
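The combining step can be sketched roughly as follows. This is an illustration under assumptions, not the repository's implementation: each band index is assumed to be a dict mapping a bucket key to a list of document ids, every pair sharing a bucket is treated as a duplicate without any further similarity check, and a small union-find stands in for whatever graph library is actually used. The names `candidate_pairs` and `connected_components` are hypothetical.

```python
from itertools import combinations


def candidate_pairs(band_indices):
    """Merge duplicate candidates from all bands into one set of pairs."""
    pairs = set()
    for buckets in band_indices:
        for doc_ids in buckets.values():
            # Sorting makes (a, b) and (b, a) collapse to one entry.
            # Note: this enumerates all pairs, which is quadratic in bucket size.
            for a, b in combinations(sorted(doc_ids), 2):
                pairs.add((a, b))
    return pairs


def connected_components(pairs):
    """Group documents linked by duplicate pairs, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())
```

Each resulting component is a cluster of documents connected through at least one shared bucket in some band; documents that never share a bucket do not appear in any component.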

This is a minimal implementation of building and using LSH for deduplicating datasets. It's not meant to be a full, efficient implementation of a standalone distributed LSH index.

We could add more information about how we actually did a distributed run to the documentation.