Closed: konradkalita closed this issue 1 year ago
It is possible, but you might need some code changes. For example, in minhash.py you can see that each item is assigned a cluster id; you can save the entire dataset ds, which carries the cluster id, by adding the following afterward:
ds = ds.map(
    # Look up each record's cluster id in the union-find structure.
    function=lambda _, idx: {"__cluster__": uf.find(idx)},
    with_indices=True,
    num_proc=os.cpu_count(),
    # Random fingerprint because the lambda cannot be hashed for datasets' cache.
    new_fingerprint=str(random.getrandbits(128)),
    desc="Finding clusters...",
)
ds.save_to_disk("OUTPUT_DIR")
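For the analytics part, here is a rough sketch (assuming the Hugging Face datasets library and the OUTPUT_DIR used above) of how you could load the saved dataset back and turn it into an example_id/cluster_id table:

from collections import Counter
from datasets import load_from_disk

ds = load_from_disk("OUTPUT_DIR")  # the dataset saved above, one "__cluster__" id per row

# The row index serves as example_id; "__cluster__" is the cluster_id.
pairs = list(enumerate(ds["__cluster__"]))

# Example analysis: the ten largest duplicate clusters.
cluster_sizes = Counter(cluster_id for _, cluster_id in pairs)
print(cluster_sizes.most_common(10))

Everything sharing a cluster_id was grouped as a duplicate of the same cluster, so you can join this table back to your original data for further analysis.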
Thanks, that works for me. Actually, I wanted to ask: do you have any recommendations for setting the MinHash parameters to reduce the amount of space needed to store the hashes? I have a collection of about 200 million short texts that occupies around 10GB of disk space, but running the MinHash pipeline generates about 500GB of hashes, which seems crazy to me. I guess I should tune the num_perm and ngram parameters, but I have no idea what values to use, especially for num_perm.
BTW: Thank you for your work on this project.
I would suggest experimenting with a small portion (as much as you can comfortably handle) of your data to figure out some good settings. For short text and limited space, here are some intuitions I have:
Short text is challenging in its own way and unfortunately I don't have much experience in that. Please feel free to share any issues you may encounter, I am more than happy to help.
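For num_perm specifically, a rough back-of-the-envelope estimate may help (just a sketch; the 8 bytes per hash value is an assumption, and the real on-disk footprint also includes serialization and index overhead):

# Approximate raw MinHash signature size for a corpus of short texts.
num_docs = 200_000_000          # roughly the corpus size described above
bytes_per_hash = 8              # assumption: one 64-bit value per permutation
for num_perm in (64, 128, 256):
    size_gb = num_docs * num_perm * bytes_per_hash / 1e9
    print(f"num_perm={num_perm}: ~{size_gb:.0f} GB of raw signatures")

Signature storage grows linearly with num_perm, so halving it roughly halves the space, at the cost of a less precise Jaccard estimate.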
Closing this for now, feel free to open another issue if you have any more questions.
Hi, my use case is that I would like not only to remove duplicates from a dataset but also to do some analytics on what was clustered as duplicates. So as a result I would like to have a table with the columns example_id and cluster_id. Is that possible with the current code? If not, what would be the best place to add that feature?