facebook / ThreatExchange

Trust & Safety tools for working together to fight digital harms.
https://developers.facebook.com/docs/threat-exchange

TMK Clusterize on thousands of .tmk files #211

Open jlekas opened 4 years ago

jlekas commented 4 years ago

I was attempting to use the tmk-clusterize tool on a set of around 50,000 videos, and the program crashed before it finished running. Is there a limit to the number of hash files that the clusterize tool can operate on?

johnkerl commented 4 years ago

We need to incorporate FAISS into the clusterizer.

That said, the program might have run out of RAM as it's a single-threaded all-in-one demo -- do you have more context on why it crashed, and/or what you saw on the terminal around the time of the crash?
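
For a rough sense of scale, the sketch below counts the comparisons an all-pairs clusterizer would make and the memory needed to keep every hash loaded at once. It assumes every pair of hashes is compared, and the per-file size is a placeholder assumption rather than a measured figure, so substitute the actual size of your `.tmk` files.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def resident_bytes(n: int, bytes_per_hash: int) -> int:
    """Memory needed to keep all n hashes loaded at the same time."""
    return n * bytes_per_hash


if __name__ == "__main__":
    n_videos = 50_000                # collection size from the report above
    assumed_hash_bytes = 256 * 1024  # ASSUMPTION: placeholder size per .tmk file

    print(f"pairwise comparisons: {pair_count(n_videos):,}")
    gib = resident_bytes(n_videos, assumed_hash_bytes) / 2**30
    print(f"memory to hold all hashes: {gib:.1f} GiB")
```

At 50,000 videos that is roughly 1.25 billion pairs, which would be consistent with a single thread running for days.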

jlekas commented 4 years ago

The program was left running for 1-2 days. At some point I switched back to the open terminal and found that it had stopped running, although I unfortunately do not remember the output. Do you think it would be worthwhile to fork the project and write a multithreaded or multiprocess version of the file to help it run on larger data sets, or would it be better to wait for FAISS to be incorporated into the clusterizer?

johnkerl commented 4 years ago

@jlekas both. :)

I really apologize for the delay. My team is actively working on unblocking a couple key flows in ThreatExchange and I've not been able to dedicate time to unblocking TMK+FAISS ...

jlekas commented 4 years ago

Oh, no worries. Thank you so much for your timely responses and help in solving this problem. I will look into writing a multithreaded and multiprocess version of the file to see whether it works well for my larger data set.
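
One possible shape for such a version, as a minimal sketch rather than the tool's actual algorithm: split the rows of the pair matrix across workers, collect matching pairs, and merge them with union-find. Everything here is hypothetical, including the function names, the use of plain cosine similarity as a stand-in for the real TMK pair score, and the single-linkage merge rule. The executor is passed in so the demo can use threads. For CPU-bound work in Python, a `ProcessPoolExecutor` would be the likelier choice.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from math import sqrt
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; a stand-in for the real TMK pair score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matches_for_rows(task: Tuple[List[Vector], range, float]) -> List[Tuple[int, int]]:
    """Compare each row i in `rows` against every j > i; return matching pairs."""
    vectors, rows, threshold = task
    return [
        (i, j)
        for i in rows
        for j in range(i + 1, len(vectors))
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


def find(parent: List[int], x: int) -> int:
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def clusterize(vectors: List[Vector], threshold: float,
               executor: Executor, n_chunks: int = 4) -> List[List[int]]:
    """Group indices whose vectors are linked by at least one matching pair."""
    n = len(vectors)
    # Interleaved row ranges balance the load: low rows have more j > i partners.
    tasks = [(vectors, range(k, n, n_chunks), threshold) for k in range(n_chunks)]
    parent = list(range(n))
    for pairs in executor.map(matches_for_rows, tasks):
        for i, j in pairs:
            parent[find(parent, i)] = find(parent, j)
    clusters: Dict[int, List[int]] = {}
    for i in range(n):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    demo = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0), (0.02, 1.0), (-1.0, 0.0)]
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(clusterize(demo, 0.95, pool))  # [[0, 1], [2, 3], [4]]
```

This spreads the comparisons across workers but still performs all of them, so it shortens wall-clock time without changing the quadratic growth. An index such as FAISS would instead aim to avoid most comparisons.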

github-actions[bot] commented 3 years ago

This issue is being marked as stale because it has no recent activity. It will be closed automatically in 14 days unless it becomes active before then. To prevent closing, please comment on the issue before that time. If the issue is no longer relevant, please feel free to close it prior to that time.

Cleaning up stale issues helps redirect focus to the issues top of mind of the community. Thank you for your help with this.

github-actions[bot] commented 3 years ago

This issue has been closed due to no recent activity. If you need this issue reopened, please let us know. Thanks!

Dcallies commented 2 years ago

@dxdc - How are things going after your changes - is it performant enough to consider this task closed out?

dxdc commented 2 years ago

@Dcallies I've tested with up to 25k .tmk files and it works great. There may be better ways to optimize the parallelization, but I would imagine they would yield only smaller-scale improvements at this point.