jlekas opened this issue 4 years ago (status: Open)
We need to incorporate FAISS into the clusterizer.
That said, the program might have run out of RAM as it's a single-threaded all-in-one demo -- do you have more context on why it crashed, and/or what you saw on the terminal around the time of the crash?
The program was left running for 1 to 2 days; at some point I switched to the open terminal and found it had stopped running, although I unfortunately do not remember the output. Do you think it would be worthwhile to fork the project and write a multithreaded or multiprocessed version of the file to help it run on larger data sets, or would it be better to wait for FAISS to be incorporated into the clusterizer?
@jlekas both. :)
I really apologize for the delay. My team is actively working on unblocking a couple of key flows in ThreatExchange, and I've not been able to dedicate time to unblocking TMK+FAISS ...
Oh, no worries. Thank you so much for your timely responses and help in solving this problem. I will look into writing a multithreaded or multiprocessed version of the file to see if it works well for my larger data set.
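A rough sketch of what that parallelized version could look like: farm the O(n²) pairwise-scoring step out to worker processes with `multiprocessing.Pool`. The function names, the cosine scoring, and the threshold are illustrative assumptions, not the actual tmk-clusterize internals:

```python
# Hypothetical sketch of parallel pairwise scoring; not the real
# tmk-clusterize code. Each worker scores a chunk of index pairs.
from itertools import combinations
from multiprocessing import Pool
import math

def cosine(a, b):
    # Plain cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def score_pair(pair):
    (i, a), (j, b) = pair
    return i, j, cosine(a, b)

def all_pairs_parallel(vectors, threshold=0.7, processes=4):
    # Enumerate once so each pair carries its original indices.
    items = list(enumerate(vectors))
    with Pool(processes) as pool:
        scores = pool.map(score_pair, combinations(items, 2), chunksize=256)
    # Keep only pairs similar enough to become cluster edges.
    return [(i, j) for i, j, s in scores if s >= threshold]

if __name__ == "__main__":
    vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(all_pairs_parallel(vecs, threshold=0.95))  # only (0, 1) passes
```

Note this parallelizes the CPU work but not the memory: all pairs still get materialized, so for very large sets streaming results (e.g. `imap_unordered`) or an ANN index like FAISS is the bigger win.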
This issue is being marked as stale because it has no recent activity. It will be closed automatically in 14 days unless it becomes active before then. To prevent closing, please comment on the issue before that time. If the issue is no longer relevant, please feel free to close it prior to that time.
Cleaning up stale issues helps redirect focus to the issues that are top of mind for the community. Thank you for your help with this.
This issue has been closed due to no recent activity. If you need this issue reopened, please let us know. Thanks!
@dxdc - How are things going after your changes - is it performant enough to consider this task closed out?
@Dcallies I've tested up to 25k tmk files, and it works great. There may be better ways to optimize the parallelization, but at this point I would imagine those would be smaller-scale improvements.
I was attempting to use the tmk-clusterize tool on a set of around 50,000 videos, and the program crashed before it finished running. Is there a limit to the number of hash files that the clusterize tool can operate on?