The MinHash-based Code Relationship & Investigation Toolkit (MCRIT) is a framework created to simplify the application of the MinHash algorithm in the context of code similarity.
Refactor doDbCleanup to support deletion of multiple query_samples #72
The current code for `doDbCleanup` has a few problems.
Let's look at a scenario: a sample with hash `xyz` was queried 5 times. Right now the algorithm is to go over finished jobs, find a job connected to `xyz`, and then search the `query_samples` collection for a document associated with `xyz`. At this point `samples_to_be_deleted` will contain `{"xyz": some_entry}`, where `some_entry` is just one of the 5 queries done for this sample, and, if I understand the code correctly, it is not even guaranteed to be the one associated with the job (from the `queue` collection) we are currently looking at. The remaining 4 documents in `query_samples` will not be deleted.
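To make the failure mode concrete, here is a rough pymongo-style sketch of the behaviour described above; the database name, field names, and overall structure are illustrative assumptions, not the actual MCRIT code:

```python
# Illustrative sketch only: hypothetical field names ("sha256", "status") and
# database name, assuming a pymongo-backed store as described in the issue.
from pymongo import MongoClient

db = MongoClient()["mcrit"]  # hypothetical database name

samples_to_be_deleted = {}
for job in db["queue"].find({"status": "finished"}):
    sample_hash = job["sha256"]  # e.g. "xyz"
    # find_one() returns a single, arbitrary matching document, so for a
    # sample queried 5 times, 4 query_samples documents are never collected,
    # and the one that is collected need not belong to the current job.
    entry = db["query_samples"].find_one({"sha256": sample_hash})
    if entry is not None:
        samples_to_be_deleted[sample_hash] = entry
```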
Then we get to failed job deletion, and there are a couple of scenarios. If this sample was found when going over finished jobs, then `xyz` is already in `samples_to_be_deleted`, so we skip deletion of `query_samples`/`query_functions` for all failed jobs for this sample. If this sample only ever appears in failed jobs, then again we only delete the first failed `query_samples`/`query_functions` entry.
Instead, I think there should at least be a way to delete multiple `query_samples`/`query_functions` entries for a job. This is a draft for doing that, let me know what you think.
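As a minimal sketch of that direction (again with pymongo and hypothetical field/collection names, not a drop-in for the actual MCRIT code), the cleanup could remove every matching document instead of a single one:

```python
# Sketch under the same assumptions as above: remove all query artifacts
# associated with one queried sample hash, instead of only the first match.
from pymongo import MongoClient

def delete_query_entries_for_sample(db, sample_hash: str) -> None:
    samples_result = db["query_samples"].delete_many({"sha256": sample_hash})
    functions_result = db["query_functions"].delete_many({"sha256": sample_hash})
    print(f"removed {samples_result.deleted_count} query_samples and "
          f"{functions_result.deleted_count} query_functions documents")

if __name__ == "__main__":
    db = MongoClient()["mcrit"]  # hypothetical database name
    delete_query_entries_for_sample(db, "xyz")
```

Keying the `delete_many` calls on the sample hash would presumably also cover finished and failed jobs alike, so the ordering between the two deletion passes would matter less.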