danielplohmann / mcrit

The MinHash-based Code Relationship & Investigation Toolkit (MCRIT) is a framework created to simplify the application of the MinHash algorithm in the context of code similarity.
GNU General Public License v3.0

Refactor doDbCleanup to support deletion of multiple query_samples #72

Closed yankovs closed 6 months ago

yankovs commented 7 months ago

The current code for doDbCleanup has a few problems.

Let's look at a scenario: a sample with hash xyz was queried 5 times. Right now, the algorithm goes over the finished jobs, finds a job connected to xyz, and then searches the query_samples collection for a document associated with xyz. At this point samples_to_be_deleted will contain {"xyz": some_entry}, where some_entry is just one of the 5 query_samples documents created for this sample and, if I understand the code correctly, is not even guaranteed to be the one associated with the job (from the queue collection) we are currently looking at. The remaining 4 documents in query_samples will not be deleted.

Then we get to failed job deletion, where there are two scenarios. If this sample was already found when going over finished jobs, then xyz is already in samples_to_be_deleted, so we skip deletion of the query_samples/query_functions entries of all failed jobs for this sample. If no query for this sample ever finished successfully, then we again only delete the query_samples/query_functions entries of the first failed query.
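To make the scenario concrete, here is a minimal sketch of the pattern I'm describing. It is not the actual doDbCleanup code: the collections are plain Python lists, and the field names (`job_id`, `sample_id`, `sha256`) and helper functions are hypothetical stand-ins chosen for illustration.

```python
# Hypothetical in-memory stand-ins for the queue and query_samples collections.
# Five queries of the same sample "xyz" produce five jobs and five documents.
finished_jobs = [{"job_id": i, "sha256": "xyz"} for i in range(5)]
query_samples = [{"sample_id": i, "sha256": "xyz"} for i in range(5)]


def find_one(collection, sha256):
    """Return the first document with the given hash, like a find_one lookup."""
    for doc in collection:
        if doc["sha256"] == sha256:
            return doc
    return None


def collect_single_entry_per_hash(jobs, samples):
    """Assumed shape of the current logic: one dict slot per sample hash."""
    samples_to_be_deleted = {}
    for job in jobs:
        sha256 = job["sha256"]
        if sha256 in samples_to_be_deleted:
            # Hash already handled, so later jobs for it are skipped.
            continue
        entry = find_one(samples, sha256)
        if entry is not None:
            samples_to_be_deleted[sha256] = entry
    return samples_to_be_deleted


marked = collect_single_entry_per_hash(finished_jobs, query_samples)
leftover = [doc for doc in query_samples if doc not in marked.values()]
print(len(marked), len(leftover))  # prints "1 4"
```

Because the dict is keyed by hash, it can only ever hold one document per sample, no matter how many jobs refer to that sample.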

Instead, I think there should at least be a way to delete more than one query_samples/query_functions entry for a job. This is a draft for doing that; let me know what you think.
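As a rough illustration of the idea (not the code in this draft), the sketch below maps each hash to a list of entries instead of a single entry, so every query_samples document for a sample can be marked. The list-based collections, the field names (`job_id`, `sample_id`, `sha256`) and the helper name are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins: "xyz" was queried five times,
# three queries finished and two failed.
finished_jobs = [{"job_id": i, "sha256": "xyz"} for i in range(3)]
failed_jobs = [{"job_id": i, "sha256": "xyz"} for i in range(3, 5)]
query_samples = [{"sample_id": i, "sha256": "xyz"} for i in range(5)]


def collect_all_entries_per_hash(jobs, samples):
    """Map each queried hash to every matching query_samples document."""
    by_hash = defaultdict(list)
    for doc in samples:
        by_hash[doc["sha256"]].append(doc)

    samples_to_be_deleted = {}
    for job in jobs:
        sha256 = job["sha256"]
        if sha256 in by_hash:
            # Store the full list, so repeated queries are all covered.
            samples_to_be_deleted[sha256] = by_hash[sha256]
    return samples_to_be_deleted


marked = collect_all_entries_per_hash(finished_jobs + failed_jobs, query_samples)
print(len(marked["xyz"]))  # prints "5"
```

With a list per hash, finished and failed jobs can be handled in the same pass, and the matching query_functions entries could then be removed for each collected sample in the list.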