danielplohmann / mcrit

The MinHash-based Code Relationship & Investigation Toolkit (MCRIT) is a framework created to simplify the application of the MinHash algorithm in the context of code similarity.
GNU General Public License v3.0

Add TTL to query_* documents #59

Closed yankovs closed 8 months ago

yankovs commented 10 months ago

In an automated system, the query_* collections grow very large very quickly, because every query inserts into them. Once a couple of weeks or months have passed since a query, the MCRIT db has probably changed, so the sample will probably need to be re-queried anyway. Saving old results is therefore not that useful.

It would be nice if there were an option to turn on a TTL and just remove such query-related data after some user-defined period.

yankovs commented 10 months ago

I see that query_samples documents already have a timestamp key, so a TTL shouldn't be a problem (using the built-in TTL feature or otherwise), but query_functions is a different story. Should a timestamp be added to those documents as well? 🤔
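One caveat for the built-in route: MongoDB TTL indexes only expire documents whose indexed field holds a BSON date (or an array of dates). The shell session later in this thread compares `timestamp` against the string `"2024-01-03T08-47-20"`, which suggests the field is stored as a string and would need converting first. A minimal sketch of that conversion follows; the helper name `to_ttl_date` is hypothetical, and treating the stored value as UTC is an assumption.

```python
from datetime import datetime, timezone

# Layout inferred from the "2024-01-03T08-47-20" value shown later in this thread.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H-%M-%S"


def to_ttl_date(timestamp):
    """Parse a string timestamp into a timezone-aware datetime.

    Assumption: the stored value is in UTC; the thread does not say.
    """
    parsed = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)
```

A date value like this could be written to a separate field and indexed with `expireAfterSeconds`. It would still only cover query_samples, not the dependent query_functions documents.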

danielplohmann commented 10 months ago

Hey!

Functions are not supposed to exist on their own (i.e. without a sample associated with them). Therefore, the way to go would be to delete all functions with the matching sample_id from query_functions whenever a sample is removed from query_samples.

yankovs commented 10 months ago

I see. So the preferred way would be to create a manual service that deletes a sample from query_samples and its associated functions?

How would that work if a sample appears multiple times in query_samples? Would the id be the same? Because if not, how can you know which functions belong to this specific entry of the sample in the collection and not to another one?

danielplohmann commented 10 months ago

Yes, it would be the same. There is a unique relationship between a sample_id and the associated function_ids; functions are never shared across sample_ids.

yankovs commented 10 months ago

So deleting the functions of a specific sample will also delete the functions of any other instance of this sample in query_samples?

danielplohmann commented 10 months ago

Hey! No, each instance in query_samples has its own set of functions in query_functions, at least if you submitted with the force_recalculation=True parameter:

> db.query_samples.find({}, {"sha256": 1, "sample_id": 1})
{ "_id" : ObjectId("65951f011d9b2d99e3ee2773"), "sample_id" : -1, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
{ "_id" : ObjectId("65951f181d9b2d99e3ee2b62"), "sample_id" : -2, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
{ "_id" : ObjectId("65951f1f1d9b2d99e3ee2f51"), "sample_id" : -3, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
{ "_id" : ObjectId("65951f271d9b2d99e3ee3340"), "sample_id" : -4, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
> db.query_functions.find({"offset": 4243663,}, {"function_id": 1, "sample_id": 1})
{ "_id" : ObjectId("65951f011d9b2d99e3ee2786"), "function_id" : -19, "sample_id" : -1 }
{ "_id" : ObjectId("65951f181d9b2d99e3ee2b75"), "function_id" : -1023, "sample_id" : -2 }
{ "_id" : ObjectId("65951f201d9b2d99e3ee2f64"), "function_id" : -2027, "sample_id" : -3 }
{ "_id" : ObjectId("65951f281d9b2d99e3ee3353"), "function_id" : -3031, "sample_id" : -4 }

So the approach would be to first query with a timestamp filter to get all the sample_ids in question, and then remove the entries with those sample_ids from both collections (query_samples, query_functions):

> db.query_samples.find({"timestamp": {"$lt": "2024-01-03T08-47-20"}}, {"sha256": 1, "sample_id": 1, _id: 0})
{ "sample_id" : -1, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
{ "sample_id" : -2, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
> db.query_samples.remove({"sample_id": {"$in": [-1, -2]}})
WriteResult({ "nRemoved" : 2 })
> db.query_functions.remove({"sample_id": {"$in": [-1, -2]}})
WriteResult({ "nRemoved" : 2008 })
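
The same two-step cleanup can be sketched in Python. This is a minimal illustration, not MCRIT's implementation: plain lists of dicts stand in for the two collections, the function names `cutoff_string` and `purge_expired` are hypothetical, and the timestamp layout is inferred from the `"2024-01-03T08-47-20"` value used in the filter above. Because that layout is zero-padded and ordered from year down to second, a plain string comparison sorts chronologically, which is why `$lt` works on the string field.

```python
from datetime import datetime, timedelta

# Layout inferred from the "2024-01-03T08-47-20" filter value in the shell session.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H-%M-%S"


def cutoff_string(now, max_age):
    """Format the cutoff so it compares lexicographically with stored timestamps."""
    return (now - max_age).strftime(TIMESTAMP_FORMAT)


def purge_expired(query_samples, query_functions, cutoff):
    """Drop samples older than the cutoff, together with their functions.

    Both arguments are lists of dicts standing in for the MongoDB collections.
    Returns (kept_samples, kept_functions, expired_ids).
    """
    # Step 1: collect the sample_ids whose timestamp lies before the cutoff.
    expired_ids = {s["sample_id"] for s in query_samples if s["timestamp"] < cutoff}
    # Step 2: remove entries with those sample_ids from both collections.
    kept_samples = [s for s in query_samples if s["sample_id"] not in expired_ids]
    kept_functions = [f for f in query_functions if f["sample_id"] not in expired_ids]
    return kept_samples, kept_functions, expired_ids
```

Against a live database, step 1 corresponds to the `find` with the `$lt` filter and step 2 to one delete per collection with an `$in` filter on sample_id, as in the shell session.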

danielplohmann commented 9 months ago

Flow of action for a potential implementation, as discussed yesterday:

danielplohmann commented 8 months ago

Alright, this should fully work now. At least locally, it behaved as expected and removed query samples, their jobs, and their results if they were older than the temporal cutoff. For SMDA reports, this was trickier, as the sample's sha256 was not easily accessible. I opted for parsing it out of the SMDA report with a regex, without deserialization, and that seems to work. Feel free to play around with it and let me know if it works as intended!
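
To illustrate the regex idea, here is a minimal sketch of pulling a sha256 out of a serialized report without deserializing it. The pattern and the helper name `extract_sha256` are hypothetical and not the regex MCRIT actually uses; the sketch assumes the report is JSON text containing a `"sha256"` key whose value is a 64-character hex string.

```python
import re

# Assumption: the serialized report contains a JSON field like "sha256": "<64 hex chars>".
SHA256_FIELD = re.compile(r'"sha256"\s*:\s*"([0-9a-fA-F]{64})"')


def extract_sha256(serialized_report):
    """Return the first sha256 found in the raw report text, or None if absent."""
    match = SHA256_FIELD.search(serialized_report)
    return match.group(1).lower() if match else None
```

Scanning the raw text avoids parsing a potentially large report just to read one field, at the cost of relying on the field's textual layout.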