Add TTL to query_* documents

yankovs commented 10 months ago

In an automated system, the inserts into the query_* collections during queries grow very large very quickly. A couple of weeks or months after a query, the MCRIT db has probably changed, so the sample would likely need to be re-queried anyway. Saving old results is therefore not that useful.

It would be nice if there were an option to turn on a TTL and remove such query-related data after some user-defined period.

yankovs commented 10 months ago

I see that query_samples documents already have a timestamp key, so TTL shouldn't be a problem there (using the builtin TTL feature or otherwise), but query_functions is a different story. Should a timestamp be appended to it as well? 🤔

danielplohmann commented 10 months ago

Hey!

Functions are not supposed to exist on their own (i.e. without a sample associated with them). Therefore, the way to go would be to delete all functions with the matching sample_id from query_functions whenever a sample is removed from query_samples.

yankovs commented 10 months ago

I see. So the preferred way would be to create a manual service that deletes a sample from query_samples and its associated functions?

How would that work if a sample appears multiple times in query_samples? Would the id be the same? Because if not, how can you know which functions belong to this specific entry of the sample in the collection and not to another one?

danielplohmann commented 10 months ago

Yes, it would be the same. There is a unique relationship between a sample_id and the associated function_ids; functions are never shared across sample_ids.

yankovs commented 10 months ago

So a delete of the functions from a specific sample will delete the functions for any other instance of this sample in query_samples?

danielplohmann commented 10 months ago

Hey! No, each instance in query_samples will have its own set of functions in query_functions, at least if you submitted with the force_recalculation=True parameter:

> db.query_samples.find({}, {"sha256": 1, "sample_id": 1})
{ "_id" : ObjectId("65951f011d9b2d99e3ee2773"), "sample_id" : -1, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
{ "_id" : ObjectId("65951f181d9b2d99e3ee2b62"), "sample_id" : -2, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
{ "_id" : ObjectId("65951f1f1d9b2d99e3ee2f51"), "sample_id" : -3, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
{ "_id" : ObjectId("65951f271d9b2d99e3ee3340"), "sample_id" : -4, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
> db.query_functions.find({"offset": 4243663,}, {"function_id": 1, "sample_id": 1})
{ "_id" : ObjectId("65951f011d9b2d99e3ee2786"), "function_id" : -19, "sample_id" : -1 }
{ "_id" : ObjectId("65951f181d9b2d99e3ee2b75"), "function_id" : -1023, "sample_id" : -2 }
{ "_id" : ObjectId("65951f201d9b2d99e3ee2f64"), "function_id" : -2027, "sample_id" : -3 }
{ "_id" : ObjectId("65951f281d9b2d99e3ee3353"), "function_id" : -3031, "sample_id" : -4 }

So the approach would be to first query based on timestamp filter to get all the sample_ids in question and then remove entries with that sample_id from both collections (query_samples, query_functions):

> db.query_samples.find({"timestamp": {"$lt": "2024-01-03T08-47-20"}}, {"sha256": 1, "sample_id": 1, _id: 0})
{ "sample_id" : -1, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
{ "sample_id" : -2, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
> db.query_samples.remove({"sample_id": {"$in": [-1, -2]}})
WriteResult({ "nRemoved" : 2 })
> db.query_functions.remove({"sample_id": {"$in": [-1, -2]}})
WriteResult({ "nRemoved" : 2008 })
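
The same two-step cleanup could be scripted from Python. What follows is only a minimal sketch, assuming pymongo-style collection objects (anything exposing find() and delete_many()) and assuming the timestamp field is stored as a string in the format shown above, so that plain string comparison matches chronological order. The function name and the ttl_days parameter are hypothetical and not part of MCRIT.

```python
from datetime import datetime, timedelta, timezone

# Timestamp layout as seen in the query above, e.g. "2024-01-03T08-47-20".
TIMESTAMP_FORMAT = "%Y-%m-%dT%H-%M-%S"


def cleanup_expired_queries(query_samples, query_functions, ttl_days, now=None):
    """Delete query samples older than ttl_days together with their functions.

    Hypothetical helper: both collection arguments are expected to behave
    like pymongo collections (find, delete_many).
    Returns (number of removed samples, number of removed functions).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=ttl_days)).strftime(TIMESTAMP_FORMAT)

    # Step 1: collect the sample_ids of all expired query samples.
    expired = query_samples.find(
        {"timestamp": {"$lt": cutoff}}, {"sample_id": 1, "_id": 0}
    )
    sample_ids = [doc["sample_id"] for doc in expired]
    if not sample_ids:
        return 0, 0

    # Step 2: remove the functions before the samples. If the run is
    # interrupted in between, the samples are still found by the timestamp
    # query on the next run, so no functions are left dangling.
    selector = {"sample_id": {"$in": sample_ids}}
    removed_functions = query_functions.delete_many(selector).deleted_count
    removed_samples = query_samples.delete_many(selector).deleted_count
    return removed_samples, removed_functions
```

Note that this sketch deletes from query_functions first, which is the reverse of the shell session above; the end result is the same, it just keeps an interrupted run retryable.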

danielplohmann commented 9 months ago

Flow of action for a potential implementation, as discussed yesterday:

[x] Use MinHashIndex._indexCallBack() to frequently check whether a new cleanup job should be scheduled. This could be done by fetching the last query cleanup timestamp from MongoDbStorage and then triggering a new cleanup job if necessary.
- The lastQueryCleanupTimestamp field could be placed in the db.settings collection. It would then need a setter and getter like the existing _getDbState and _getDbTimestamp, but publicly accessible.
[x] The method that performs the cleanup should be placed in the Worker and decorated with @Remote, which makes it callable in MinHashIndex due to the transparent job passthrough enabled by metaprogramming.
[x] For the actual cleanup, MongoDbStorage.deleteSample(sample_id) can be used, as it also works for query samples.
[x] Along with the query samples/functions, any jobs and their results tied to these samples should be deleted as well, in order to free up space and avoid them becoming dangling. For this, MinHashIndex.getQueueData can be used to get a list of jobs. getQueueData() is directly callable from within MinHashIndex due to inheritance. Methods of interest are getMatchesForUnmappedBinary and getMatchesForMappedBinary. Once a list of jobs has been obtained, they can be inspected for being in the is_finished state and, e.g., for their finished_at timestamp. If the difference between this timestamp and our last cleanup has passed the desired threshold, the job can be deleted from the queue and the accompanying result can be deleted from gridFS. Right now, there is no method to delete results, but I will add it with one of the next versions of MCRIT.
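
The scheduling check and the job selection described in the list above can be sketched in isolation. This is not MCRIT code: JobInfo is a hypothetical stand-in for whatever getQueueData() returns, the function names are made up for illustration, and measuring a job's age against the current time (rather than against the last cleanup) is a simplification on my part.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class JobInfo:
    """Hypothetical stand-in for a queue job; not MCRIT's actual job class."""
    job_id: str
    method: str
    is_finished: bool
    finished_at: Optional[datetime]


# Job types that produce query results, as named in the list above.
QUERY_METHODS = {"getMatchesForUnmappedBinary", "getMatchesForMappedBinary"}


def is_cleanup_due(last_cleanup, now, interval):
    """True if no cleanup has run yet or the last one is at least interval old."""
    return last_cleanup is None or now - last_cleanup >= interval


def select_expired_jobs(jobs, now, ttl):
    """Return finished query jobs whose finished_at is older than ttl."""
    return [
        job for job in jobs
        if job.method in QUERY_METHODS
        and job.is_finished
        and job.finished_at is not None
        and now - job.finished_at > ttl
    ]
```

Unfinished jobs and jobs of other types are deliberately left alone, so a cleanup run never touches work that is still in progress.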

danielplohmann commented 8 months ago

Alright, this should fully work now. At least locally, it behaved as expected and removed query samples, their jobs, and their results if they were beyond the temporal cutoff. For SMDA reports, this was more tricky, as the sample's sha256 was not easily accessible. I opted for parsing it out of the SMDA report with a regex and no deserialization, and that seems to work. Feel free to play around with it and let me know if it works as intended!
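
For illustration, pulling a sha256 out of a serialized report with a regex could look roughly like the sketch below. It is not the code that went into MCRIT; it assumes the serialized report contains a field of the form "sha256": "<64 hex characters>", and the function name is hypothetical.

```python
import re

# Assumption: the serialized report contains a field of the form
# "sha256": "<64 hex characters>".
SHA256_FIELD = re.compile(r'"sha256"\s*:\s*"([0-9a-fA-F]{64})"')


def extract_sha256(serialized_report):
    """Return the first sha256 found in the serialized report, or None."""
    match = SHA256_FIELD.search(serialized_report)
    return match.group(1).lower() if match else None
```

This skips deserializing the whole report, at the cost of returning the first matching field if the report happens to contain more than one.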

danielplohmann / mcrit

Add TTL to query_* documents #59