yankovs closed this issue 8 months ago
I see that `query_samples` documents already have a `timestamp` key, so TTL shouldn't be a problem (using the built-in feature for TTLs or otherwise), but `query_functions` is a different story. Should a timestamp be appended to it as well? 🤔
Hey!
Functions are not supposed to exist on their own (i.e. without an associated sample), so the way to go would be to delete all functions matching a `sample_id` from `query_functions` whenever that sample is removed from `query_samples`.
I see. So the preferred way would be to create a manual service that deletes a sample from `query_samples` together with its associated functions? How would that work if a sample appears multiple times in `query_samples`? Would the id be the same? Because if not, how can you tell which functions belong to this specific entry in the collection and not to another one?
Yes, it would be the same: there is a unique relationship between a `sample_id` and its associated `function_ids`; functions are never shared across `sample_ids`.
So deleting the functions of a specific sample would delete the functions for every other instance of this sample in `query_samples`?
Hey!
No, each instance in `query_samples` will have its own set of functions in `query_functions`, at least if you submitted with the `force_recalculation=True` parameter:
```
> db.query_samples.find({}, {"sha256": 1, "sample_id": 1})
{ "_id" : ObjectId("65951f011d9b2d99e3ee2773"), "sample_id" : -1, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
{ "_id" : ObjectId("65951f181d9b2d99e3ee2b62"), "sample_id" : -2, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
{ "_id" : ObjectId("65951f1f1d9b2d99e3ee2f51"), "sample_id" : -3, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
{ "_id" : ObjectId("65951f271d9b2d99e3ee3340"), "sample_id" : -4, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
> db.query_functions.find({"offset": 4243663,}, {"function_id": 1, "sample_id": 1})
{ "_id" : ObjectId("65951f011d9b2d99e3ee2786"), "function_id" : -19, "sample_id" : -1 }
{ "_id" : ObjectId("65951f181d9b2d99e3ee2b75"), "function_id" : -1023, "sample_id" : -2 }
{ "_id" : ObjectId("65951f201d9b2d99e3ee2f64"), "function_id" : -2027, "sample_id" : -3 }
{ "_id" : ObjectId("65951f281d9b2d99e3ee3353"), "function_id" : -3031, "sample_id" : -4 }
```
So the approach would be to first query with a timestamp filter to get all the `sample_ids` in question and then remove the entries with those `sample_ids` from both collections (`query_samples`, `query_functions`):
```
> db.query_samples.find({"timestamp": {"$lt": "2024-01-03T08-47-20"}}, {"sha256": 1, "sample_id": 1, _id: 0})
{ "sample_id" : -1, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
{ "sample_id" : -2, "sha256" : "b5d03910de7d2f2ed1a95c5cafc899444451b75f62d13788adcc998d3e13492d" }
> db.query_samples.remove({"sample_id": {"$in": [-1, -2]}})
WriteResult({ "nRemoved" : 2 })
> db.query_functions.remove({"sample_id": {"$in": [-1, -2]}})
WriteResult({ "nRemoved" : 2008 })
```
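The select-then-delete pattern from the shell session could be wrapped in a small periodic cleanup job. Here is a minimal Python sketch of that logic; it operates on plain in-memory lists standing in for the two collections, and the function names and the 30-day default are illustrative assumptions, not MCRIT API:

```python
from datetime import datetime, timedelta

def find_stale_sample_ids(query_samples, cutoff):
    # Timestamps are assumed to use the sortable "%Y-%m-%dT%H-%M-%S"
    # string format seen in the shell output, so lexicographic
    # comparison matches chronological order.
    return [doc["sample_id"] for doc in query_samples
            if doc["timestamp"] < cutoff]

def cleanup(query_samples, query_functions, max_age_days=30):
    """Delete stale samples and their functions from both collections.

    Returns (n_samples_removed, n_functions_removed), mirroring the
    nRemoved counts of the two remove() calls in the shell session.
    """
    cutoff = (datetime.now() - timedelta(days=max_age_days)).strftime("%Y-%m-%dT%H-%M-%S")
    stale = set(find_stale_sample_ids(query_samples, cutoff))
    n_samples = sum(1 for d in query_samples if d["sample_id"] in stale)
    n_functions = sum(1 for d in query_functions if d["sample_id"] in stale)
    # Remove matching entries from both collections in place.
    query_samples[:] = [d for d in query_samples if d["sample_id"] not in stale]
    query_functions[:] = [d for d in query_functions if d["sample_id"] not in stale]
    return n_samples, n_functions
```

Against a real deployment, the two list rewrites would become `deleteMany({"sample_id": {"$in": [...]}})` calls on each collection.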
Flow of action for a potential implementation, as discussed yesterday:

- The timestamp of the last cleanup can be stored in the `db.settings` collection; it would then need a setter and getter like the two below (`_getDbState`, `_getDbTimestamp`) but publicly accessible, exposed via `@Remote` like this, which will make them callable in MinHashIndex due to the transparent job passthrough enabled by metaprogramming.
- `getQueueData()` is directly callable from within MinHashIndex due to inheritance. Methods of interest are `getMatchesForUnmappedBinary` and `getMatchesForMappedBinary`.
- Once a list of jobs has been obtained, they can be inspected for their `is_finished` state and e.g. their `finished_at` timestamp. If the difference between this timestamp and our last cleanup has passed the desired threshold, the job can be deleted from the queue and the accompanying result can be deleted from GridFS. Right now, there is no method to delete results, but I will add it in one of the next versions of MCRIT.

Alright, this should fully work now. At least locally, it behaved as expected and removed query samples, their jobs, and their results if they were beyond the temporal cutoff. For SMDA reports, this was more tricky, as the sample's sha256 was not easily accessible. I now opted for parsing it out of the SMDA report with a regex and no deserialization, and that seems to work. Feel free to play around with it and let me know if it works as intended!
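The regex trick for pulling the sha256 out of an SMDA report without deserializing it could look like the sketch below. The exact structure of the serialized report is an assumption here; the regex simply targets a 64-hex-character value following a `"sha256"` key anywhere in the raw JSON text:

```python
import re

# Assumption: the serialized SMDA report contains a field such as
# "sha256": "<64 hex characters>" somewhere in its JSON text.
SHA256_PATTERN = re.compile(r'"sha256"\s*:\s*"([0-9a-fA-F]{64})"')

def extract_sha256(raw_report: str):
    """Pull the sample's sha256 out of a serialized SMDA report
    with a regex, avoiding deserialization of the (potentially
    very large) JSON document. Returns None if no match is found."""
    match = SHA256_PATTERN.search(raw_report)
    return match.group(1) if match else None
```

This avoids the cost of `json.loads` on reports that can hold thousands of functions, at the price of depending on the serialized key name.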
In an automated system, the inserts into the query_* collections grow very large very quickly. A couple of weeks or months after a query, the MCRIT db has probably changed, so the sample will likely require a re-query anyway; saving old results is therefore not that useful.
It would be nice if there was an option to turn on TTL and just remove such query-related data after some user-defined period.
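For reference, MongoDB's built-in TTL feature mentioned earlier would be a one-liner per collection, as sketched below. One caveat: TTL indexes only expire documents whose indexed field holds a BSON Date, so this would only apply if `timestamp` were stored as a Date rather than the string format shown in the shell output above, and `query_functions` would first need a timestamp key appended, as raised in the opening comment:

```js
// Hypothetical: expire query data 30 days (2592000 s) after its timestamp.
// Requires "timestamp" to be a BSON Date, not a string.
db.query_samples.createIndex({"timestamp": 1}, {expireAfterSeconds: 2592000})
db.query_functions.createIndex({"timestamp": 1}, {expireAfterSeconds: 2592000})
```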