askap-vast / vast-pipeline

This repository holds the code of the Radio Transient detection pipeline for the VAST project.
https://vast-survey.org/vast-pipeline/
MIT License

Objects with many-to-many relationships to Run are not deleted when deleting runs #584

Closed marxide closed 2 years ago

marxide commented 2 years ago

An Image object may be associated with multiple Run objects. When a run is deleted, the associated images should also be deleted, but only if they are not associated with any other runs. This behaviour is specified in the documentation but doesn't occur: when a run is deleted, it can leave behind images that are not associated with any run. In some cases that doesn't matter; it can even benefit future runs that use that image by allowing the pipeline to skip ingesting the data. However, it is an issue when dealing with failed runs that need to be deleted and run again. During the second run attempt, the pipeline will see that an included image already exists in the database and skip ingest, even in cases where the original ingest of that image failed and was incomplete, e.g. the Image object was created, but the measurements parquet file was never written.

There are some other many-to-many relationships that should be looked at for the same issue:

marxide commented 2 years ago

Possible solutions:

  1. Override the `delete` method on the appropriate models to clean up orphaned m2m instances. The main problem with this solution is that the `delete` method is only called when deleting single instances, not when deleting through a QuerySet (i.e. deleting multiple instances at once), which issues a SQL DELETE statement rather than calling the model `delete` method. However, this should still solve the issue for Run, given that the only time `Run.delete()` is called is on single run instances in the `clearpiperun` management command. Still, I hesitate to use a solution that might bite us later.
  2. Clean up the orphaned m2m instances when a `pre_delete` signal is sent. This should achieve the same result as above but will work with both single instances and QuerySets. Django signals aren't sent when using the special `bulk_*` methods, but there is no `bulk_delete`, so that's not an issue here.
ajstewart commented 2 years ago
  1. seems like the best option to clean it all up, so something like this?: https://stackoverflow.com/a/26546181

It would be good if there was some control over it perhaps? An option in `clearpiperun` that allows removing all the images or not? I can't decide if that would be necessary.

Perhaps a separate issue, but does the image need an "upload successful" column so that Image objects with no parquets written are avoided? Then you would know that there should be no dodgy images in the original upload. Or, if there are, they will be redone when checking the images.

marxide commented 2 years ago
> 1. seems like the best option to clean it all up, so something like this?: https://stackoverflow.com/a/26546181

Yes, except I think it would be better to avoid iterating over each image in the run, so we're not issuing one DELETE statement per image. Something like:

```python
Image.objects.annotate(num_runs=Count("run")).filter(run=run_to_delete, num_runs=1).delete()
```

> It would be good if there was some control over it perhaps? An option in `clearpiperun` that allows removing all the images or not? I can't decide if that would be necessary.

The only case I can think of where that would be helpful is if we wanted to keep the images for a future run to skip the ingest. I'm not sure how we'd implement that with signals though since we don't control the signal being sent, only what happens when it's received. As far as I know, we can't pass an argument to tell the signal receiver to not delete the images.

> Perhaps a separate issue, but does the image need an "upload successful" column so that Image objects with no parquets written are avoided? Then you would know that there should be no dodgy images in the original upload. Or, if there are, they will be redone when checking the images.

Perhaps. The other side of this issue is that the pipeline thought the image had been ingested successfully based solely on the fact that an Image object for that file was in the database. We could address that by fixing the condition that is checked, e.g. instead of just checking that an Image object exists, also check that the measurements parquet file is there.
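That stronger condition might look something like the helper below; the function name and the parquet path layout are assumptions for illustration, not the pipeline's actual conventions.

```python
from pathlib import Path


def image_ingest_complete(image_name: str, parquet_dir: Path) -> bool:
    """Return True only if the measurements parquet for this image exists.

    Hypothetical check: the Image row existing in the database is not
    enough, because a failed ingest can leave the row behind without
    the file. The "<image_name>.measurements.parquet" naming is an
    assumed layout.
    """
    return (parquet_dir / f"{image_name}.measurements.parquet").exists()
```

The skip-ingest decision would then require both the Image object in the database and `image_ingest_complete(...)` returning true.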

ajstewart commented 2 years ago

> Yes, except I think it would be better to avoid iterating over each image in the run, so we're not issuing one DELETE statement per image. Something like:
>
> ```python
> Image.objects.annotate(num_runs=Count("run")).filter(run=run_to_delete, num_runs=1).delete()
> ```

👍

> The only case I can think of where that would be helpful is if we wanted to keep the images for a future run to skip the ingest. I'm not sure how we'd implement that with signals though since we don't control the signal being sent, only what happens when it's received. As far as I know, we can't pass an argument to tell the signal receiver to not delete the images.

Yeah, on second thought it could cause issues to have such an option; delete should delete the images.

> Perhaps. The other side of this issue is that the pipeline thought the image had been ingested successfully based solely on the fact that an Image object for that file was in the database. We could address that by fixing the condition that is checked, e.g. instead of just checking that an Image object exists, also check that the measurements parquet file is there.

That's what I meant, really: just on the database side, so you don't have to go checking the files on the system. But either works.