Closed marxide closed 2 years ago
Possible solutions:

1. Override `Run.delete()`, given that the only time `Run.delete()` is called is on single run instances in the `clearpiperun` management command. Still, I hesitate to use a solution that might bite us later.
2. Delete the images when the `pre_delete` signal is sent. This should achieve the same result as above but will work with both single instances and QuerySets. Django signals aren't sent when using the special `bulk_*` methods, but there isn't a `bulk_delete`, so that's not an issue here.

It would be good if there was some control over it perhaps? An option in `clearpiperun` that allows removing all images or not? I can't decide if that would be necessary/needed.
Perhaps a separate issue, but does the image need an "upload successful" column so that image objects with no parquets written are avoided? Then you would know that there should be no dodgy images in the original upload. Or if there is, it will be redone when checking the images.
The `pre_delete` signal seems like the best option to clean it all up, so something like this? https://stackoverflow.com/a/26546181
Yes, except I think avoiding iterating over each image in the run would be better so we're not issuing one DELETE statement per image. Something like

```python
Image.objects.filter(run=run_to_delete).annotate(num_runs=Count("run")).filter(num_runs=1).delete()
```
> It would be good if there was some control over it perhaps? An option in `clearpiperun` that allows removing all images or not? I can't decide if that would be necessary/needed.
The only case I can think of where that would be helpful is if we wanted to keep the images for a future run to skip the ingest. I'm not sure how we'd implement that with signals though since we don't control the signal being sent, only what happens when it's received. As far as I know, we can't pass an argument to tell the signal receiver to not delete the images.
> Perhaps a separate issue, but does the image need an "upload successful" column so that image objects with no parquets written are avoided? Then you would know that there should be no dodgy images in the original upload. Or if there is, it will be redone when checking the images.
Perhaps. The other side of this issue is that the pipeline thought the image had been ingested successfully based solely on the fact that an image object for that file was in the database. We could address that by fixing the condition that is checked, e.g. instead of just checking that an image object exists, also check that the measurements parquet file is there.
> Yes, except I think avoiding iterating over each image in the run would be better so we're not issuing one DELETE statement per image. Something like
> `Image.objects.filter(run=run_to_delete).annotate(num_runs=Count("run")).filter(num_runs=1).delete()`
👍
> The only case I can think of where that would be helpful is if we wanted to keep the images for a future run to skip the ingest. I'm not sure how we'd implement that with signals though since we don't control the signal being sent, only what happens when it's received. As far as I know, we can't pass an argument to tell the signal receiver to not delete the images.
Yeah, on second thought it could cause issues to have such an option; delete should delete the images.
> Perhaps. The other side of this issue is that the pipeline thought the image had been ingested successfully based solely on the fact that an image object for that file was in the database. We could address that by fixing the condition that is checked, e.g. instead of just checking that an image object exists, also check that the measurements parquet file is there.
That's what I meant really, just on the database side so you don't have to go checking the files on the system. But either works.
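The two variants discussed above — a database flag versus checking the file on disk — could also be combined into one conservative "skip ingest" condition. A minimal sketch, assuming hypothetical names (`measurements_written` as the proposed "upload successful" column, `can_skip_ingest` as a helper); none of these come from the actual pipeline:

```python
import tempfile
from pathlib import Path


def can_skip_ingest(image_in_db: bool,
                    measurements_written: bool,
                    measurements_parquet: Path) -> bool:
    """Skip ingest only if the Image row exists AND its ingest completed.

    `measurements_written` models the proposed "upload successful" DB column;
    `measurements_parquet.is_file()` is the filesystem variant of the same idea.
    """
    return image_in_db and measurements_written and measurements_parquet.is_file()


with tempfile.TemporaryDirectory() as tmp:
    # Failure case from the issue: the Image row exists but the parquet was
    # never written, so a second run attempt must re-ingest rather than skip.
    missing = Path(tmp) / "measurements.parquet"
    redo = not can_skip_ingest(True, True, missing)

    # Healthy case: row exists and the parquet is on disk, so skip is safe.
    written = Path(tmp) / "measurements_ok.parquet"
    written.touch()
    skip = can_skip_ingest(True, True, written)
```

Requiring both signals is the stricter reading of "either works": the column avoids touching the filesystem on the happy path, while the file check catches rows whose flag was set but whose output later disappeared.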
An `Image` object may be associated with multiple `Run` objects. When a run is deleted, the associated images should also be deleted, but only if they are not associated with any other runs. This behaviour is specified in the documentation but doesn't occur. When a run is deleted, it can leave behind images that are not associated with any run. In some cases, that doesn't matter; it will even benefit future runs that use that image by allowing the pipeline to skip ingesting the data. However, it is an issue when dealing with failed runs that need to be deleted and run again. During the second run attempt, the pipeline will see that an included image already exists in the database and skip ingest, even in cases where the original ingest of that image failed and was incomplete, e.g. the `Image` object was created, but the measurements parquet file was never written.

There are some other many-to-many relationships that should be looked at for the same issue:

- `SkyRegion.run` is a many-to-many relationship to `Run`. After deleting the last run associated with a sky region, the `SkyRegion` object will be left behind. This likely isn't an issue given how light the `SkyRegion` object is, but we should address it for consistency.
- `Source.related` is a many-to-many relationship to `Source`. I don't think this one will have the same problem since the model is related to itself.