HEPData / hepdata_lib

Library for getting your data into HEPData
https://hepdata-lib.readthedocs.io
MIT License

Make directory cleanup more robust #194

Open clelange opened 2 years ago

clelange commented 2 years ago

See discussion in #193

I support making this more robust overall. However, I think that tacking on individual checks for specific edge cases might not be the best approach, since we might still miss something. How about instead we generally force the user to use a previously nonexistent directory the first time they execute create_files? That requires preserving between runs the knowledge of whether a given directory was originally created by hepdata_lib. We could easily accomplish this by depositing an empty signifier file (e.g. `$DIRECTORY/.created_by_hepdata_lib`) in the desired directory. Each time it is run, create_files would check whether the output directory already exists and whether the signifier file exists. If the directory does not yet exist, we proceed as normal, creating both the directory and the signifier file. If the directory exists but the file does not, we exit with a warning telling the user to use a dedicated empty directory in order to avoid trouble.

Right this moment, though, we have code published on pypi that can accidentally wipe user files with default settings. Therefore, let's please merge this hot fix and mint a new version. That buys us a little time to think through how to really fix this once and for all.

_Originally posted by @AndreasAlbert in https://github.com/HEPData/hepdata_lib/issues/193#issuecomment-1011947711_

AndreasAlbert commented 2 years ago

The alternative would be to actively track individual files we have created and only ever delete those. Basically, whenever we create an output file, we would write its name into a persistent storage file in the output directory. Then, when deletion time comes, we only delete whatever was in that file.

clelange commented 2 years ago

> The alternative would be to actively track individual files we have created and only ever delete those. Basically, whenever we create an output file, we would write its name into a persistent storage file in the output directory. Then, when deletion time comes, we only delete whatever was in that file.

Tracking all this would be very involved. I see too many failure scenarios, e.g. the user deleting the tracking file, the script being cancelled mid-run, etc.

I would prefer to proceed as I wrote in #193:

  1. Do not allow the output directory to be the same as the directory in which the Python files reside (and the script directory must also not be a subdirectory of the output directory).
  2. In addition, we could force that a nonexistent directory is used as output the first time the script is run, which could be checked (as suggested above) by creating a file `$DIRECTORY/.created_by_hepdata_lib` or similar, but I'm not sure how robust this would be.
  3. Add an additional confirmation prompt when the directory would be deleted, which could be overridden by a `--force-cleanup` flag or similar.
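Check 1 could be sketched roughly as follows. The function name is illustrative, and `script_dir` is parameterized here only for testability; in practice it would default to the directory of the running script:

```python
import os
import sys

def check_output_directory(outdir, script_dir=None):
    """Refuse output directories that contain the running script.

    Guards against cleanup deleting the user's own analysis files by
    rejecting an output directory that equals, or is a parent of, the
    directory the executing script lives in.
    """
    if script_dir is None:
        script_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
    out = os.path.abspath(outdir)
    script_dir = os.path.abspath(script_dir)
    if script_dir == out or script_dir.startswith(out + os.sep):
        raise ValueError(
            f"Refusing to use '{outdir}' as output directory: it contains "
            "the script being executed, so cleanup could delete user files."
        )
```

Note that an output directory *inside* the script directory (e.g. `./output/`) is still allowed; only the dangerous inverse relationship is rejected.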

Maybe the relatively simple directory check would do, since the cleanup is now disabled by default?