HEPData / hepdata_lib

Library for getting your data into HEPData
https://hepdata-lib.readthedocs.io
MIT License
15 stars 39 forks source link

submission.create_files() fails and wipes the directory clean #192

Closed rmanzoni closed 2 years ago

rmanzoni commented 2 years ago

Hi!

this script based off the simple examples I found in the documentation, fails to write the submission file submission.create_files() and deletes all other files that reside in the same directory it is launched from.

import numpy as np
from hepdata_lib import Submission, Table, Variable
from gentau_pt_eff import raw_data # here https://gist.github.com/rmanzoni/801a8788a5c6b8db3b5d3cd6697097d5

submission = Submission()

table = Table("Figure 5")

tau_gen_vis_pt = Variable(
    "Efficiency times acceptance", 
    is_independent = True, 
    is_binned      = True, 
    units          = "GeV"
)
tau_gen_vis_pt.values = [ 
    ( 20., 22.), 
    ( 22., 24.), 
    ( 24., 26.), 
    ( 26., 28.), 
    ( 28., 30.), 
    ( 30., 35.), 
    ( 35., 40.), 
    ( 40., 50.), 
    ( 50., 60.), 
    ( 60., 70.), 
    ( 70., 80.), 
    ( 80., 90.),
    ( 90.,100.),
    (100.,150.),
    (150.,200.),
    (200.,500.),
]

table.add_variable(tau_gen_vis_pt)

efficiencies = dict()

nicer_names = {
    'dmw2p' :'decay mode',
    'dmwo2p':'decay mode w/o 2-prong',
    'loose' :'deeptau vs jet loose',
    'medium':'deeptau vs jet medium',
    'tight' :'deeptau vs jet tigt',
}

# dependent variables y-axis
for ik in raw_data.keys():
    efficiencies[ik] = Variable("Efficiency %s" %nicer_names[ik], is_independent = False, is_binned = False, units = "")
    efficiencies[ik].values = raw_data[ik]['values']
    table.add_variable(efficiencies[ik])

submission.add_table(table)
submission.create_files()

I guess this is not the intended behaviour...

this is the error it rreturns

In [2]: submission.create_files()
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-7ea7852cac01> in <module>
----> 1 submission.create_files()

~/opt/anaconda3/envs/hammer3p8/lib/python3.8/site-packages/hepdata_lib/__init__.py in create_files(self, outdir, validate, remove_old)
    552         """
    553         if remove_old and os.path.exists(outdir):
--> 554             shutil.rmtree(outdir)
    555
    556         if not os.path.exists(outdir):

~/opt/anaconda3/envs/hammer3p8/lib/python3.8/shutil.py in rmtree(path, ignore_errors, onerror)
    735             # can't continue even if onerror hook returns
    736             return
--> 737         return _rmtree_unsafe(path, onerror)
    738
    739 # Allow introspection of whether or not the hardening against symlink

~/opt/anaconda3/envs/hammer3p8/lib/python3.8/shutil.py in _rmtree_unsafe(path, onerror)
    617         os.rmdir(path)
    618     except OSError:
--> 619         onerror(os.rmdir, path, sys.exc_info())
    620
    621 # Version using fd-based APIs to protect against races

~/opt/anaconda3/envs/hammer3p8/lib/python3.8/shutil.py in _rmtree_unsafe(path, onerror)
    615                 onerror(os.unlink, fullname, sys.exc_info())
    616     try:
--> 617         os.rmdir(path)
    618     except OSError:
    619         onerror(os.rmdir, path, sys.exc_info())

OSError: [Errno 22] Invalid argument: '.'

In [3]:
AndreasAlbert commented 2 years ago

Thanks for reporting @rmanzoni This is indeed an unintended conseuqnece. IMO, we should never delete files unless the user actively asks for it. This deletion happens because create_files has an argument remove_old, which currently defaults to true https://github.com/HEPData/hepdata_lib/blob/master/hepdata_lib/__init__.py#L541

This should be changed to defaulting to false, so that a user would have to explciitly call submission.create_files(....,remove_old=True) in order for any deletion to take place.

rmanzoni commented 2 years ago

Hi Andreas, thanks for the quick support! Just to avoid misunderstandings, the problem I observed is that all files that lived in the same directory as that of the hepdata script got deleted by it. Even those that had nothing to do with it. If it just deleted the existing hepdata outdir or the submission file that would've been minor. So I guess perhaps just changing the default is not enough, as it would reproduce the same bad behaviour when the user sets it to true.

AndreasAlbert commented 2 years ago

We currently do not have a mechanism to track what files are created by hepdata-lib. Therefore, the straightforward implementation of remove_old just removes the entire output directory. The entire reasoning here is that unrelated files in the output directory, which then go into the submission tar ball, will lead the hepdata validator to fail. For each file in the tar ball, it checks whether the file is referred to in the submission.yaml file. If not, it fails, arguing that something likely went wrong. Therefore, full deletion of the output directory is sensible in principle. However, it should of course not delete unrelated files that are outside of its scope. It seems to me that changing the default output direcotry to something like ./hepdata_output/ would solve this. After all, a sensible user would not store unrelated in such a specifically named output directory. Of course the documentation of remove_old should make obvious to the user that the entire folder is deleted, not just individual files.

rmanzoni commented 2 years ago

I still have the impression that I didn't explain myself clear: I attach a screenshot of the directory that contains the plot I'm trying to include in the submission object using the hepdata.py script that I've reported above.

When it gets to submission.create_files() it deletes the entire content of the directory on display, not just output. And then fails with the message I reported.

I would've been ok if it removed just output and its content. I agree it does not make sense to add spurious files in there.

Schermata 2022-01-13 alle 11 09 05
AndreasAlbert commented 2 years ago

Does the screenshot match the code you shared above? In the code you posted above, you call create_files() without any arguments. This means that the default output directory is used, namely outdir=".", i.e. the directory you are currently in. The code then tries to trigger deletion of outdir. Based on the code you shared, I do not know what the output directory is in the screenshot.

rmanzoni commented 2 years ago

ok, now I understand. Indeed it's called w/o arguments, so iiuc it defaults to '.', hence the problem I experienced. I confirm that when I pass a different directory name all works as it should. Thanks!

clelange commented 2 years ago

Thanks for reporting and providing further details. hepdata_lib v0.10.1 that contains the hotfix is on PyPI. I've created #194 for discussion of a better/robust solution.