cta-observatory / cta-lstchain

LST prototype testbench chain
https://cta-observatory.github.io/cta-lstchain/
BSD 3-Clause "New" or "Revised" License
24 stars 77 forks source link

Error handling in lstchain #271

Open labsaha opened 4 years ago

labsaha commented 4 years ago

Yesterday while trying to run dl0-dl1 scripts, we found a strange error. Finally, we could manage to understand that it was due to the presence of the partially generated DL1 file. When we deleted the file, we could run the script without any problem. In such cases, the error message should be very transparent which is not the case for the present version of the code. This is also required for our LSTOSA (LST-On-Site Analysis) development. Depending upon the error, we can decide to run the code again with associated steps to correct it automatically or raise some errors which require to be handled manually.

rlopezcoto commented 4 years ago

Hi @labsaha could you copy here to which error you are referring?

andres-baquero commented 4 years ago

 File "lstchain_data_r0_to_dl1.py", line 73, in main
    pointing_file_path=args.pointing_file_path
  File "/fefs/aswg/software/virtual_env/ctasoft/cta-lstchain/lstchain/reco/dl0_to_dl1.py", line 181, in r0_to_dl1
    write_metadata(metadata, output_filename)
  File "/fefs/aswg/software/virtual_env/ctasoft/cta-lstchain/lstchain/io/io.py", line 454, in write_metadata
    attribute = file.root._v_attrs[k].extend(metadata[k])
AttributeError: 'NoneType' object has no attribute 'extend'
labsaha commented 4 years ago

File "lstchain_data_r0_to_dl1.py", line 73, in main pointing_file_path=args.pointing_file_path File "/fefs/aswg/software/virtual_env/ctasoft/cta-lstchain/lstchain/reco/dl0_to_dl1.py", line 181, in r0_to_dl1 write_metadata(metadata, output_filename) File "/fefs/aswg/software/virtual_env/ctasoft/cta-lstchain/lstchain/io/io.py", line 454, in write_metadata attribute = file.root._v_attrs[k].extend(metadata[k]) AttributeError: 'NoneType' object has no attribute 'extend'

So you can see that it was not complaining about the presence of the DL1 file. When we deleted the DL1 files, we didn't get this error.

vuillaut commented 4 years ago

Hi @labsaha It seems from @andres-baquero comment that the file was corrupted but I don't understand the context.

Could you provide more information, please? What command did you run?

rlopezcoto commented 4 years ago

I do not think it is corrupted, this happens when you try to overwrite a file, no matter whether the existing one is corrupted or not, right @labsaha, @andres-baquero ?

labsaha commented 4 years ago

Ok, Let me explain a bit in detail. While producing the DL1 files using the standard command associated with the script dl0_to_dl1.py, the code failed due to the pointing issue we are discussing. However, it produced a DL1 file which is not the full DL1 file. When we fixed the pointing code and run the script it shows this error. This should clearly say that if we want to overwrite or by default it should overwrite without complaining. I hope I made it clear. If not, let me know.

andres-baquero commented 4 years ago

We ran like this:

python lstchain_data_r0_to_dl1.py -f /fefs/aswg/data/real/R0/20200118/LST-1.1.Run01832.0003.fits.fz -o /fefs/aswg/data-test/real/dl1/20200118 -pedestal /fefs/aswg/data/real/calibration/20200118/v01/drs4_pedestal.Run1830.0000.fits -calib /fefs/aswg/data/real/calibration/20200118/v01/calibration.Run1831.0000.hdf5 -time_calib /fefs/aswg/data/real/calibration/20200118/v01/time_calibration.Run1831.0000.hdf5 -pointing /fefs/home/lapp/DrivePositioning/drive_log_20_01_20.txt

andres-baquero commented 4 years ago

Some more information: If I run the script again without deleting the DL1 file which is not corrupted then we get this error message:

  File "lstchain_data_r0_to_dl1.py", line 78, in <module>
    main()
  File "lstchain_data_r0_to_dl1.py", line 73, in main
    pointing_file_path=args.pointing_file_path
  File "/fefs/aswg/software/virtual_env/ctasoft/cta-lstchain/lstchain/reco/dl0_to_dl1.py", line 215, in r0_to_dl1
    write_array_info(event, output_filename)
  File "/fefs/aswg/software/virtual_env/ctasoft/cta-lstchain/lstchain/io/io.py", line 350, in write_array_info
    append=True
  File "/fefs/aswg/software/virtual_env/anaconda3/envs/cta/lib/python3.7/site-packages/astropy/table/connect.py", line 114, in __call__
    registry.write(instance, *args, **kwargs)
  File "/fefs/aswg/software/virtual_env/anaconda3/envs/cta/lib/python3.7/site-packages/astropy/io/registry.py", line 566, in write
    writer(data, *args, **kwargs)
  File "/fefs/aswg/software/virtual_env/anaconda3/envs/cta/lib/python3.7/site-packages/astropy/io/misc/hdf5.py", line 308, in write_table_hdf5
    compatibility_mode=compatibility_mode)
  File "/fefs/aswg/software/virtual_env/anaconda3/envs/cta/lib/python3.7/site-packages/astropy/io/misc/hdf5.py", line 323, in write_table_hdf5
    raise OSError("Table {0} already exists".format(path))
OSError: Table /instrument/subarray/layout already exists
vuillaut commented 4 years ago

I understand now, thanks. Well, partial production or writing of a file is not supported at the moment.

kosack commented 4 years ago

In my scripts, I add an "overwrite" option, so that the user can choose to either overwrite, or fail in the case a file is there already:

     output_path = Path(expandvars(self.output_filename)).expanduser()
     if output_path.exists() and self.overwrite:
            self.log.warning(f"Overwriting {output_path}")
            output_path.unlink()
labsaha commented 4 years ago

Apart from this, I am also thinking about assigning a code number associated with errors. It will allow us to know quickly what is the problem. For example, if a run finishes without an error, it should return 0. If it finishes with an error it returns with something like xx or so on. It will help us to debug quickly when the data reduction pipeline runs automatically. @kosack @vuillaut @rlopezcoto What do you think on this? Or the same implementation can be taken care differently. Any idea or solution?

kosack commented 4 years ago

I'd suggest (anticipating the upcoming CTA Programming Standards Doc) that you use the same return codes as in the Linux errorno.h header file (0=success, positive are defined errors), and then use negative values for any custom statuses. That way, you are at least somewhat consistent. However, you should still catch exceptions and issue a nice error message.

Note that if you use the ctapipe.core.Tool, you get some standard behavior when a tool finishes normally, is interrupted (by ctrl-C for example), or catches an exception somewhere (though right now it doesn't use return codes, but it should!)

vuillaut commented 4 years ago

Thank you @kosack. I think what you mention here will be covered by ctapipe DL1 writer, right?

kosack commented 4 years ago

Oh, I should say also that the Provenence information in ctapipe tells you something about the status of output files as well. E.g. if I run a Tool and interrupt it, I get a provenance entry like:

    ...
      "status": "interrupted",
      "duration_min": 0.09620000000007067
   }

That's from the DL1 writer (and anything inheriting from Tool)

kosack commented 4 years ago

I think what you mention here will be covered by ctapipe DL1 writer, right?

Yes

vuillaut commented 4 years ago

I think what you mention here will be covered by ctapipe DL1 writer, right?

Yes

Then I would personally suggest making a quick fix concerning the current issue but not throwing a lot of effort in something that will be replaced in the very near future.

labsaha commented 4 years ago

@kosack @vuillaut Thanks!. Yes, it will be very good if we go for a quick fix on this issue.

rlopezcoto commented 4 years ago

is anybody working on this fix within lstchain? @labsaha, @vuillaut? One quick solution could be that proposed by @yrenier, in which you check whether the file exists and throw an error if it does:

if os.path.exists(output_filename):
    raise AttributeError(output_filename + ' exists, exiting.')
labsaha commented 4 years ago

I am not working on this and cannot work on this in the future. Yes, as a quick solution what @yrenier suggested can be considered.