Handle minimal Provenance information

bregeon commented 4 years ago

The idea is that we want a minimal set of information to be stored in some way that explains where a given set of IRFs comes from, i.e. provenance information following IVOA standards, such as the origin of the input DL2 files, the cuts, and the optimization target for these IRFs.

In ctapipe, this is handled by a dedicated provenance module that is called by each component/tool.

But as pyirf is meant to be self-contained, a simpler mechanism might be set up.

Actually, we might want some kind of interface that could use either the simple local provenance module (for users playing around) or the official ctapipe module when in full official production.
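
A minimal sketch of what such an interface could look like. `ProvenanceBackend`, `LocalProvenance`, `make_irfs`, the role strings, and the file names are all hypothetical; none of them exist in pyirf or ctapipe. A ctapipe-backed class would only need to provide the same two methods.

```python
import json
from typing import Protocol


class ProvenanceBackend(Protocol):
    """Hypothetical interface; not part of pyirf or ctapipe."""

    def add_input(self, path: str, role: str) -> None: ...

    def add_output(self, path: str, role: str) -> None: ...


class LocalProvenance:
    """Simple in-memory backend for users playing around."""

    def __init__(self) -> None:
        self.records = []

    def add_input(self, path: str, role: str) -> None:
        self.records.append({"kind": "input", "path": path, "role": role})

    def add_output(self, path: str, role: str) -> None:
        self.records.append({"kind": "output", "path": path, "role": role})

    def to_json(self) -> str:
        return json.dumps(self.records, indent=2)


def make_irfs(dl2_path: str, irf_path: str, prov: ProvenanceBackend) -> None:
    """Placeholder IRF step that only reports its files to the backend."""
    prov.add_input(dl2_path, role="dl2.gamma")
    # ... the actual IRF computation would go here ...
    prov.add_output(irf_path, role="irf")


prov = LocalProvenance()
make_irfs("gamma_dl2.h5", "irfs.fits.gz", prov)
print(prov.to_json())
```

The calling code depends only on the two methods, so swapping the backend would not require touching the IRF functions themselves.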

HealthyPear commented 4 years ago

A very easy way (used by the current master) is to just create secondary files: in that case, one per particle type with the selected simulated events, and one containing a table of the final cuts used to create the IRFs.

Personally I don't think it's that bad to just output a single FITS or HDF5 file containing all the auxiliary/provenance information.

HealthyPear commented 4 years ago

Actually, it should not be forbidden to add HDUs to the final FITS file where the OGADF information will be encoded.

So we could also think of adding an HDU for each piece of additional information we want to deliver (see issue #6).

vuillaut commented 4 years ago

> The idea is that we want a minimal set of information to be stored in some way that explains where a given set of IRFs comes from, i.e. provenance information following IVOA standards, such as the origin of the input DL2 files, the cuts, and the optimization target for these IRFs.
>
> In ctapipe, this is handled by a dedicated provenance module that is called by each component/tool.
>
> But as pyirf is meant to be self-contained, a simpler mechanism might be set up.
>
> Actually, we might want some kind of interface that could use either the simple local provenance module (for users playing around) or the official ctapipe module when in full official production.

The provenance should be handled at the pipeline level, e.g. using the ctapipe module or any other mechanism. As we decided that pyIRF would be an independent library called by such a pipeline, I don't think we should be concerned with provenance here.

NB: if provenance info is added later to the OGADF, then sure we will follow the evolution of the format

kosack commented 4 years ago

This provenance also has to include things like what were the inputs, outputs, etc. E.g. was this DL2 data from EventDisplay or ctapipe? Was it from Prod3 or Prod5? etc. What steps were applied to it? A minimal set is to support the CTA Reference Metadata (the same headers we put now in the DL1 files), but more detail will also be needed.

One other way is to use the LogProv system from Matthieu and Enrique, which is so far tested with gammapy - it allows one to attach provenance tracking information at the function-call level (so nice for user scripts that use the PyIRF system), usually by just adding some decorators.

see https://github.com/mservillat/logprov
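
The decorator idea can be sketched in plain Python as follows. This is not the logprov API: `track`, `PROVENANCE_LOG`, and `apply_cut` are hypothetical names used only for illustration.

```python
import functools
from datetime import datetime, timezone

PROVENANCE_LOG = []  # hypothetical in-memory log


def track(func):
    """Record the name, arguments, start time, and result of each call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        entry = {
            "activity": func.__name__,
            "args": [repr(a) for a in args],
            "kwargs": {k: repr(v) for k, v in kwargs.items()},
            "start": datetime.now(timezone.utc).isoformat(),
        }
        result = func(*args, **kwargs)
        entry["result"] = repr(result)
        PROVENANCE_LOG.append(entry)
        return result

    return wrapper


@track
def apply_cut(scores, threshold=0.5):
    """Toy user function: keep the scores above a threshold."""
    return [s for s in scores if s > threshold]


selected = apply_cut([0.2, 0.7, 0.9], threshold=0.6)
print(selected)
print(PROVENANCE_LOG[0]["activity"], PROVENANCE_LOG[0]["kwargs"])
```

The user function itself stays unchanged; only the decorator line is added, which is what makes this approach attractive for user scripts.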

vuillaut commented 4 years ago

> This provenance also has to include things like what were the inputs, outputs, etc. E.g. was this DL2 data from EventDisplay or ctapipe? Was it from Prod3 or Prod5? etc. What steps were applied to it? A minimal set is to support the CTA Reference Metadata (the same headers we put now in the DL1 files), but more detail will also be needed.
>
> One other way is to use the LogProv system from Matthieu and Enrique, which is so far tested with gammapy - it allows one to attach provenance tracking information at the function-call level (so nice for user scripts that use the PyIRF system), usually by just adding some decorators.

My point exactly, and pyIRF has no way to know where these files come from and what was done with them prior to DL2, so the provenance should be dealt with at a higher level, no?

HealthyPear commented 4 years ago

I think we are conflating two kinds of "provenance" here:

- the simtel-to-DL2 provenance (which, as @vuillaut says, is a pipeline matter)
- the provenance produced by pyirf (e.g. the final optimization cuts used to create the IRFs), which should be part of the pyirf output (either together with, or separate from, the "pure" OGADF IRF information)

mservillat commented 4 years ago

Indeed, to efficiently build the chain of provenance, ideally, each package has to provide its inputs/outputs and give information on the execution. Each dataset will have a dedicated identifier that is used to make the connection with the previous steps in the chain.
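
A minimal sketch of how such identifiers could link the steps of the chain. The `Dataset` and `Step` structures, the `trace` helper, and the file and package names are hypothetical and only illustrate the principle.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A data product with a unique identifier (hypothetical structure)."""

    name: str
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


@dataclass
class Step:
    """One package's execution record: what it read and what it wrote."""

    package: str
    inputs: list
    outputs: list


def trace(dataset_id, steps):
    """Walk back from a dataset to the packages that led to it."""
    chain = []
    pending = [dataset_id]
    while pending:
        current = pending.pop()
        for step in steps:
            if current in step.outputs:
                chain.append(step.package)
                pending.extend(step.inputs)
    return chain


simtel = Dataset("gamma.simtel.gz")
dl2 = Dataset("gamma_dl2.h5")
irf = Dataset("irfs.fits.gz")

steps = [
    Step("pipeline", inputs=[simtel.id], outputs=[dl2.id]),
    Step("pyirf", inputs=[dl2.id], outputs=[irf.id]),
]
print(trace(irf.id, steps))
```

Each package only has to report its own inputs and outputs; the chain is reconstructed afterwards by matching identifiers.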

In the case of pyIRF, it might be interesting to adapt the logprov Python module (initially part of a development version of gammapy, in connection with the high-level interface). However, it may not yet be suited to the structure of pyIRF.

maxnoe commented 3 years ago

@mservillat This is exactly the kind of thing I want to avoid baking into pyirf right now.

We offer small, modular functions that do one thing, so any user (like lstchain, protopipe, future ctapipe tools, or someone else) can choose their own config and provenance system, since no standard, agreed-upon solution exists.

cta-observatory / pyirf

Handle minimal Provenance information #39