Provenance - Githubissues

dneise commented 6 years ago

Brainstorming!

What is it? Do we need it? What to expect from it? how to use it?

Ring is open ... fight :-D

calispac commented 6 years ago

Just that we agree on definition:

Provenance = Metadata allowing to retrieve the analysis that was performed for an analysis output?

Should contain :

a list of analysis steps
version of the code
date of execution
input data provenance (which zfits files)
config of the camera/telescope :in lab, at site, shutter closed/opened, etc. (in principle part of zfits header)

Ideally one could recreate the analysis in a few lines of code e.g.:

   results_from_analysis = load('awesome_results')
   anaylsis_steps = provenance.recreate_analysis(results_from_analysis)
   for step in analysis_steps:
        print(step)

Which would print something like:

0 : read_zfits config={'url':'awesome_data.zfits'}
1 : baseline_sub
2 : compute_charge_integral
3 : make_a_histo config = {'n_bins': 100}

Then you want to rerun the analysis with maybe different config of step 3 or a different version of an analysis step.

   results_from_analysis = load('awesome_results')
   anaylsis_steps = provenance.recreate_analysis(results_from_analysis)
   analysis_steps[3] = make_a_better_histo(config={'n_bins':1000})
   for step in analysis_steps:

         step.run()

dneise commented 6 years ago

Thank you Cyril for this. I understood Provenance similarly, when reading what the ctapipe people were saying. I thought a bit about the concept and I think it works, when you have users which are not writing code.

The example for this concept of provenance are some image processing suites like Adobe Lightroom or so. They typically store the entire provenance of you working on an image .. and when you are happy with the result, you can apply the exact same steps to the other 1000 images you have taken at the same day of the same sujet.

So far so good.

Now ... for this thing to work "The User" must be unable to perform steps on the image, which are not or cannot be reflected in this "history of steps applied to the image".

For example .. let us assume our typical example "pipeline_crab" looked like this: https://github.com/cta-sst-1m/digicampipe/blob/d866b50c6bcc46f4b40792fc7a333d8c5a3efcc6/digicampipe/processors/pipeline_crab.py#L35-L67

This is an example from the "Processor proposal" #119 ... but never mind that ... important is, that the entire process is defined by this list of processors, i.e. by their order and their settings.

We can image, that each of these processors, can store itself, including its settings somehow to a file .. so we can later apply the exact same process to another input file. Let us image these processors, are all instances of a class called "processor" and they all somehow inherit this provenance feature from their parent class....

Now ... Python is a dynamic language ... Nothing prevents any user from putting this, into the process:

        proc.filters.FilterShower(minimal_number_of_photons),
        proc.calib.dlw.CalibrateT0DL2(
            reclean=True,
            shower_distance=200 * u.mm
        ),
        my_funny_object_multiply_size_with_1.2(),
        proc.io.HillasToText(output_filename)
]

In order for this to work, the user must only let my_funny_object have exactly the same interface as the other "processors" in the list. In the current master, all "processors" have this interface:

be a function
accept a stream, and maybe other parameters
yield events

In #119 the interface is equally easy:

be callable, i.e. basically be a function
accept exactly one parameter: "event"
return event

It is extremely easy to implement this interface without inheriting from any parent class. So any user who does this, will mess up the Provenance without even realizing it.

So .. to sum it up. I think implementing something like Provenance is hard... I mean .. to do it right is hard. I think it is worth it, when you are adobe and have 10 million users. I think messing Provenance up is easy.

So I doubt it is worth to implement it.

Instead .. we should employ a certain workflow, when doing an analysis. This workflow needs no extra code. The "user" can to follow this workflow or not. I am going to sketch this workflow in the next comment.

dneise commented 6 years ago

For repeatability to work, taking the human factor out of the question is very important.

For a new analysis proceed as follows:

start a new empty git repo and clone it to your machine
write an install.sh script, that reflects how to download and install all the programs you are going to need for your analysis.
Only use explicit release versions for your analysis. Do not clone the master of anything; install a specific release.
If the feature you need for your analysis has not been released yet, you cannot do this analysis. (Hence: release early, release often)
Write all the actions you need to perform your analysis into a Makefile, so that performing the analysis is done by typing make.
add, commit and push constantly until a certain result is reached. Freeze the repo at this point, use a new repo for the next analysis.

I do not say, that I always follow this workflow (I try to, but I often fail).

no command line parameters

The analysis you are writing is not supposed to be configurable. It is not a program, it is a scientific document. So event when you analysis script my_analysis.py has no command line parameters apart from the run(s) you are analysing, you still need to write a Makefile like this:

all:
    python my_analysis.py an_explicit_path_to_the_input_files

So that everybody who reads the analysis knows exactly what the input was.

In principle all the input files belong into the repo, but they are often too large. So we have to trust our collaborations, that these files will be stored somewhere forever, so this analysis we are doing is repeatable.

Other files like config files should go into the analysis repo if you are unsure if future colleagues will find them still in 2050.

If your analysis is creating results, which are supposed to be further analyzed. Say you are creating DL2 files. Then make sure the output folder contains at least a little readme file with a link to the repo, which contains and describes your analysis.

I think that's it ...

You see .. this is just my opinion about this provenance topic. I might as well be totally wrong... Or we might simply want to try this provenance stuff and see how it goes.

dneise commented 6 years ago

Are there any more people with opinions about this matter? @moderski ? or @Etienne12345 or so?

cta-sst-1m / digicampipe

Provenance #122

no command line parameters