add pandas_diff for fast-carpenter output

FAST-HEP / scikit-validate

A validation package for science output


add pandas_diff for fast-carpenter output #4

Open kreczko opened 5 years ago

kreczko commented 5 years ago

Imported from GitLab issue 4

@bkrikler Could you please send me some example output files?

kreczko commented 5 years ago

On 2019-03-04 Lukasz Kreczko (kreczko) wrote:

changed the description

kreczko commented 5 years ago

On 2019-03-04 Benjamin Krikler (bkrikler) wrote:

Thanks for putting this on an issue. Best bet for example outputs is the fast_cms_public_tutorial repository's pipeline, e.g. https://gitlab.cern.ch/fast-hep/public/fast_cms_public_tutorial/-/jobs/3492813/artifacts/browse/pipeline/carpenter/ (I've "kept" the job artifacts for that specific pipeline now).

kreczko commented 5 years ago

On 2019-03-04 Benjamin Krikler (bkrikler) wrote:

Also, I had a primitive set of pandas_diff-like tests running in the old FAST-RA1 project, which might help with this: https://gitlab.cern.ch/fast-cms/FAST-RA1/blob/master/tests/integrations/run_tests.py#L83-131. The tests there only checked for exact equality between two reloaded dataframes, but it might help provide a starting point for this. Although the rest of the code is pretty simple, so maybe it's not really adding anything for you...
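
For reference, an exact-equality check between two reloaded dataframes could look roughly like the sketch below. This is not the FAST-RA1 code; the helper name `frames_identical` and the inline CSV content are made up for illustration, and only plain pandas is assumed.

```python
import io

import pandas as pd
from pandas.testing import assert_frame_equal


def frames_identical(reference, candidate):
    """Return True if both dataframes match exactly (values, dtypes, index)."""
    try:
        assert_frame_equal(reference, candidate, check_exact=True)
    except AssertionError:
        return False
    return True


# Hypothetical binned output, standing in for two CSV files on disk.
csv_text = "dataset,nMuon,n\ndata,0,10\ndata,1,5\n"
index_columns = ["dataset", "nMuon"]

ref = pd.read_csv(io.StringIO(csv_text), index_col=index_columns)
new = pd.read_csv(io.StringIO(csv_text), index_col=index_columns)
changed = new.assign(n=new["n"] + 1)  # same shape, different bin contents

print(frames_identical(ref, new))
print(frames_identical(ref, changed))
```

A check like this only says whether anything differs, not where or by how much, which is why a statistical comparison is discussed further down.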

kreczko commented 5 years ago

On 2019-03-04 Lukasz Kreczko (kreczko) wrote:

Thanks for the examples, this will be useful.

I am trying to get the diff into a similar shape to the ROOT version:

- calculate KS & p-value for all 1D projections
- display differing projections
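
A minimal sketch of the two steps above for binned CSV output, assuming a flat table with a count column `n` and category/variable columns such as `dataset` and `nMuon`. The function names are hypothetical, and the p-value uses the asymptotic Kolmogorov distribution with raw bin counts as the number of entries, which is only an approximation for binned or weighted data.

```python
import io
import math

import pandas as pd


def kolmogorov_prob(z, terms=100):
    """Asymptotic Kolmogorov survival function Q(z) = 2 * sum (-1)^(k-1) exp(-2 k^2 z^2)."""
    if z < 0.2:
        # The series converges poorly here and Q(z) is indistinguishable from 1.
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def ks_for_projection(reference, candidate, variable, weight="n"):
    """Project both tables onto one variable and compare their cumulative shapes."""
    ref = reference.groupby(variable)[weight].sum()
    new = candidate.groupby(variable)[weight].sum()
    # Make sure both projections cover the same bins, in the same order.
    ref, new = ref.align(new, fill_value=0.0)
    ref, new = ref.sort_index(), new.sort_index()
    n_ref, n_new = float(ref.sum()), float(new.sum())
    distance = float((ref.cumsum() / n_ref - new.cumsum() / n_new).abs().max())
    z = distance * math.sqrt(n_ref * n_new / (n_ref + n_new))
    return distance, kolmogorov_prob(z)


def differing_projections(reference, candidate, variables, threshold=0.05):
    """Return {variable: (distance, p_value)} for projections below the threshold."""
    results = {v: ks_for_projection(reference, candidate, v) for v in variables}
    return {v: r for v, r in results.items() if r[1] < threshold}


# Hypothetical example tables, standing in for two CSV files.
ref_df = pd.read_csv(io.StringIO(
    "dataset,nMuon,n\ndata,0,10\ndata,1,5\nmc,0,10\nmc,1,5\n"))
new_df = pd.read_csv(io.StringIO(
    "dataset,nMuon,n\ndata,0,5\ndata,1,10\nmc,0,5\nmc,1,10\n"))

print(ks_for_projection(ref_df, ref_df, "nMuon"))
print(ks_for_projection(ref_df, new_df, "nMuon"))
print(differing_projections(ref_df, new_df, ["dataset", "nMuon"], threshold=0.1))
```

The threshold and the choice of `n` as the weight column are placeholders; which columns count as categories, variables and statistical data is exactly the open question for the command-line settings.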

For the current CSV files that's essentially identifying the category, variables & statistical data. I will have to think about how to do this in a general way (like for ROOT) without being too verbose with the settings (e.g. `pandas_diff -c dataset, --var nMuon, nIsoMuons, -n n`).

Maybe worth looking at fast-plotter for this?

kreczko commented 5 years ago

On 2019-03-04 Benjamin Krikler (bkrikler) wrote:

Yes, fast-plotter could be quite helpful for this. It depends a bit on how generic or specific you want to be, however, i.e. is this a pandas-diff function, or a "fast binned dataframe"-diff? I think if it's the former it could be tricky to do this in some meaningful but general way, at least if the pandas dataframes are stored as CSV files (with binary files you'd lose less info, like which columns are actually in the index). If you're comfortable being more specific to fast-carpenter's outputs then fast-plotter could be quite helpful, since it wraps reloading the CSV files, and gives utilities to project and sum, plus potentially plot the resulting differences.
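
To illustrate the point about CSV losing index information, here is a small sketch in plain pandas (fast-plotter itself is not used, and the table contents are invented). Pickle stands in for "a binary format"; the index column names have to be supplied by hand when reloading the CSV.

```python
import io
import pickle

import pandas as pd

# Hypothetical binned dataframe with a two-level index.
binned = pd.DataFrame(
    {"dataset": ["data", "data", "mc", "mc"],
     "nMuon": [0, 1, 0, 1],
     "n": [10, 5, 12, 4]}
).set_index(["dataset", "nMuon"])

csv_text = binned.to_csv()

# Reloading the CSV naively: the former index levels come back as ordinary columns.
from_csv = pd.read_csv(io.StringIO(csv_text))

# Reloading with prior knowledge of which columns formed the index.
from_csv_indexed = pd.read_csv(io.StringIO(csv_text), index_col=["dataset", "nMuon"])

# A binary round trip keeps the index without any extra settings.
from_binary = pickle.loads(pickle.dumps(binned))

print(list(from_csv.columns), list(from_csv.index.names))
print(list(from_csv_indexed.index.names))
print(list(from_binary.index.names))

# With the index restored, projecting onto one variable is a sum over the other level.
print(from_csv_indexed.groupby(level="nMuon")["n"].sum())
```

A generic pandas diff would need the user to pass this index information, whereas a diff specific to fast-carpenter output could rely on its conventions instead.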

kreczko commented 5 years ago

On 2019-03-04 Lukasz Kreczko (kreczko) wrote:

Yes, I am thinking more fast_binned_df_diff