FAST-HEP / fast-plotter

Manipulate binned pandas dataframes into plots
https://fast-hep.web.cern.ch

Fix errors, add features #26

Closed benkrikler closed 4 years ago

benkrikler commented 4 years ago

Several improvements:

Still to come:

codecov[bot] commented 4 years ago

Codecov Report

Merging #26 into master will not change coverage. The diff coverage is n/a.

@@          Coverage Diff           @@
##           master     #26   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files           7       7           
  Lines         707     772   +65     
======================================
- Misses        707     772   +65     

benkrikler commented 4 years ago

The under/overflow bins can now be enabled or disabled from the config file. Adding:

no_over_underflow: False

to the config will draw the bins; setting it to True will hide them (which is the default).
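
As a rough illustration of what this flag controls, the sketch below drops any bin with an infinite edge when no_over_underflow is True. It is not taken from fast-plotter: the function name select_bins and the (low, high, count) tuple layout are assumptions made for the example.

```python
import math


def select_bins(bins, no_over_underflow=True):
    """Return the bins to draw.

    `bins` is a list of (low_edge, high_edge, count) tuples; this layout is
    hypothetical and only serves the example.
    """
    if not no_over_underflow:
        # Flag set to False: keep every bin, including under/overflow.
        return list(bins)
    # Default (True): hide bins whose edges extend to +/- infinity.
    return [b for b in bins if math.isfinite(b[0]) and math.isfinite(b[1])]


bins = [(-math.inf, 0.0, 2), (0.0, 100.0, 10), (100.0, math.inf, 1)]
hidden = select_bins(bins)                          # default: finite bins only
shown = select_bins(bins, no_over_underflow=False)  # all three bins
```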

benkrikler commented 4 years ago

The latest push adds the functionality to give each dataset a colour directly. In the config file, add a section called:

dataset_colours:
   # <dataset_name>: <colour_spec>
   ttbar: red  # use a named colour
   wz: "#3355c5" # Use a Hex string
   dy: [0.3, 0.4, 0.8] # Use a 3-tuple for RGB with numbers between 0 and 1

The colour specifications are passed directly to matplotlib, so any of the formats described in https://matplotlib.org/3.1.0/tutorials/colors/colors.html can be used.

Any datasets that are not given a colour will get their colour from the default colour map mechanism that has been used so far. This new method just allows us to override those colours.
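
The override-then-fallback behaviour described above could look roughly like the sketch below. The function assign_colours and the fallback cycle ("C0", "C1", ...) are assumptions for illustration only; the actual default in fast-plotter is its colour map mechanism, which is not reproduced here.

```python
from itertools import cycle


def assign_colours(datasets, dataset_colours=None,
                   fallback=("C0", "C1", "C2", "C3")):
    """Map each dataset to a colour spec, preferring explicit overrides.

    `fallback` stands in for the default colour map (hypothetical here).
    """
    overrides = dataset_colours or {}
    defaults = cycle(fallback)
    colours = {}
    for name in datasets:
        if name in overrides:
            # Explicit entry from the dataset_colours config section.
            colours[name] = overrides[name]
        else:
            # No override: take the next colour from the default sequence.
            colours[name] = next(defaults)
    return colours


config = {"ttbar": "red", "wz": "#3355c5", "dy": [0.3, 0.4, 0.8]}
colours = assign_colours(["ttbar", "wz", "dy", "qcd"], config)
```

Here "qcd" has no entry in the config, so it is the only dataset that receives a fallback colour.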

eshwen commented 4 years ago

The under/ overflow bins can now be disabled / enabled from the config file. Adding:

no_over_underflow: False

to the config will draw the bins, setting it to True will hide them (which is the default).

This option doesn't seem stable. If I run over the dataframe tbl_dataset.ht--ht.csv.zip (zipped since csv files aren't supported in comments) with no_over_underflow: False, I get the following error message:

fast_plotter.plotting - INFO - Making 1D Projection: ht
fast_plotter.plotting - ERROR - Couldn't plot 1D projection: ht
Traceback (most recent call last):
  File "/home/hep/ebhal/chip_software/src/fast-plotter/fast_plotter/plotting.py", line 56, in plot_all
    figsize=figsize, **kwargs
  File "/home/hep/ebhal/chip_software/src/fast-plotter/fast_plotter/plotting.py", line 335, in plot_1d_many
    colourmap=colourmap, dataset_order=dataset_order)
  File "/home/hep/ebhal/chip_software/src/fast-plotter/fast_plotter/plotting.py", line 182, in actually_plot
    vals.apply(filler, axis=0, step="mid")
  File "/home/hep/ebhal/miniconda3/envs/chip_env/lib/python3.7/site-packages/pandas/core/frame.py", line 6928, in apply
    return op.get_result()
  File "/home/hep/ebhal/miniconda3/envs/chip_env/lib/python3.7/site-packages/pandas/core/apply.py", line 186, in get_result
    return self.apply_standard()
  File "/home/hep/ebhal/miniconda3/envs/chip_env/lib/python3.7/site-packages/pandas/core/apply.py", line 292, in apply_standard
    self.apply_series_generator()
  File "/home/hep/ebhal/miniconda3/envs/chip_env/lib/python3.7/site-packages/pandas/core/apply.py", line 321, in apply_series_generator
    results[i] = self.f(v)
  File "/home/hep/ebhal/miniconda3/envs/chip_env/lib/python3.7/site-packages/pandas/core/apply.py", line 112, in f
    return func(x, *args, **kwds)
  File "/home/hep/ebhal/chip_software/src/fast-plotter/fast_plotter/plotting.py", line 148, in __call__
    color=color, linewidth=width, where="mid", label=label, linestyle=style)
  File "/home/hep/ebhal/chip_software/src/fast-plotter/fast_plotter/plotting.py", line 435, in draw
    fill_val=fill_val, expected_xs=expected_xs)
  File "/home/hep/ebhal/chip_software/src/fast-plotter/fast_plotter/plotting.py", line 218, in standardize_values
    x, y_values = add_missing_vals(x, expected_xs, y_values=y_values, fill_val=fill_val)
  File "/home/hep/ebhal/chip_software/src/fast-plotter/fast_plotter/plotting.py", line 256, in add_missing_vals
    new[insert] = y
ValueError: ('NumPy boolean array indexing assignment cannot assign 20 input values to the 18 output values where the mask is true', 'occurred at index VH')
fast_plotter.plotting - ERROR - None
fast_plotter.plotting - ERROR - ('NumPy boolean array indexing assignment cannot assign 20 input values to the 18 output values where the mask is true', 'occurred at index VH')

The dataset "VH" doesn't exist in the dataframe but is in the dataset_order list in my plotting config (since the config is general, and not all datasets will have entries in every dataframe). This hasn't been a problem before, and when I remove the no_over_underflow: False line from my config, it works fine.

benkrikler commented 4 years ago

Thanks for letting me know. I'll take a look at the DFs tonight. Will also add the option to control the error calculation from the config.

benkrikler commented 4 years ago

This should now be fixed. The issue was really subtle: when we were replacing the infs of the under/overflow bins it was also modifying the list of values we expected to see (which is used to add in missing bins). The traceback you saw was an indirect consequence of this.
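
A minimal sketch of this kind of aliasing bug, assuming the expected values were held in a mutable list that the inf-replacement step edited in place. The helper names and the numbers are hypothetical and do not come from fast-plotter. With a NumPy boolean mask, a count mismatch like the one below is what produces the "cannot assign 20 input values to the 18 output values" error in the traceback (two missing matches, one each for underflow and overflow).

```python
import math


def replace_infs_in_place(edges, low, high):
    """Swap +/-inf for finite stand-ins, mutating `edges` itself."""
    for i, edge in enumerate(edges):
        if edge == -math.inf:
            edges[i] = low
        elif edge == math.inf:
            edges[i] = high
    return edges


def count_matches(x, expected_xs):
    """Count how many values of `x` appear in the expected list."""
    expected = set(expected_xs)
    return sum(1 for value in x if value in expected)


x = [-math.inf, 0.0, 100.0, 200.0, math.inf]

# Buggy: the shared list of expected values is modified as a side effect.
expected_xs = list(x)
replace_infs_in_place(expected_xs, -100.0, 300.0)
buggy_matches = count_matches(x, expected_xs)

# Fixed: operate on a copy so the expected values stay untouched.
expected_xs = list(x)
plot_xs = replace_infs_in_place(list(expected_xs), -100.0, 300.0)
fixed_matches = count_matches(x, expected_xs)
```

In the buggy path only the three finite values still match, while all five match once the replacement works on a copy.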

benkrikler commented 4 years ago

Last few changes:

  1. Fix the above errors
  2. Replace the no_over_underflow option with a show_over_underflow option
  3. Pass through config parameters to control the error calculation method. Add err_from_sumw2: True to the config to use the sum of squared weights as the variance; otherwise the variance will be given by (sum w)^2 / n.
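
To make the two variance definitions in item 3 concrete, here is a small sketch that computes both for the weights in a single bin. The function bin_variance is hypothetical and works on raw per-event weights, which is an assumption for the example; fast-plotter itself works from binned dataframes.

```python
def bin_variance(weights, err_from_sumw2=True):
    """Variance of a bin's yield under the two conventions from item 3."""
    if err_from_sumw2:
        # Sum of squared weights.
        return sum(w * w for w in weights)
    n = len(weights)
    # (sum w)^2 / n; an empty bin is given zero variance in this sketch.
    return sum(weights) ** 2 / n if n else 0.0


weights = [0.5, 0.5, 2.0]
var_sumw2 = bin_variance(weights, err_from_sumw2=True)     # 4.5
var_default = bin_variance(weights, err_from_sumw2=False)  # 3.0
```

The two definitions agree when every event carries the same weight, and differ once the weights vary, as in the example above.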

eshwen commented 4 years ago

I think there's a bug when plotting over/underflow bins: background processes plot fine, but data does not. For example, the plot plot_dataset leadLepton_pt--lead_lepton_pt--weight_nominal--project_leadLepton_pt-yscale_log contains the overflow bin [350.0, inf), which is drawn for background but not for data. Checking the dataframe that was used to plot it, tbl_dataset.leadLepton_pt--lead_lepton_pt.csv.zip (zipped because I can't include .csv files inline), shows that there are overflow bins for data (SingleElectron*) which contain events.

benkrikler commented 4 years ago

Thanks for spotting that issue. It should now be solved in both the absolute yield plot and the ratio:

plot_dataset dimu_mass--DiMuonMass--weighted--project_dimu_mass-yscale_log