Improve `pineappl plot` Python interaction

NNPDF / pineappl

PineAPPL is not an extension of APPLgrid

https://nnpdf.github.io/pineappl/

GNU General Public License v3.0

12 stars 3 forks source link

Improve `pineappl plot` Python interaction #249

Open alecandido opened 8 months ago

alecandido commented 8 months ago

Currently, pineappl plot relies on Python and Matplotlib. This is actually quite nice, because people using PineAPPL are often familiar with Python, and know how to plot the information, once given.

A current mild pain-point is that there is a lot of string manipulation involved in the call, since a Python script is generated everytime, embedding the data: https://github.com/NNPDF/pineappl/blob/3476629d40f4b4af164ce067ec2d029781be048c/pineappl_cli/src/plot.rs#L508-L521

However, dumping the script is a design featured, in such a way that people are able to customize it.

I see a few possible ways of improving:

split the two pineappl plot operations: dump the script and the data separately (in a common data format, like CSV)
leverage pineappl_py in the script, and directly extract the data from the grid (requires the installation of the PineAPPL python extension)
move the plotting feature to pineappl_py itself (possibly with a pineappl-plot CLI, but it would be installed separately, at least until #176 will be completed)
use PyO3 to make the call from the CLI (i.e. from Rust) - in this way people will be able to customize the script and pass it again, but there won't be any need of dumping the data any longer (most likely this is an overkill)

alecandido commented 8 months ago

This is not urgent at all, it is just a consideration that came to my mind reading #248, nothing more.

cschwan commented 8 months ago

I had similar considerations in the past, but

I like having the data and the script in one file, because then I can easily move/copy/rename the single file. This will become more complicated if we split things into more files.
Using pineappl_py in the Python plotting script is possible, but then we introduce two more dependencies: pineappl_py and LHAPDF. Especially the latter, particularly the Python version, is a pain in the neck to get working. Another downside is that the script will take much longer, which we probably want to avoid, given that we often work on the Python script to modify the plot, without having to regenerate the data.
Having the functionality available from Python we should definitely consider, although my plan was actually to eventually expose the entire CLI functionality in another, higher-level, crate, for which we then can write Python bindings, of course.
This suggestion I don't understand.

That being said I believe the plotting script(s) and functionality can be significantly improved. I was thinking of using a templating engine in such a way that the template is really kept in one file. This mostly is the case right now (in this file: https://github.com/NNPDF/pineappl/blob/master/pineappl_cli/src/plot.py), but writing out the data could be improved, for instance.

alecandido commented 8 months ago

Regarding the data handling, I see your point. However, I prefer to keep things separate, and rather gather through other means (e.g. folders or archives), since it makes it easier to consume them even for further purposes. Fine to cache the results anyhow, but it could be implemented really as cache: if this file exists, use it, otherwise recompute it. The file name would stem from the grid, with the option to manually pass a file.

The idea of moving to Python is really to avoid the template (with or without a suitable engine). This is because I appreciate more libraries than code generators (less copy-paste, version management, and similar benefits). They are even easier to test and debug, because they are native in the language they are written in (while you'd need to use Python to test the script generated from Rust).

Since this part of the CLI functionality is already mostly Python, I would expose the interface through Python. We could add the bindings for the rest in the same package of the plotting CLI.

P.S.: regarding point 4., the proposal was to write a Python function main(data, metadata), and make the call to main from Rust, through PyO3, such that the data are produced by Rust (with its own native data types) and marshalled to Python by PyO3, without the need of dumps in any file (not the source, nor anything else) - as I said, it's not needed, and it's also against your desiderata (i.e. caching)

cschwan commented 8 months ago

The idea of moving to Python is really to avoid the template (with or without a suitable engine). This is because I appreciate more libraries than code generators (less copy-paste, version management, and similar benefits). They are even easier to test and debug, because they are native in the language they are written in (while you'd need to use Python to test the script generated from Rust).

I think the crucial point that you dispute is whether we want to directly output the generated plot or a plotting script that in turn generates the plot (where the code sits is another question). I chose the latter for several reasons:

if the generated script is fine, I can run pineappl plot ... | python and directly get the plot; I don't need to bother with any additional files
however, very often, I want to modify the script a bit because because every plot has some quirks, and colleagues often have special wishes. In that case I can save the plotting script to a file, modify the file and run it to generate the plot. Alternatively, I can make the modifications automatically using tools that use the pipe (see @andreab1997's example here: https://github.com/NNPDF/papers/blob/master/nnpdf40mhou/plots/pheno/generate_pheno_plots.sh). Often, I have to iterate this several times until the plots look OK.

I don't think you can avoid having a code generator, unless you invent an API that sits on top of matplotlib, for instance (is that what you meant?). However, I don't see the added value of that - for me matplotlib IS the API to plot the data. Which advantages do you see?

alecandido commented 8 months ago

I don't think you can avoid having a code generator, unless you invent an API that sits on top of matplotlib, for instance (is that what you meant?). However, I don't see the added value of that - for me matplotlib IS the API to plot the data. Which advantages do you see?

Actually, this is the whole point, and nothing else.

Consider that partly you're already making such an API: https://github.com/NNPDF/pineappl/blob/e71b39acddaf2d09199fdb298f5d5afe38e771b2/pineappl_cli/src/plot.py#L104-L161 and the functions present have already enough arguments to be very customizable, without the need of patching them https://github.com/NNPDF/pineappl/blob/e71b39acddaf2d09199fdb298f5d5afe38e771b2/pineappl_cli/src/plot.py#L275-L284

The idea is that, if you want to do something very basic, your script could be a regular Python script, possibly with 3-4 lines only, importing pineappl.plot and just stating what do you want. Instead, if you want to do something more convolute, you can avoid seding a Python script, but rather make loops in Python and work with the data themselves (rather than their code representation). It's one layer less.

alecandido commented 8 months ago

The API would be just the script, maybe slightly rearranged, but no more. I.e.:

default constants
helper functions: just percent_diff and set_ylim
the main handler (that puts in the usual column the desired panels)
the frequent plots definitions
a data/metadata loader

Content of 2. and 3. are most likely always repeated. Instead, 1. is intended as a default, but the customization requires code manipulation, rather than passing different values. For 4. you would not be able to customize the code, but my guess is that you frequently don't need, since small modifications can be done by passing different values. If you instead need a quite different plot, you could make your own, even copying from the source one of the existing ones as an example (to me, it would be an improvement, but I expect this to rarely happen anyhow).

The last is just what we were discussing at the beginning.