NNPDF / pineappl

PineAPPL is not an extension of APPLgrid
https://nnpdf.github.io/pineappl/
GNU General Public License v3.0
13 stars 3 forks source link

Improve the CLI's documentation of the `remap` subcommand #89

Closed felixhekhorn closed 2 years ago

felixhekhorn commented 2 years ago

To continue the APPL comparison I will need to renormalize the grids - I saw that the remap command seems to be able to do such a thing, but I don't know how to use it; i.e. what is the "remap string" - can you provide us with an example?

cschwan commented 2 years ago

The primary purpose of remap is to map a one-dimensional distribution to an n-dimensional one (many Monte Carlo programs only produce one-dimensional distributions, and remap is our way to work around it). I'm not quite sure this is the proper use in this case, but let's try it anyway to see what you need.

You can find an example here:

https://github.com/NNPDF/runcards/blob/a48e5d4c7c6d2dae36f0a0228a11e4739f87f7be/nnpdf31_proc/ATLAS_DY_8TEV_2D_0116_0200/postrun.sh#L3

What it does is to completely ignore whatever bin limits you previously set, and apply new ones, based on a cartesian product. In the case above, '116,150,200,...;0,0.2,0.4,...' produces the two-dimensional bin limits in the following order:

The pipe | is used to modify the limits for the second observable for specific values of the first observable. If you count the number of fields divided by the pipe you'll find five, one field for each of the five bins of the first observable. An empty field just means the field previously specified. In the example above that means specifically

By default remap divides the result of each bin by the bin-width (for each observable). If you don't want that, you can disable it for specific observables using --ignore-obs-norm index where index is the index of the observable starting from 0. An example for that you can find here,

https://github.com/NNPDF/runcards/blob/master/nnpdf31_proc/ATLAS_DY_8TEV_3D_046080_0007/postrun.sh

where only the bin width for the first (0) observable is taken into account.

Finally, you can use --norm factor, which applies factor multiplicatively for all bins (for example for the femtobarn/picobarn conversion).

alecandido commented 2 years ago

Maybe @cschwan it might be an idea to extend CLI help with examples, at some point.

It's fine that help is the only docs for a CLI (unless you want to write a man page, clap-rs/clap#552), but I'd say examples saved my day so many times...

You could also consider to add a couple to https://tldr.sh/

cschwan commented 2 years ago

Commit 25fc84a413fb968b192ddd103448cc58756c3b5f adds additional documentation and even a few basic examples. Let me know what you think!

alecandido commented 2 years ago

Sorry not to have replied before. It looks good, just a small observation:

cschwan commented 2 years ago

The commit probably doesn't belong to any branch because I've merged it; what is being duplicated, where is the discussion?

alecandido commented 2 years ago

The duplication is simply in the test: since you edited the message, you had to duplicate the string in the test. Since this requires manual sync at any update, it is fragile (https://github.com/N3PDF/pineappl/pull/95#issuecomment-987631410).

cschwan commented 2 years ago

I think the main Issue here has been addressed, therefore I'll close it.

felixhekhorn commented 2 years ago
cschwan commented 2 years ago

Thanks for reviewing it, @felixhekhorn!

The | operation is exactly the + operation you described :smile:! But it's an operation that's needed because there's no combination of the other operations that lets you mimick |. Does that make sense, @felixhekhorn?

felixhekhorn commented 2 years ago

fine! the operations is to be read as "OR" ... maybe you could give this specific example?

cschwan commented 2 years ago

What I understand is that the documentation needs to improved for |, so I'm opening this Issue again.

I wouldn't call the | operation OR because the number of fields is fixed by the previous dimensions. For instance, if you have the following prescription:

x0,x1,x2,...,xn;y0,y1,...,ym;bls1|bls2|...|blsp

there must be p = n * m bin limit specifications (bls) separated by |, as in bls1|bls2|...|blsp. If you have more dimensions before a specification containing | there there must be as many bin limit specifications separated by | as the product of the number of bins for each dimension coming before |.

cschwan commented 2 years ago

You can also think of | as making the bin limits dependent on the previous dimensions. Without it the cartesian product makes sure that all the bin limits are independent of each other. A good example of that is ATLAS_DY_8TEV_3D, which has three dimensions,

but the bins of rapidity depend on the invariant mass and the angle (the previous dimensions). The further you go away from the Z peak at roughly 90.0 and the more extreme the angle is, the more rapidity bins are 'cut out' of the cartesian product 'cube'.

alecandido commented 2 years ago

I'm still missing something: in the example there is written

'0,1,2;0,1,2|0,2,4' expects a grid with a total of 4 bins and produces the following bin limits: (0-1;0-1), (0-1;1-2), (1-2;0-2), (1-2;2-4)

so I get that the 0,1,2 before | is referred to the 0,1 in the first section (before ;) and 0,2,4 is related to 1,2 in the first section.

Later on, there is the other example on syntactic sugar for repetition:

for six bins '0,1,2;0,1,2|0,1,2|0,2,4' can be more succinctly written as '0,1,2;0,1,2||0,2,4'

but in the case there are 2 bins in the first dimension (before ;), i.e. 0-1 and 1-2, while they're mapped over three different slices in the second dimension.

I'm confused, I'd have expected something like '0,1,2,3;0,1,2|0,1,2|0,2,4'...

alecandido commented 2 years ago

About the | operator: maybe you don't care that much, but in Python they assigned a meaning to that operator in the specific context of sets. It means union.

(Very recently they started using that operator even in the context of types, like in Typescript, but the one for actual sets existed at least from 3.0, if not before...)

cschwan commented 2 years ago

I'm still missing something: in the example there is written

'0,1,2;0,1,2|0,2,4' expects a grid with a total of 4 bins and produces the following bin limits: (0-1;0-1), (0-1;1-2), (1-2;0-2), (1-2;2-4)

so I get that the 0,1,2 before | is referred to the 0,1 in the first section (before ;) and 0,2,4 is related to 1,2 in the first section.

Later on, there is the other example on syntactic sugar for repetition:

for six bins '0,1,2;0,1,2|0,1,2|0,2,4' can be more succinctly written as '0,1,2;0,1,2||0,2,4'

but in the case there are 2 bins in the first dimension (before ;), i.e. 0-1 and 1-2, while they're mapped over three different slices in the second dimension.

I'm confused, I'd have expected something like '0,1,2,3;0,1,2|0,1,2|0,2,4'...

You're right, the example is clearly wrong, good job for spotting the mistake!

felixhekhorn commented 2 years ago

About the | operator: maybe you don't care that much, but in Python they assigned a meaning to that operator in the specific context of sets. It means union.

That was actually what I was thinking about: "OR" in a bitwise sense, really meaning UNION (as implemented in C).

You can also think of | as making the bin limits dependent on the previous dimensions.

however, as @cschwan said, this is not really what the operator is doing ... "THEN" would be a better name, if I understood correctly

there must be p = n * m bin limit specifications (bls) separated by |, as in bls1|bls2|...|blsp

felixhekhorn commented 2 years ago

there must be p = n * m bin limit specifications (bls) separated by |, as in bls1|bls2|...|blsp

and this should really be mentioned in the docs

alecandido commented 2 years ago

I guess there is no default name for that operator. @cschwan actually needs a separator further than ,, ;, and :, but unfortunately all common punctuation characters have some meaning as operators.

Most likely | is good enough: / might also be fine (and it also has a well defined, different meaning), and \ is bad, since it's used for escaping and would give plenty of issues. All the others ($, %, @, #, !, &, ", ', =, ?, ^) are meaningless, and/or huge, and/or dangerous, and the other common separator - would be even more unexpected in this context.

For me, the three good ones are |, /, and _ (this last sometimes used as digit separator, without extra meaning). And maybe | is actually the best...

EDIT: of course I forgot +, the one you used in your first example, but it really looks weird as a separator, and , should have precedence over +. There is really no clean solution...

felixhekhorn commented 2 years ago

The operator | is perfectly fine - I think, this is just a documentation issue; how to explain the operation in a proper way ... (actually + would have been also a good choice, but let's stick to what we have )

alecandido commented 2 years ago

The operator | is perfectly fine - I think, this is just a documentation issue; how to explain the operation in a proper way ... (actually + would have been also a good choice, but let's stick to what we have )

Yeah, compare to other choices + is not that bad, you're right. But | is good enough, I pointed out the other meaning just to keep it mind while explaining the local one. And as you said, the union is somehow the refinement of the OR meaning, even though it's a different one. But being "compatible" it has been a good choice, unfortunately here we don't have such an option...

cschwan commented 2 years ago

I'd say + is confusing, because you'd assume it to be part of a number, for instance as in +1,2;-1,2 (technically not needed, but instinctively I think I would read it like that). It would also be confusing if you have negative bin limits, for instance -1,1+-2,2. With the pipe this is clearly better: -1,1|-2,2. I agree with @felixhekhorn that this really is a documentation problem.

cschwan commented 2 years ago

Commit f90f3ca90b56e818d62d65e4aa038c29fa35c4f7 improves the description. Please let me know what you think.

felixhekhorn commented 2 years ago

Definitely better. Maybe we can improve even more? How much white space or space in general can we use? how about going more into actual Python doc syntax?

- ',': The comma ',' constructs 1-dimensional bin limits (1DBL).
The bin limits are listed in a consecutive order and separated by a single ','.

Examples
--------------
- '0,0.2,0.4,0.6,1' expects the grid to have 4 bins whose bin limits will be (0-0.2), (0.2-0.4),
(0.4-0.6) and (0.6,1)

- ';': The semicolon ';' constructs n-dimensional bin limits (NDBL) from a cartesian product of 1DBL.
The 1DBL are listed by in a consecutive dimensional order separated by a single ';'.

Examples
-------------
-  '0,0.5,1;0,1,2' expects the grid to have 4 bins, whose 2DBL will be are: (0-0.5;0-1), (0-0.5;1-2), (0.5-1;0-1) and
(0.5-1;1-2)

- '|': The pipe '|' constructs NDBL which change by a step in the previous dimension.
The previous operators are enough to contruct NDBL with differently-sized bins, but they can
not construct the following bin limits: (0-1;0-1), (0-1;1-2), (1-2;0-2), (1-2;2-4), (1-2;4-6); here
the 1DBL for the second dimension depend on the first dimension and also have a different number of
bins. For the first two bins the 1DBL is '0,1,2', but for the last three bins the 1DBL are
'0,2,4,6'. This can be achieved using the following remapping string: '0,1,2;0,1,2|0,2,4,6'. Note
that there have to be two 1DBL separated by '|', because the first dimension has two bins. If there
are more dimensions and/or bins, the number of 1DBL separated by '|' must match this number
accordingly. [some adjustments needed here, maybe]

Examples
-------------
'0,1,2;-2,0,2;0,1,2|1,2,3|2,3,4|3,4,5|4,5,6|5,6,7'. Here the third dimension has 6 1DBL separated
by '|' because the first dimension has 2 bins and the second dimension has 3 bins, so `6 = 2 * 3`.
If the 1DBL is an empty string, the previous 1DBL is repeated, for
example '0,1,2;0,1,2;0,1,2||0,2,4' is shorthand for '0,1,2;0,1,2;0,1,2|0,1,2|0,2,4'

[etc.]
alecandido commented 2 years ago

For me, it's perfect: the concept is a bit tough, but there is nothing we can do, except explaining as good as possible, and I'd say it's well explained.

Unfortunately now it's a whole bunch of text, so I'd like to have more formatting options (like markdown) to make it more readable. clap doesn't offer anything like that, so the only other option might be to move it to an external source (most likely the best would be extremely simple as md, or readable as html), with the option to keep it embedded (e.g. like rust docs in rustup), or move somewhere else (e.g. online and link it).

cschwan commented 2 years ago

Separating the examples from the description is a very good idea and formatting it better also. As far as I understand the text is simply dumped after the help, but I'm looking into it the clap documentation to see if there's anything better...

alecandido commented 2 years ago

Separating the examples from the description is a very good idea and formatting it better also. As far as I understand the text is simply dumped after the help, but I'm looking into it the clap documentation to see if there's anything better...

I already had a look some weeks ago, but that's all they offer (and as you said it's simply a print after the generated help, or before with the partner function). There are libraries that output all termcodes for formatting, like https://github.com/Canop/termimad (in python we use https://github.com/Textualize/rich), but I don't know if it's a good idea to pass a string with termcodes to clap.

felixhekhorn commented 2 years ago

or move somewhere else (e.g. online and link it)

that might actually be the best option ... since online we can have infinite format

PS: and it would solve the duplication issue :upside_down_face:

alecandido commented 2 years ago

PS: and it would solve the duplication issue upside_down_face

Duplication issue still remains, because it was mainly testing the part generated by clap. To be honest, I like a lot that with rustup you can bring up docs offline, but maybe the solution would be to simply use docstrings (and document pineappl_cli crate), and then we can both link https://docs.rs, and you can also obtain offline with cargo doc --open. Still, having it in terminal is much more direct...

cschwan commented 2 years ago

We probably want this: https://github.com/clap-rs/clap/issues/2389. I can try the verbatim_doc_comment attribute in the meantime. Moving the documentation to somewhere else online might also be a good idea.

alecandido commented 2 years ago

I believe we already got something pretty good, what do you think is still missing for this? @cschwan

We could do a much better job if clap exposed more tools, but given the current status I believe is fine.

The other option of documenting pineappl_cli crate (and publish documentation) is always available, though I consider that it is not needed specifically for this issue (but we could want it in general). In case, let's open another issue, but I'm not advocating for it: for me it's fine as it is.

cschwan commented 2 years ago

@AleCandido I agree, let's close this Issue.