Closed felixhekhorn closed 2 years ago
The primary purpose of remap
is to map a one-dimensional distribution to an n-dimensional one (many Monte Carlo programs only produce one-dimensional distributions, and remap
is our way to work around it). I'm not quite sure this is the proper use in this case, but let's try it anyway to see what you need.
You can find an example here:
What it does is to completely ignore whatever bin limits you previously set, and apply new ones, based on a cartesian product. In the case above, '116,150,200,...;0,0.2,0.4,...' produces the two-dimensional bin limits in the following order:
The pipe |
is used to modify the limits for the second observable for specific values of the first observable. If you count the number of fields divided by the pipe you'll find five, one field for each of the five bins of the first observable. An empty field just means the field previously specified. In the example above that means specifically
By default remap
divides the result of each bin by the bin-width (for each observable). If you don't want that, you can disable it for specific observables using --ignore-obs-norm index
where index
is the index of the observable starting from 0
. An example for that you can find here,
https://github.com/NNPDF/runcards/blob/master/nnpdf31_proc/ATLAS_DY_8TEV_3D_046080_0007/postrun.sh
where only the bin width for the first (0
) observable is taken into account.
Finally, you can use --norm factor
, which applies factor
multiplicatively for all bins (for example for the femtobarn/picobarn conversion).
Maybe @cschwan it might be an idea to extend CLI help with examples, at some point.
It's fine that help is the only docs for a CLI (unless you want to write a man
page, clap-rs/clap#552), but I'd say examples saved my day so many times...
You could also consider to add a couple to https://tldr.sh/
Commit 25fc84a413fb968b192ddd103448cc58756c3b5f adds additional documentation and even a few basic examples. Let me know what you think!
Sorry not to have replied before. It looks good, just a small observation:
The commit probably doesn't belong to any branch because I've merged it; what is being duplicated, where is the discussion?
The duplication is simply in the test: since you edited the message, you had to duplicate the string in the test. Since this requires manual sync at any update, it is fragile (https://github.com/N3PDF/pineappl/pull/95#issuecomment-987631410).
I think the main Issue here has been addressed, therefore I'll close it.
There is a small typo (00.5
)
https://github.com/N3PDF/pineappl/blob/aeddd4fb907ec0fbb91cc01972e539f8bcf67c41/pineappl_cli/src/remap.rs#L28
I still don't get properly the |
operation: 0,1,2;0,1,2|0,2,4
is this equivalent to 0,1;0,1,2
+ 1,2;0,2,4
(where +
is meant in a mathematical set meaning ...)
Thanks for reviewing it, @felixhekhorn!
The |
operation is exactly the +
operation you described :smile:! But it's an operation that's needed because there's no combination of the other operations that lets you mimick |
. Does that make sense, @felixhekhorn?
fine! the operations is to be read as "OR" ... maybe you could give this specific example?
What I understand is that the documentation needs to improved for |
, so I'm opening this Issue again.
I wouldn't call the |
operation OR because the number of fields is fixed by the previous dimensions. For instance, if you have the following prescription:
x0,x1,x2,...,xn;y0,y1,...,ym;bls1|bls2|...|blsp
there must be p = n * m
bin limit specifications (bls
) separated by |
, as in bls1|bls2|...|blsp
. If you have more dimensions before a specification containing |
there there must be as many bin limit specifications separated by |
as the product of the number of bins for each dimension coming before |
.
You can also think of |
as making the bin limits dependent on the previous dimensions. Without it the cartesian product makes sure that all the bin limits are independent of each other. A good example of that is ATLAS_DY_8TEV_3D, which has three dimensions,
but the bins of rapidity depend on the invariant mass and the angle (the previous dimensions). The further you go away from the Z peak at roughly 90.0 and the more extreme the angle is, the more rapidity bins are 'cut out' of the cartesian product 'cube'.
I'm still missing something: in the example there is written
'0,1,2;0,1,2|0,2,4' expects a grid with a total of 4 bins and produces the following bin limits: (0-1;0-1), (0-1;1-2), (1-2;0-2), (1-2;2-4)
so I get that the 0,1,2
before |
is referred to the 0,1
in the first section (before ;
) and 0,2,4
is related to 1,2
in the first section.
Later on, there is the other example on syntactic sugar for repetition:
for six bins '0,1,2;0,1,2|0,1,2|0,2,4' can be more succinctly written as '0,1,2;0,1,2||0,2,4'
but in the case there are 2 bins in the first dimension (before ;
), i.e. 0-1
and 1-2
, while they're mapped over three different slices in the second dimension.
I'm confused, I'd have expected something like '0,1,2,3;0,1,2|0,1,2|0,2,4'
...
About the |
operator: maybe you don't care that much, but in Python they assigned a meaning to that operator in the specific context of sets. It means union.
(Very recently they started using that operator even in the context of types, like in Typescript, but the one for actual sets existed at least from 3.0, if not before...)
I'm still missing something: in the example there is written
'0,1,2;0,1,2|0,2,4' expects a grid with a total of 4 bins and produces the following bin limits: (0-1;0-1), (0-1;1-2), (1-2;0-2), (1-2;2-4)
so I get that the
0,1,2
before|
is referred to the0,1
in the first section (before;
) and0,2,4
is related to1,2
in the first section.Later on, there is the other example on syntactic sugar for repetition:
for six bins '0,1,2;0,1,2|0,1,2|0,2,4' can be more succinctly written as '0,1,2;0,1,2||0,2,4'
but in the case there are 2 bins in the first dimension (before
;
), i.e.0-1
and1-2
, while they're mapped over three different slices in the second dimension.I'm confused, I'd have expected something like
'0,1,2,3;0,1,2|0,1,2|0,2,4'
...
You're right, the example is clearly wrong, good job for spotting the mistake!
About the
|
operator: maybe you don't care that much, but in Python they assigned a meaning to that operator in the specific context of sets. It means union.
That was actually what I was thinking about: "OR" in a bitwise sense, really meaning UNION (as implemented in C).
You can also think of | as making the bin limits dependent on the previous dimensions.
however, as @cschwan said, this is not really what the operator is doing ... "THEN" would be a better name, if I understood correctly
there must be p = n * m bin limit specifications (bls) separated by |, as in bls1|bls2|...|blsp
there must be p = n * m bin limit specifications (bls) separated by |, as in bls1|bls2|...|blsp
and this should really be mentioned in the docs
I guess there is no default name for that operator. @cschwan actually needs a separator further than ,
, ;
, and :
, but unfortunately all common punctuation characters have some meaning as operators.
Most likely |
is good enough: /
might also be fine (and it also has a well defined, different meaning), and \
is bad, since it's used for escaping and would give plenty of issues.
All the others ($
, %
, @
, #
, !
, &
, "
, '
, =
, ?
, ^
) are meaningless, and/or huge, and/or dangerous, and the other common separator -
would be even more unexpected in this context.
For me, the three good ones are |
, /
, and _
(this last sometimes used as digit separator, without extra meaning). And maybe |
is actually the best...
EDIT: of course I forgot +
, the one you used in your first example, but it really looks weird as a separator, and ,
should have precedence over +
. There is really no clean solution...
The operator |
is perfectly fine - I think, this is just a documentation issue; how to explain the operation in a proper way ... (actually +
would have been also a good choice, but let's stick to what we have )
The operator
|
is perfectly fine - I think, this is just a documentation issue; how to explain the operation in a proper way ... (actually+
would have been also a good choice, but let's stick to what we have )
Yeah, compare to other choices +
is not that bad, you're right. But |
is good enough, I pointed out the other meaning just to keep it mind while explaining the local one.
And as you said, the union is somehow the refinement of the OR meaning, even though it's a different one. But being "compatible" it has been a good choice, unfortunately here we don't have such an option...
I'd say +
is confusing, because you'd assume it to be part of a number, for instance as in +1,2;-1,2
(technically not needed, but instinctively I think I would read it like that). It would also be confusing if you have negative bin limits, for instance -1,1+-2,2
. With the pipe this is clearly better: -1,1|-2,2
. I agree with @felixhekhorn that this really is a documentation problem.
Commit f90f3ca90b56e818d62d65e4aa038c29fa35c4f7 improves the description. Please let me know what you think.
Definitely better. Maybe we can improve even more? How much white space or space in general can we use? how about going more into actual Python doc syntax?
- ',': The comma ',' constructs 1-dimensional bin limits (1DBL).
The bin limits are listed in a consecutive order and separated by a single ','.
Examples
--------------
- '0,0.2,0.4,0.6,1' expects the grid to have 4 bins whose bin limits will be (0-0.2), (0.2-0.4),
(0.4-0.6) and (0.6,1)
- ';': The semicolon ';' constructs n-dimensional bin limits (NDBL) from a cartesian product of 1DBL.
The 1DBL are listed by in a consecutive dimensional order separated by a single ';'.
Examples
-------------
- '0,0.5,1;0,1,2' expects the grid to have 4 bins, whose 2DBL will be are: (0-0.5;0-1), (0-0.5;1-2), (0.5-1;0-1) and
(0.5-1;1-2)
- '|': The pipe '|' constructs NDBL which change by a step in the previous dimension.
The previous operators are enough to contruct NDBL with differently-sized bins, but they can
not construct the following bin limits: (0-1;0-1), (0-1;1-2), (1-2;0-2), (1-2;2-4), (1-2;4-6); here
the 1DBL for the second dimension depend on the first dimension and also have a different number of
bins. For the first two bins the 1DBL is '0,1,2', but for the last three bins the 1DBL are
'0,2,4,6'. This can be achieved using the following remapping string: '0,1,2;0,1,2|0,2,4,6'. Note
that there have to be two 1DBL separated by '|', because the first dimension has two bins. If there
are more dimensions and/or bins, the number of 1DBL separated by '|' must match this number
accordingly. [some adjustments needed here, maybe]
Examples
-------------
'0,1,2;-2,0,2;0,1,2|1,2,3|2,3,4|3,4,5|4,5,6|5,6,7'. Here the third dimension has 6 1DBL separated
by '|' because the first dimension has 2 bins and the second dimension has 3 bins, so `6 = 2 * 3`.
If the 1DBL is an empty string, the previous 1DBL is repeated, for
example '0,1,2;0,1,2;0,1,2||0,2,4' is shorthand for '0,1,2;0,1,2;0,1,2|0,1,2|0,2,4'
[etc.]
For me, it's perfect: the concept is a bit tough, but there is nothing we can do, except explaining as good as possible, and I'd say it's well explained.
Unfortunately now it's a whole bunch of text, so I'd like to have more formatting options (like markdown) to make it more readable.
clap
doesn't offer anything like that, so the only other option might be to move it to an external source (most likely the best would be extremely simple as md, or readable as html), with the option to keep it embedded (e.g. like rust docs in rustup
), or move somewhere else (e.g. online and link it).
Separating the examples from the description is a very good idea and formatting it better also. As far as I understand the text is simply dumped after the help, but I'm looking into it the clap documentation to see if there's anything better...
Separating the examples from the description is a very good idea and formatting it better also. As far as I understand the text is simply dumped after the help, but I'm looking into it the clap documentation to see if there's anything better...
I already had a look some weeks ago, but that's all they offer (and as you said it's simply a print after the generated help, or before with the partner function).
There are libraries that output all termcodes for formatting, like https://github.com/Canop/termimad (in python we use https://github.com/Textualize/rich), but I don't know if it's a good idea to pass a string with termcodes to clap
.
or move somewhere else (e.g. online and link it)
that might actually be the best option ... since online we can have infinite format
PS: and it would solve the duplication issue :upside_down_face:
PS: and it would solve the duplication issue upside_down_face
Duplication issue still remains, because it was mainly testing the part generated by clap
.
To be honest, I like a lot that with rustup
you can bring up docs offline, but maybe the solution would be to simply use docstrings (and document pineappl_cli
crate), and then we can both link https://docs.rs, and you can also obtain offline with cargo doc --open
.
Still, having it in terminal is much more direct...
We probably want this: https://github.com/clap-rs/clap/issues/2389. I can try the verbatim_doc_comment
attribute in the meantime. Moving the documentation to somewhere else online might also be a good idea.
I believe we already got something pretty good, what do you think is still missing for this? @cschwan
We could do a much better job if clap
exposed more tools, but given the current status I believe is fine.
The other option of documenting pineappl_cli
crate (and publish documentation) is always available, though I consider that it is not needed specifically for this issue (but we could want it in general).
In case, let's open another issue, but I'm not advocating for it: for me it's fine as it is.
@AleCandido I agree, let's close this Issue.
To continue the APPL comparison I will need to renormalize the grids - I saw that the
remap
command seems to be able to do such a thing, but I don't know how to use it; i.e. what is the "remap string" - can you provide us with an example?