cms-analysis / HiggsAnalysis-CombinedLimit

CMS Higgs Combination toolkit.
https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/latest
Apache License 2.0

Define formal grammar for datacard #818

Open nsmith- opened 1 year ago

nsmith- commented 1 year ago

I propose defining a formal grammar for datacards that could then be used to parse them into in-memory structures. I'm somewhat familiar with PEG grammars (see references below), but there are many other formalisms to choose from.
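To make "in-memory structures" concrete, here is a minimal Python sketch of what a parser could produce. The class and field names (`Datacard`, `Systematic`, `columns`, and so on) are hypothetical and chosen only for illustration; nothing here is an existing Combine API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union

# Hypothetical type: None stands for '-', a float for a single-valued
# effect, and a pair of floats for an 'a/b' entry.
Effect = Optional[Union[float, Tuple[float, float]]]


@dataclass
class Systematic:
    name: str                      # e.g. "lumi"
    kind: str                      # e.g. "lnN"
    effects: List[Effect] = field(default_factory=list)  # one per column


@dataclass
class Datacard:
    bins: List[str]                # channel name of each column
    processes: List[str]           # process name of each column
    process_ids: List[int]         # process index of each column
    rates: List[float]             # expected rate of each column
    observations: Dict[str, float] = field(default_factory=dict)
    systematics: List[Systematic] = field(default_factory=list)

    def columns(self) -> List[Tuple[str, str, int, float]]:
        """Return (bin, process, index, rate) per column, checking lengths."""
        n = len(self.bins)
        if not (len(self.processes) == len(self.process_ids) == len(self.rates) == n):
            raise ValueError("bin, process and rate rows must have equal length")
        return list(zip(self.bins, self.processes, self.process_ids, self.rates))


card = Datacard(
    bins=["ch1", "ch1"],
    processes=["sig", "bkg"],
    process_ids=[0, 1],
    rates=[1.5, 10.0],
    observations={"ch1": 12.0},
    systematics=[Systematic("lumi", "lnN", [1.025, 1.025])],
)
```

Keeping the column-wise rows as parallel lists mirrors the layout of the card itself, and the length check in `columns()` is the kind of consistency test a grammar alone cannot express.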

Some references:
https://bford.info/pub/lang/peg.pdf
https://peps.python.org/pep-0617/#overview
https://github.com/yhirose/cpp-peglib (this is what is used in correctionlib for the formulas)
https://lark-parser.readthedocs.io/en/latest/

A WIP implementation is shown below (it could probably be expressed much better):

EXPRESSION  <- CountChecks ShapeSources? Observation Expectation Systematics
CountChecks <- CountCheck{3} SectionSeparator
ShapeSources <- ShapeSource+ SectionSeparator
Observation <- BinList 'observation' (Space+ Float)* EndOfLine SectionSeparator
Expectation <- BinList ProcessList ProcessIndexList RateList SectionSeparator
Systematics <- Systematic* SectionSeparator RateParam*

CountCheck  <- CountType Space+ (Integer / '*') (!EndOfLine .)* EndOfLine

ShapeSource <- 'shapes' Space+ ProcessMatch Space+ ChannelMatch Space+ FileName Space+ HistName (Space+ HistName)? RestOfLine

BinList <- 'bin' Space+ (ChannelName (Space+ ChannelName)*) RestOfLine
ProcessList <- 'process' (Space+ ProcessName)* RestOfLine
ProcessIndexList <- 'process' (Space+ Integer)* RestOfLine
RateList <- 'rate' (Space+ Float)* RestOfLine

Systematic <- SystName Space+ SystType (Space+ SystEffect)* RestOfLine

RateParam <- SystName Space+ 'rateParam' Space+ ChannelMatch Space+ ProcessMatch Space+ (Param / Formula) RestOfLine

~EndOfLine   <- '\r\n' / '\n' / '\r'
~RestOfLine <- (Space* (EndOfLine / Comment))*
~Space  <- ' ' / '\t'
Comment     <- '#' (!EndOfLine .)* EndOfLine
CountType   <- < [ijk] 'max' >
Integer     <- < '-'? [0-9]+ >
Float       <- < '-'? [0-9]+ ('.' [0-9]*)? >
~SectionSeparator <- ('-'+ RestOfLine)?

ProcessMatch <- < ProcessName / '*' >
ChannelMatch <- < ChannelName / '*' >
ProcessName <- < [a-zA-Z0-9_]+ >
ChannelName <- < [a-zA-Z0-9_]+ >
FileName <- < [^ \t]+ >
HistName <- < [a-zA-Z0-9_/]+ >
SystName <- < [a-zA-Z0-9_]+ >
SystType <- < 'lnN' | 'lnU' | 'trG' >
SystEffect <- < (Float '/' Float) / Float / '-' >
Param <- Float (Space* '[' Space* Float Space* ',' Space* Float Space* ']')?
Formula <- (QuotedString / String) (Space+ Param)*
String <- < (!Space !EndOfLine .)+ >
QuotedString  <- < '"' (!'"' ('\\"' / .))* '"' >

You can try this grammar out online at https://yhirose.github.io/cpp-peglib/
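For readers who want to see what a single rule accepts without loading the full grammar, the sketch below transcribes `Systematic`, `SystType`, `SystEffect` and `Float` into a Python regular expression. It is a hand-written approximation using only the standard library, not output from cpp-peglib, and `parse_systematic` is a hypothetical helper name.

```python
import re

# Float <- '-'? [0-9]+ ('.' [0-9]*)?
FLOAT = r"-?[0-9]+(?:\.[0-9]*)?"
# SystEffect <- (Float '/' Float) / Float / '-'
SYST_EFFECT = rf"(?:{FLOAT}/{FLOAT}|{FLOAT}|-)"
# Systematic <- SystName Space+ SystType (Space+ SystEffect)* RestOfLine
SYSTEMATIC = re.compile(
    rf"(?P<name>[a-zA-Z0-9_]+)[ \t]+(?P<type>lnN|lnU|trG)"
    rf"(?P<effects>(?:[ \t]+{SYST_EFFECT})*)[ \t]*(?:#.*)?$"
)


def parse_systematic(line):
    """Parse one systematic line into (name, type, effects)."""
    match = SYSTEMATIC.match(line)
    if match is None:
        raise ValueError(f"not a valid systematic line: {line!r}")
    effects = []
    for token in match.group("effects").split():
        if token == "-":
            effects.append(None)                 # no effect on this column
        elif "/" in token:
            first, second = token.split("/")
            effects.append((float(first), float(second)))
        else:
            effects.append(float(token))
    return match.group("name"), match.group("type"), effects


result = parse_systematic("lumi lnN 1.025 - 0.9/1.1  # luminosity")
# result == ("lumi", "lnN", [1.025, None, (0.9, 1.1)])
```

Two things this makes visible: the `Float` rule as written rejects exponent notation such as `1.0e-3`, and in the PEG version the ordered choice in `SystEffect` needs the `Float '/' Float` alternative first, because `Float` alone would otherwise succeed on the first half of an `a/b` entry and never try the pair.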