Open nsmith- opened 1 year ago
I propose defining a grammar that can then be used to parse datacards into in-memory structures. I'm somewhat familiar with PEG grammars (see references below), though there are many other grammar formalisms to choose from.
Some references:
- https://bford.info/pub/lang/peg.pdf
- https://peps.python.org/pep-0617/#overview
- https://github.com/yhirose/cpp-peglib (this is what is used in correctionlib for the formulas)
- https://lark-parser.readthedocs.io/en/latest/
A WIP implementation is shown below (but it could probably be expressed much better)
    EXPRESSION       <- CountChecks ShapeSources? Observation Expectation Systematics
    CountChecks      <- CountCheck{3} SectionSeparator
    ShapeSources     <- ShapeSource+ SectionSeparator
    Observation      <- BinList 'observation' (Space+ Float)* EndOfLine SectionSeparator
    Expectation      <- BinList ProcessList ProcessIndexList RateList SectionSeparator
    Systematics      <- Systematic* SectionSeparator RateParam*

    CountCheck       <- CountType Space (Integer / '*') (!EndOfLine .)* EndOfLine
    ShapeSource      <- 'shapes' Space+ ProcessMatch Space+ ChannelMatch Space+ FileName Space+ HistName (Space+ HistName)? RestOfLine
    BinList          <- 'bin' Space+ (ChannelName (Space+ ChannelName)*) RestOfLine
    ProcessList      <- 'process' (Space+ ProcessName)* RestOfLine
    ProcessIndexList <- 'process' (Space+ Integer)* RestOfLine
    RateList         <- 'rate' (Space+ Float)* RestOfLine
    Systematic       <- SystName Space+ SystType (Space+ SystEffect)* RestOfLine
    RateParam        <- SystName Space+ 'rateParam' Space+ ChannelMatch Space+ ProcessMatch Space+ (Param / Formula) RestOfLine

    ~EndOfLine       <- '\r\n' / '\n' / '\r'
    ~RestOfLine      <- (Space* (EndOfLine / Comment))*
    ~Space           <- ' ' / '\t'
    Comment          <- '#' (!EndOfLine .)* EndOfLine
    CountType        <- < [ijk] 'max' >
    Integer          <- < '-'? [0-9]+ >
    Float            <- < '-'? [0-9]+ ('.' [0-9]*)? >
    ~SectionSeparator <- ('-'+ RestOfLine)?
    ProcessMatch     <- < ProcessName / '*' >
    ChannelMatch     <- < ChannelName / '*' >
    ProcessName      <- < [a-zA-Z0-9_]+ >
    ChannelName      <- < [a-zA-Z0-9_]+ >
    FileName         <- < [^ \t]+ >
    HistName         <- < [a-zA-Z0-9_/]+ >
    SystName         <- < [a-zA-Z0-9_]+ >
    SystType         <- < 'lnN' | 'lnU' | 'trG' >
    SystEffect       <- < (Float '/' Float) / Float / '-' >
    Param            <- Float (Space* '[' Space* Float Space* ',' Space* Float Space* ']')?
    Formula          <- (QuotedString / String) (Space+ Param)*
    String           <- < (!Space .)* >
    QuotedString     <- < '"' (!'"' ('\\"' / .))* '"' >
You can try this grammar out online at https://yhirose.github.io/cpp-peglib/
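To make "in-memory structures" a bit more concrete, here is a rough sketch of what the `Expectation` block (`BinList`, `ProcessList`, `ProcessIndexList`, `RateList`) might be parsed into. This is not a PEG parser: it is a hand-written, standard-library-only line splitter that only mirrors those four rules. The names `Expectation`, `parse_expectation`, `_fields` and `EXAMPLE` are hypothetical, and the column-count check is an extra assumption that the grammar itself does not enforce.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Expectation:
    """Hypothetical in-memory form of the bin / process / process / rate block."""
    bins: List[str]
    processes: List[str]
    process_indices: List[int]
    rates: List[float]


def _fields(line: str, keyword: str) -> List[str]:
    """Drop a trailing '#' comment, check the leading keyword, return the remaining tokens."""
    tokens = line.split("#", 1)[0].split()
    if not tokens or tokens[0] != keyword:
        raise ValueError(f"expected a {keyword!r} line, got: {line!r}")
    return tokens[1:]


def parse_expectation(text: str) -> Expectation:
    # Skip blank lines, comment-only lines and '-----' section separators.
    lines = [
        ln for ln in text.splitlines()
        if ln.strip() and not ln.lstrip().startswith(("#", "-"))
    ]
    if len(lines) != 4:
        raise ValueError(f"expected 4 content lines, got {len(lines)}")

    bins = _fields(lines[0], "bin")
    processes = _fields(lines[1], "process")
    # int() and float() are more permissive than the Integer and Float rules.
    indices = [int(tok) for tok in _fields(lines[2], "process")]
    rates = [float(tok) for tok in _fields(lines[3], "rate")]

    # Assumption: every line has one column per (bin, process) pair.
    if not (len(bins) == len(processes) == len(indices) == len(rates)):
        raise ValueError("column counts differ between lines")
    return Expectation(bins, processes, indices, rates)


EXAMPLE = """\
bin          ch1   ch1
process      sig   bkg
process      0     1
rate         1.5   10.0   # expected yields
---------------------------
"""

card = parse_expectation(EXAMPLE)
print(card)
```

With a real PEG engine, the same dataclass could presumably be filled from semantic actions attached to the corresponding rules instead of from string splitting.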