PKua007 commented 1 year ago

A consistent and convenient interface for the input file should be introduced. This also includes a generic parser in the source code.

PKua007 commented 1 year ago

Case study

As an example, we will look at specifying two bulk observables.

Current syntax

bulkObservables = pairAveragedCorrelation 5 100 S110 primary layerwiseRadial 6.0.0 o , \
                  densityHistogram n_bins 0 100 100 tracker fourierTracker 0 2 1 primaryAxis x

Currently, observables are separated using ,. First comes the name, then the arguments. The arguments are in different formats.

For pairAveragedCorrelation, the first argument is the maximal distance 5, then the number of bins 100, then the averaged function S110 primary (which is the $S_{110}$ correlation for the primary axes), and finally the binning specification layerwiseRadial 6.0.0 o (where 6.0.0 are the Miller indices describing the layers and o is the geometric origin used as the focal point).

densityHistogram uses a slightly different syntax: densityHistogram n_bins ... tracker ..., where n_bins and tracker are named fields, each followed by its arguments, and the fields can appear in an arbitrary order. The arguments of n_bins are the numbers of bins in x, y and z. tracker is fourierTracker, whose arguments are 0 2 1 (wavenumbers) and primaryAxis x (the x coordinate of the primary axis as the function).

Pros:
- short and concise notation
- backwards compatible
Cons:
- may be hard to read, especially without explicit nesting
- parsing has to be done mostly manually, including validation if all arguments are given, etc.

It may be improved by requiring the syntax from densityHistogram for all multivalued fields:

bulkObservables = pairAveragedCorrelation max_r 5 n_bin 100 function S110 primary binning layerwiseRadial hkl 6.0.0 focal_point o , \
                  densityHistogram n_bins 0 100 100 tracker fourierTracker wavenumbers 0 2 1 function primaryAxis x

but the problem of ambiguous nesting persists.
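The nesting problem can be illustrated with a minimal sketch of a flat named-field splitter. The function `split_named_fields` and the sets of known keys are hypothetical and only serve the illustration; they are not the current parser.

```python
def split_named_fields(tokens, known_keys):
    """Group a flat token list into {key: [values...]} using the known key names."""
    fields = {}
    current = None
    for token in tokens:
        if token in known_keys:
            if token in fields:
                raise ValueError(f"duplicate field: {token}")
            current = token
            fields[current] = []
        elif current is None:
            raise ValueError(f"value {token!r} appears before any field name")
        else:
            fields[current].append(token)
    return fields


line = "n_bins 0 100 100 tracker fourierTracker wavenumbers 0 2 1 function primaryAxis x"

# Only n_bins and tracker are known: everything after "tracker" belongs to it.
nested = split_named_fields(line.split(), {"n_bins", "tracker"})

# Assumption: the observable also had its own "function" field. The same tokens
# are now split differently, and "function" is silently taken away from the tracker.
flat = split_named_fields(line.split(), {"n_bins", "tracker", "function"})

print(nested["tracker"])
print(flat["tracker"], flat["function"])
```

Without explicit delimiters, which reading is correct depends entirely on which key names the parser happens to know at each level.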

Improved syntax

The syntax can be improved by introducing nesting inspired by Python functions. Take as an example:

def func(a, b, c, d):
    pass

# valid invocations:
func(0, 1, 2, 3)
func(a=0, b=1, c=2, d=3)
func(0, 1, d=3, c=2)

Using it as guidance, an improved syntax can be devised:

bulkObservables = [
  pairAveragedCorrelation{max_r=5, bin_n=100, function=S110{axis=primary}, binning=layerwiseRadial{hkl=6.0.0, focal_point=o}},
  densityHistogram{n_bins=0 100 100, tracker=fourierTracker{n=0 2 1, function=primaryAxis{coord=x}}}
]

All key names may be skipped if the values are given in the correct order, or only some of them may be omitted:

bulkObservables = [
  pairAveragedCorrelation{5, 100, S110{primary}, layerwiseRadial{6.0.0, o}},
  densityHistogram{0 100 100, fourierTracker{0 2 1, primaryAxis{x}}}
]

Pros:
- very clear
- still quite short and concise
- each entry like pairAveragedCorrelation can be predefined for the parser, which will automatically report some errors
- still mostly backwards compatible
Cons:
- the parser has to be written manually
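The automatic error reporting for predefined entries could work along these lines. This is a minimal sketch, assuming each entry is described only by an ordered list of parameter names; `bind_arguments` is a hypothetical helper that follows the Python calling rules shown above (positional values first, then named ones).

```python
def bind_arguments(name, parameters, positional, keyword):
    """Match positional and named values against an ordered parameter list."""
    if len(positional) > len(parameters):
        raise ValueError(
            f"{name}: expected at most {len(parameters)} arguments, got {len(positional)}"
        )
    # Positional values fill the parameters from the left
    bound = dict(zip(parameters, positional))
    for key, value in keyword.items():
        if key not in parameters:
            raise ValueError(f"{name}: unknown argument '{key}'")
        if key in bound:
            raise ValueError(f"{name}: argument '{key}' given twice")
        bound[key] = value
    missing = [p for p in parameters if p not in bound]
    if missing:
        raise ValueError(f"{name}: missing arguments: {', '.join(missing)}")
    return bound


# layerwiseRadial{6.0.0, focal_point=o}: one positional value, one named value
binning = bind_arguments(
    "layerwiseRadial", ["hkl", "focal_point"], ["6.0.0"], {"focal_point": "o"}
)
print(binning)
```

Missing, unknown and duplicated arguments are then reported in one place, with the entry name in the message, instead of in each observable's hand-written parsing code.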

YAML

YAML can be used:

bulkObservables:
- pairAveragedCorrelation:
    max_r: 5
    bin_n: 100
    function:
      S110:
        axis: primary
    binning:
      layerwiseRadial:
        hkl: 6.0.0
        focal_point: o
- densityHistogram:
    n_bins: 0 100 100
    tracker:
      fourierTracker:
        n: 0 2 1
        function:
          primaryAxis:
            coord: x

Pros:
- standardized, well-known format
- existing convenient libraries for parsing
- fairly readable
Cons:
- not backwards compatible - all existing input files have to be rewritten
- very verbose
- validation has to be done mostly manually
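To give an idea of the manual validation, here is a minimal sketch that works on the nested dictionaries a YAML library would typically return for the first entry above. The `SCHEMA` table and `unpack_entry` are hypothetical names, and the dictionary is written by hand here rather than loaded from a file.

```python
# Hypothetical schema: entry name -> set of required argument names
SCHEMA = {
    "pairAveragedCorrelation": {"max_r", "bin_n", "function", "binning"},
    "S110": {"axis"},
    "layerwiseRadial": {"hkl", "focal_point"},
}


def unpack_entry(entry):
    """Split a single-key mapping {name: arguments} and check the argument names."""
    if not isinstance(entry, dict) or len(entry) != 1:
        raise ValueError("expected a mapping with exactly one key (the entry name)")
    (name, arguments), = entry.items()
    if name not in SCHEMA:
        raise ValueError(f"unknown entry: {name}")
    if not isinstance(arguments, dict):
        raise ValueError(f"{name}: arguments must be a mapping")
    missing = sorted(SCHEMA[name] - set(arguments))
    unknown = sorted(set(arguments) - SCHEMA[name])
    if missing or unknown:
        raise ValueError(f"{name}: missing {missing}, unknown {unknown}")
    return name, arguments


# Hand-written stand-in for the loaded pairAveragedCorrelation entry
loaded = {
    "pairAveragedCorrelation": {
        "max_r": 5,
        "bin_n": 100,
        "function": {"S110": {"axis": "primary"}},
        "binning": {"layerwiseRadial": {"hkl": "6.0.0", "focal_point": "o"}},
    }
}

name, arguments = unpack_entry(loaded)
binning_name, binning_arguments = unpack_entry(arguments["binning"])
print(name, binning_name, binning_arguments)
```

The library only handles the syntax. Checks such as "exactly one key per entry" and "all arguments present" still have to be written for every level of nesting.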

misiekc commented 1 year ago

If changes are needed, I would go with some well-known format (YAML, JSON, etc.). Backward compatibility can be kept, as a separate parser can be chosen according to the file extension (or there will be a mechanism for converting from the present format to the new one).

PKua007 commented 1 year ago

A conversion mechanism is a good idea; I will integrate it if breaking changes are introduced.

I am a bit worried that formats like YAML will be quite verbose, leading to worse readability anyway. Can you propose a more concise syntax for the given example parameters using YAML, or do you perhaps know a format better suited to our needs?

And maybe this INI extension which looks like Python can be considered as well known? It can be made even more Python-like by replacing {...} with (...) and rewriting all space-separated fields like 0 100 100 as Python-like arrays [0, 100, 100].

PKua007 commented 1 year ago

To make it clear, it will then look like this

bulkObservables = [
  pairAveragedCorrelation(max_r=5, bin_n=100, function=S110(axis=primary), binning=layerwiseRadial(hkl="6.0.0", focal_point="o")),
  densityHistogram(n_bins=[0, 100, 100], tracker=fourierTracker(n=[0, 2, 1], function=primaryAxis(coord="x")))
]

Things like pairAveragedCorrelation(...) look just like class constructors, which actually describes them well: each translates to creating a PairAveragedCorrelation, which implements BulkObservable. It can even be made 100% identical with future Python bindings.

//edit: Actually, the above code is 100% valid Python code. The custom parser could be replaced in the future by invoking the Python interpreter.
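As a minimal sketch of that idea, Python's standard `ast` module can parse such an expression without executing it. The `convert` and `parse_value` functions and the output layout (`name`, `args`, `kwargs`) are assumptions made for this illustration; only calls of plain names, lists, simple constants and bare names are accepted, so for example negative numbers would need an extra case.

```python
import ast


def convert(node):
    """Turn a restricted Python expression tree into plain dicts and lists."""
    if isinstance(node, ast.Call):
        if not isinstance(node.func, ast.Name):
            raise ValueError("only plain names may be called")
        if any(keyword.arg is None for keyword in node.keywords):
            raise ValueError("**kwargs unpacking is not allowed")
        return {
            "name": node.func.id,
            "args": [convert(arg) for arg in node.args],
            "kwargs": {kw.arg: convert(kw.value) for kw in node.keywords},
        }
    if isinstance(node, ast.List):
        return [convert(element) for element in node.elts]
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float, str)):
        return node.value
    if isinstance(node, ast.Name):
        return node.id  # bare word such as `primary`
    raise ValueError(f"unsupported syntax: {type(node).__name__}")


def parse_value(text):
    # mode="eval" accepts a single expression; nothing is executed
    return convert(ast.parse(text, mode="eval").body)


source = (
    'densityHistogram(n_bins=[0, 100, 100], '
    'tracker=fourierTracker(n=[0, 2, 1], function=primaryAxis(coord="x")))'
)
observable = parse_value(source)
print(observable["name"], observable["kwargs"]["n_bins"])
```

Anything outside the whitelist, such as attribute access or arithmetic, is rejected with an error instead of being evaluated.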

misiekc commented 1 year ago

In my opinion, the Python-like convention looks best, especially if a parser is available. Otherwise, I don't think it is worth the effort.

PKua007 commented 1 year ago

There are Python parsing libraries for C++. There are also general libraries for parsing using a BNF grammar. I am not sure which option is more convenient. A Python parser accepts any Python code, so additional AST validation and a lot of conversion is needed. The other option requires devising a BNF grammar and, once again, converting the AST, but less extensively.

There is also a manual option: writing a recursive descent parser for a simple grammar isn't that hard, and it gives full control over error reporting, etc.
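For a sense of scale, here is a minimal recursive descent sketch for the Python-like syntax, written in Python for brevity. The grammar (value := call | list | string | number | name), the class names and the output layout are assumptions made for this illustration, not a proposed implementation.

```python
import re

TOKEN = re.compile(
    r'\s*(?:(?P<number>-?\d+(?:\.\d+)?)|(?P<name>[A-Za-z_]\w*)'
    r'|"(?P<string>[^"]*)"|(?P<punct>[()\[\],=]))'
)


class ParseError(ValueError):
    pass


class Parser:
    def __init__(self, text):
        self.text, self.pos = text, 0
        self.kind = self.value = None
        self.advance()

    def advance(self):
        """Read the next token; kind becomes None at the end of input."""
        if self.text[self.pos:].strip() == "":
            self.kind = self.value = None
            self.pos = len(self.text)
            return
        match = TOKEN.match(self.text, self.pos)
        if match is None:
            raise ParseError(f"unexpected character near position {self.pos}")
        self.kind = match.lastgroup
        self.value = match.group(self.kind)
        self.pos = match.end()

    def at(self, punct):
        return self.kind == "punct" and self.value == punct

    def expect(self, punct):
        if not self.at(punct):
            raise ParseError(f"expected '{punct}' before position {self.pos}")
        self.advance()

    def parse_value(self):
        kind, text = self.kind, self.value
        if kind == "punct" and text == "[":
            return self.parse_list()
        if kind not in ("number", "string", "name"):
            raise ParseError(f"expected a value before position {self.pos}")
        self.advance()
        if kind == "number":
            return float(text) if "." in text else int(text)
        if kind == "name" and self.at("("):
            return self.parse_call(text)
        return text  # a string literal or a bare word

    def parse_list(self):
        self.expect("[")
        items = []
        while not self.at("]"):
            items.append(self.parse_value())
            if not self.at("]"):
                self.expect(",")
        self.expect("]")
        return items

    def parse_call(self, name):
        self.expect("(")
        args, kwargs = [], {}
        while not self.at(")"):
            was_name = self.kind == "name"
            value = self.parse_value()
            if was_name and isinstance(value, str) and self.at("="):
                self.advance()  # `value` was a key, the real value follows
                if value in kwargs:
                    raise ParseError(f"{name}: argument '{value}' given twice")
                kwargs[value] = self.parse_value()
            elif kwargs:
                raise ParseError(f"{name}: positional argument after a named one")
            else:
                args.append(value)
            if not self.at(")"):
                self.expect(",")
        self.expect(")")
        return {"name": name, "args": args, "kwargs": kwargs}


def parse(text):
    parser = Parser(text)
    result = parser.parse_value()
    if parser.kind is not None:
        raise ParseError(f"unexpected trailing input before position {parser.pos}")
    return result


parsed = parse('pairAveragedCorrelation(5, 100, function=S110(axis=primary), '
               'binning=layerwiseRadial(hkl="6.0.0", focal_point="o"))')
print(parsed["args"], sorted(parsed["kwargs"]))
```

Every error is raised at a point where the parser knows which entry it is inside and where it is in the input, which is what makes custom error messages straightforward in this approach.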

PKua007 / rampack

Proper, convenient input file interface #10
