Whitespace and newlines in coordinate sections between content and $end are not stripped

dev-cafe / parselglossy

Generic input parsing library, speaking in tongues.

MIT License


parselglossy version: 0.7.0
Python version: 3.9.2
Operating System: MacOS
Context: MRChem

Description

When parsing coordinate sections, such as those used to specify atomic coordinates

$coords
He 0.0 0.0 0.0
$end

or solvation cavity spheres

$spheres
0.0 0.0 0.0 4.0
$end

the parser ignores all whitespace and newlines between the opening tag (e.g. $coords) and the actual content, but does not ignore whitespace and newlines between the content and $end. As a result, the following three sections are not parsed identically:

$coords
He 0.0 0.0 0.0
$end

$coords
He 0.0 0.0 0.0
               $end

$coords
He 0.0 0.0 0.0$end

These result in the following strings, respectively:

"He 0.0 0.0 0.0\n"
"He 0.0 0.0 0.0\n               "
"He 0.0 0.0 0.0"

The expected output for all three is (at least to me) the last one. The current behavior becomes a problem when the user indents these sections (which is very common) and some kind of sanity check is then performed on the data. Consider the following:

lines = user_dict['Molecule']['coords'].splitlines()
print(lines)

For the three examples, this prints:

["He 0.0 0.0 0.0"]
["He 0.0 0.0 0.0", "              "]
["He 0.0 0.0 0.0"]

The middle example produces a spurious list element that contains only whitespace. Calling strip() before splitlines() fixes the issue, but parselglossy should probably strip the extra whitespace under the hood.
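
As a minimal sketch of the workaround, the snippet below applies strip() before splitlines() to the three raw strings from this issue. The names raw_sections, naive_lines and clean_lines are illustrative only and are not part of parselglossy or MRChem.

```python
# The three raw strings reported above, as they would come out of the parser.
raw_sections = [
    "He 0.0 0.0 0.0\n",
    "He 0.0 0.0 0.0\n" + " " * 15,  # indented $end leaves trailing spaces
    "He 0.0 0.0 0.0",
]

# Splitting directly keeps the whitespace-only trailing line of the second case.
naive_lines = [raw.splitlines() for raw in raw_sections]

# Stripping first removes leading/trailing whitespace and newlines,
# so all three cases collapse to the same single line.
clean_lines = [raw.strip().splitlines() for raw in raw_sections]

for naive, clean in zip(naive_lines, clean_lines):
    print(naive, "->", clean)
```

Note that strip() only touches the two ends of the block, so any indentation on interior lines is left alone.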

TL;DR: this is by design. The "contract" between parselglossy and its users is that parselglossy won't touch what's between $<name> and $end.

This is where and how the parsing token is defined for those kinds of parameters: https://github.com/dev-cafe/parselglossy/blob/master/parselglossy/grammars/atoms.py#L89-L95

I cannot find the issue where we discussed this (it might be in a thread on some Zulip channel) but the $<name>/$end parameters are by design escape hatches to pass untyped information verbatim past the input parser and into the final dictionary. The idea was to keep the grammar simple and avoid type-checking for things that the developers using parselglossy know how to read better than we could. Preserving indentation might be one of the use cases for this: it is a weird requirement in the context of parsing molecular geometries, but it might be essential somewhere else.
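
To make the contract concrete, here is a rough stand-in written with the standard-library re module. It is not parselglossy's grammar (that is defined in the atoms.py file linked above and is not reproduced here). The pattern RAW_SECTION and the function extract_raw_sections are assumptions, chosen only to mimic the behavior reported in this issue: whitespace after the opening tag is skipped, and everything up to $end is passed through verbatim.

```python
import re

# Hypothetical stand-in pattern, NOT the actual parselglossy token:
# match $<name>, skip any whitespace after it, then lazily capture
# everything (including newlines) up to the first $end.
RAW_SECTION = re.compile(r"\$(\w+)\s*(.*?)\$end", re.DOTALL)


def extract_raw_sections(text: str) -> dict[str, str]:
    """Return {section name: raw, untouched content} for each $<name>/$end block."""
    return {m.group(1): m.group(2) for m in RAW_SECTION.finditer(text)}


indented = "$coords\nHe 0.0 0.0 0.0\n" + " " * 15 + "$end"
print(repr(extract_raw_sections(indented)["coords"]))
```

Under this reading, cleaning up the content (for instance with strip() or textwrap.dedent()) is left to the code that consumes the final dictionary, since only that code knows whether the whitespace is meaningful.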


Whitespace and newlines in coordinate sections between content and $end are not stripped #110
