Open Seb35 opened 6 years ago
@promethe42: it would be great if you could study it so that we can discuss it. @mdamien: FYI.
Also, I had to handle this same issue for alineas in metslesliens. It was a bit easier because the space is smaller than in DuraLex. I solved it with an accumulate-collect mechanism at each scale; the depth of this structure is 3. The rules are defined here (with their behaviour relative to other rules) and the Parsimonious visitor is here.
The conversion to PEGs becomes difficult because we have to choose a (good) design to manage lists and combine them (properly) with hierarchical items. I mean expressions like "Au deuxième alinéa, à la troisième phrase du quatrième alinéa", which lists two references, the second of which is itself hierarchical (a sentence inside an alinea). A good real-world stress test is http://www.assemblee-nationale.fr/15/textes/0911.asp#D_Article_11.
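To make the list-versus-hierarchy distinction concrete, here is a minimal stdlib-only sketch (not the DuraLex or metslesliens grammar). It assumes a tiny hypothetical vocabulary, treats commas as list separators and "du / de la / de l'" as hierarchy connectors, and returns one path per reference, outermost item first. The names `ORDINALS`, `TYPES` and `parse_references` are invented for illustration.

```python
import re

# Hypothetical, deliberately tiny vocabulary; a real grammar covers many more forms.
ORDINALS = {"premier": 1, "première": 1, "deuxième": 2, "troisième": 3, "quatrième": 4}
TYPES = {"mot": "word", "phrase": "sentence", "alinéa": "alinea", "article": "article"}

LEADING_PREPOSITION = re.compile(r"^(?:au\s+|à\s+la\s+|à\s+l['’]\s*)", re.IGNORECASE)
HIERARCHY_CONNECTOR = re.compile(r"\s+(?:du|de\s+la)\s+|\s+de\s+l['’]\s*")
ITEM = re.compile(r"(?P<ord>\w+)\s+(?P<type>\w+)")


def parse_references(text):
    """Return a list of references; each reference is a path of (type, order)
    pairs ordered from the outermost item to the innermost one."""
    references = []
    for part in text.split(","):  # commas separate list items
        part = LEADING_PREPOSITION.sub("", part.strip())
        path = []
        for segment in HIERARCHY_CONNECTOR.split(part):  # "X du Y": X is inside Y
            match = ITEM.fullmatch(segment.strip())
            if match is None:
                raise ValueError(f"unrecognised segment: {segment!r}")
            path.append((TYPES[match["type"]], ORDINALS[match["ord"].lower()]))
        # French states the innermost item first; store the outermost first.
        references.append(list(reversed(path)))
    return references


refs = parse_references("Au deuxième alinéa, à la troisième phrase du quatrième alinéa")
```

For this expression the sketch yields two references: the path `[("alinea", 2)]` and the path `[("alinea", 4), ("sentence", 3)]`. Real texts use commas and connectors far less regularly, which is where the design question comes from.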
Legacy DuraLex creates this tree:
The first version of ToSemanticTreeVisitor (now in its own file) creates this tree:
With b2f13ab I tried a new design for ToSemanticTreeVisitor. It works at small scale in my experiments, and it creates:
There is currently a small (easy to solve) issue: it creates an untyped container node even for a single child, which it shouldn't. I haven't tried it at a larger scale.
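One possible fix for the untyped-container issue is a post-processing pass that replaces any untyped node holding exactly one child with that child. The sketch below assumes nodes are plain dicts with optional `type` and `children` keys; the function name `collapse_untyped` and the sample node types are hypothetical, not taken from the commit.

```python
def collapse_untyped(node):
    """Return a copy of `node` in which every untyped container that has
    exactly one child is replaced by that child."""
    children = [collapse_untyped(child) for child in node.get("children", [])]
    if node.get("type") is None and len(children) == 1:
        # An untyped wrapper around a single child carries no information.
        return children[0]
    result = dict(node)
    if children:
        result["children"] = children
    return result


tree = {
    "type": "edit",
    "children": [
        {"type": None, "children": [{"type": "alinea-reference", "order": 2}]},
    ],
}
collapsed = collapse_untyped(tree)
```

Untyped containers with two or more children are left alone, since they still express a list.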
I am not sure whether this design is a good one. The difficulty is to arbitrate between creating a flat list and creating a hierarchical tree. Another possible design would be to take the canonical hierarchy (word < sentence < alinea < article) into account when merging child (Parsimonious) nodes.
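The hierarchy-aware alternative could look roughly like the sketch below. It assumes the child nodes have already been reduced to a flat, text-ordered list of `(type, order)` pairs, and it applies one simple rule: an item of strictly higher rank than the item before it becomes that item's parent, and any other item starts a new sibling. `RANK` and `merge_by_rank` are hypothetical names, and this is an illustration of the idea rather than what ToSemanticTreeVisitor does.

```python
# Canonical hierarchy: word < sentence < alinea < article.
RANK = {"word": 0, "sentence": 1, "alinea": 2, "article": 3}


def merge_by_rank(items):
    """Group a flat, text-ordered list of (type, order) pairs into nested nodes.

    French puts the innermost item first ("la troisième phrase du quatrième
    alinéa"), so an item ranked strictly higher than the previous top-level
    node adopts that node as its child; otherwise it starts a new sibling.
    """
    merged = []
    for kind, order in items:
        node = {"type": kind, "order": order, "children": []}
        if merged and RANK[kind] > RANK[merged[-1]["type"]]:
            node["children"].append(merged.pop())
        merged.append(node)
    return merged


# "Au deuxième alinéa, à la troisième phrase du quatrième alinéa"
flat = [("alinea", 2), ("sentence", 3), ("alinea", 4)]
forest = merge_by_rank(flat)
```

On this input the result is two top-level nodes: alinea 2, and alinea 4 containing sentence 3. Rank alone cannot tell "la troisième phrase du quatrième alinéa" apart from a genuine list such as "à la troisième phrase, au quatrième alinéa", so a rule like this would still need the separators from the parse to arbitrate.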
I think a reasonably good hierarchy is needed during parsing, before any DuraLex visitor runs, since visitors cannot re-create missing information. But some visitors will probably need to be adapted to handle both flat lists and hierarchical items.