Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License

Example "yaml.l" is buggy #158

Closed by geert56 1 year ago

geert56 commented 1 year ago

I ran the yaml lexer/parser produced by make yaml on some example files derived from the YAML Wikipedia page (https://en.wikipedia.org/wiki/YAML) and noticed some strange behavior. With the option SHOW_TOKENS set, I clearly see erroneous indentation tokens, and hence the final echoed YAML is incorrect.

Here is a very small example:

items:
    - part_no:   A4786
      descrip:   Water Bucket (Filled)
      price:     1.47
      quantity:  4

which produces:

"items": 
  - 
    "part_no": "A4786"
"descrip": "Water Bucket (Filled)"
"price": "1.47"
"quantity": "4"
--- 
--- 
...

genivia-inc commented 1 year ago

Hm, this looks like a bug in the part of the demo application that checks indents. The YAML renderer is fine, as you can verify with this example:

items:
    - { part_no:   A4786,
        descrip:   Water Bucket (Filled),
        price:     1.47,
        quantity:  4 }

Also this works:

items:
    -
      part_no:   A4786
      descrip:   Water Bucket (Filled)
      price:     1.47
      quantity:  4

geert56 commented 1 year ago

Correct, some variations definitely work. Overall, the yaml.l lexer/parser is rather shaky compared to js-yaml (node.js) or yamllint. As I already mentioned in another issue, YAML is tough. I am happy to discuss details by email. Thanks for looking into this; I'll close the issue.

geert56 commented 1 year ago

Closed.

genivia-inc commented 1 year ago

Hmmm, it's not the lexer but the logic behind the yaml rules that is the problem. These rules are complex, as you've also said. The problem can be fixed by recognizing the indent position after the - and before the map key, so that subsequent map keys are grouped together. I thought yaml would be a nice demo of reflex's ability to handle indentation. Getting the yaml rules implemented is a bit of a pain and results in convoluted code.
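
As a rough illustration of that idea (not the actual yaml.l code), the sketch below finds the column where the first map key starts after the "- " marker and treats a following line as part of the same entry when its indentation equals that column. The helper names column_after_dash, indent_of, and continues_entry are hypothetical, and the sketch assumes space-only indentation and ignores quoting and flow syntax.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only; none of these exist in yaml.l.

// 0-based column of the first non-space character, or npos for a blank line.
std::size_t indent_of(const std::string& line)
{
  return line.find_first_not_of(' ');
}

// 0-based column where the content after a leading "- " sequence marker
// starts, or npos when the line is not a sequence entry with inline content.
std::size_t column_after_dash(const std::string& line)
{
  std::size_t dash = indent_of(line);
  if (dash == std::string::npos || line[dash] != '-')
    return std::string::npos;
  // The marker must be followed by a space; a bare "-" has no inline content.
  if (dash + 1 >= line.size() || line[dash + 1] != ' ')
    return std::string::npos;
  return line.find_first_not_of(' ', dash + 1);
}

// True when next_line is indented to the column of the first key in
// entry_line, i.e. it would be grouped into the same map.
bool continues_entry(const std::string& entry_line, const std::string& next_line)
{
  std::size_t stop = column_after_dash(entry_line);
  assert(stop != std::string::npos && "entry_line must look like '- key: value'");
  return indent_of(next_line) == stop;
}
```

With the lines from the report, the key after the dash in "    - part_no:   A4786" starts at column 6, which is also the indentation of the "descrip", "price", and "quantity" lines, so under this simplified rule all four keys would belong to one map.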

geert56 commented 1 year ago

Might I suggest shying away from full YAML and perhaps using StrictYAML as the example?

geert56 commented 1 year ago

A great implementation of YAML is libfyaml. It passes the whole test suite and comes with some interesting tools.

genivia-inc commented 1 year ago

Thanks for the suggestions. I had in fact tested yaml.l with a number of realistic yaml examples, but perhaps the examples I had used were all strict(er) yaml or were sanitized to pass most yaml parsers. I don't recall where I found these. I do recall spending way more time on this example than I had anticipated (a few days, instead of a few hours at most that I usually need to get the job done + testing). I worked extensively with XML and JSON as well as CORBA and other (more ancient) exchange formats. Compared to those, yaml is terrible IMO. Sure, yaml is "human readable". I get that, but otherwise what's the point of it? And why make the syntax so lenient? Stricter rules help, not hamper.

genivia-inc commented 1 year ago

FYI. Here is a set of yaml tests that I ran as unit tests and to test the features when implementing yaml.l. There are tests for all (or almost all) of the various syntactic structures, except the case you've reported here. Back in 2020 I didn't find a suitable set of yaml test cases online to test against.

yamltests.zip

genivia-inc commented 1 year ago

Fixed the problem. This fix passes all YAML Wikipedia examples too.

Patch yaml.l:618 to insert:

    size_t level = 0;

yaml.l:640 insert:

      if (token == ';' || token == '=')
        next();
      if (token == '>')
      {
        next();
        ++level;
      }

yaml.l:671 insert:

        if (token == '&' || token == '*')
        {
          data.ref = string;
          next();
        }

yaml.l:688 insert:

          if (token == '<')
          {
            next();
            while (token == '<' && level)
            {
              next();
              --level;
            }
          }

genivia-inc commented 1 year ago

@geert56 wrote: "With the option SHOW_TOKENS set, I clearly see erroneous indentation tokens and hence the final echoed yaml is incorrect."

FYI. The indentation tokens are not erroneous! The parser inserts additional indentation positions with matcher().insert_stop(matcher().columno()). These indentation stops mark the start of a value with which subsequent yaml data on the lines below may align, so those lines do not produce new indents but align as expected.

It is all pretty clear in the yaml.l parser logic.
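
For readers unfamiliar with indentation stops, here is a minimal stand-alone model of the behavior described above. It is only a sketch under simplifying assumptions and not how RE/flex implements it: the IndentStops class, the Indent enum, and the classify and depth members are hypothetical, and the insert_stop member merely borrows the name of the matcher call mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model for illustration only; not the RE/flex matcher.
enum class Indent { Align, In, Out };

class IndentStops {
 public:
  // Register an extra stop, e.g. the column where a value starts.
  // This simplified model requires stops to be strictly increasing.
  void insert_stop(std::size_t col)
  {
    assert(stops_.empty() || col > stops_.back());
    stops_.push_back(col);
  }

  // Classify the indentation column of a new line against the stops.
  Indent classify(std::size_t col)
  {
    if (col > top())
    {
      stops_.push_back(col);  // deeper than every stop: a new indent
      return Indent::In;
    }
    if (col == top())
      return Indent::Align;   // lands on the current stop: no new indent
    while (!stops_.empty() && stops_.back() > col)
      stops_.pop_back();      // drop every stop to the right of col
    return Indent::Out;
  }

  std::size_t depth() const { return stops_.size(); }

 private:
  // Column 0 acts as an implicit stop when no stops are recorded.
  std::size_t top() const { return stops_.empty() ? 0 : stops_.back(); }

  std::vector<std::size_t> stops_;
};
```

In this model, a line at column 4 produces an indent; if the parser then inserts a stop at column 6 (where the value after "- " starts), a following line at column 6 is classified as aligned rather than as a new indent, and a later line at column 0 closes both stops.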