HuoLanguage / huo

interpreted language written in C
MIT License

Better parser? #23

Closed TheLoneWolfling closed 8 years ago

TheLoneWolfling commented 8 years ago

If you wish, I could see about replacing things with a proper packrat / recursive descent + cut parser.

Currently parsing is rather hairy. For instance, (let x 1 [ ) ] compiles and runs (!). At a bare minimum, the parser should check nesting.
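As a rough illustration of the "check nesting" minimum, here is a small sketch in C. It is not Huo's actual parser code: the function name nesting_is_balanced and the MAX_DEPTH limit are made up for this example, and it assumes the only bracket pairs are ( ) and [ ]. It also ignores string literals, which a real check would have to skip over.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPTH 256   /* arbitrary nesting limit for this sketch */

/* Hypothetical helper, not part of Huo: returns true only if every
 * '(' and '[' in the NUL-terminated string src is closed by the
 * matching ')' or ']' in the right order. */
bool nesting_is_balanced(const char *src)
{
    char expected[MAX_DEPTH];   /* stack of closers we are waiting for */
    size_t depth = 0;

    for (size_t i = 0; src[i] != '\0'; i++) {
        char c = src[i];
        if (c == '(' || c == '[') {
            if (depth == MAX_DEPTH)
                return false;   /* too deeply nested for this sketch */
            expected[depth++] = (c == '(') ? ')' : ']';
        } else if (c == ')' || c == ']') {
            /* A closer with nothing open, or the wrong closer, is an error. */
            if (depth == 0 || expected[--depth] != c)
                return false;
        }
    }
    return depth == 0;          /* anything still open is also an error */
}
```

With this check, (let x 1 [ ) ] is rejected at the ) because the innermost open bracket is [ and the stack is expecting ].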

incrediblesound commented 8 years ago

This is a good idea. Basically, the part I care most about is how the AST is interpreted. I'm not terribly concerned with how the parser or tokenizer works. It would be nice, however, to keep things generally minimal and easy to understand. Maybe for now we can just improve the parser a little bit to be more robust, and if the language continues to improve and develop down the line we could consider replacing it entirely with something better.

TheLoneWolfling commented 8 years ago

Sounds good, then.

TheLoneWolfling commented 8 years ago

More "interesting" programs:

(0(for .`....4..................4(4.....

(Note: contains unprintable characters. Hex:

00000000  28 30 28 66 6f 72 20 2e  60 2e 2e 2e 2e 34 2e 01  |(0(for .`....4..|
00000010  2e 2e 2e 2e 2e 2e 2e 2e  2e 2e 2e 2e 2e 2e 2e 2e  |................|
00000020  2e 1e 34 28 34 2e 0e 2e  2e 2e 2e                 |..4(4......|

) ...not only does this tokenize (and parse!) but it hangs (due to executing a for loop 1082130432 - 4 times (!)).

incrediblesound commented 8 years ago

Haha oh my.... that's actually pretty impressive.

TheLoneWolfling commented 8 years ago

Such are the perils of fuzzing... You get really weird inputs, and you need to figure out what on Earth is actually going on. In this case, it's a parser issue. atol works given valid input, but it does no error checking on "weird" input: it returns 0 when there is nothing to convert, and the behavior is undefined if the value is out of range. Use strtol instead. (I've got a commit incoming that does just that...)
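For reference, a minimal sketch of what checked number parsing with strtol can look like. This is not the commit mentioned above: the helper name parse_long_token is hypothetical, and it assumes the tokenizer hands over a NUL-terminated token that should be a base-10 integer in its entirety.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper, not part of Huo: converts a whole token to a long.
 * Returns false (leaving *out untouched) on any malformed input. */
bool parse_long_token(const char *token, long *out)
{
    char *end;

    errno = 0;                       /* strtol only sets errno on failure */
    long value = strtol(token, &end, 10);

    if (end == token)
        return false;                /* no digits were consumed */
    if (*end != '\0')
        return false;                /* trailing junk, e.g. "4." */
    if (errno == ERANGE)
        return false;                /* value does not fit in a long */

    *out = value;
    return true;
}
```

Note that strtol itself skips leading whitespace and accepts an optional sign, so a stricter tokenizer may want to reject those cases before calling it.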

Actually, on the topic of tokenizer / parser low-hanging fruit:

  1. Do you want something like (print0) to "work"? In general, do you want whitespace to be optional where unambiguous?
  2. Do you want an optional trailing comma on lists?
  3. What do you want to separate tokens? Spaces? Tabs? Newlines? Unicode whitespace?
  4. ASCII or proper Unicode support? Not as difficult as it sounds, at least when it comes to the basics.
  5. Do you want lists to have to be comma-separated? Or no? Currently the parser doesn't actually enforce this.
  6. Do you want multiple commas in a row to parse or to error out? Or be treated as "invalid" elements? Or what?