camdencheek / tree-sitter-dockerfile

A tree-sitter grammar for Dockerfile
MIT License
73 stars 22 forks source link

Disambiguate between two kinds of line continuations #6

Closed mjambon closed 2 years ago

mjambon commented 2 years ago

Hi,

I'm the lead author of ocaml-tree-sitter and I'm working on supporting dockerfile syntax in semgrep. I ran into an issue that doesn't pose a problem to the tree-sitter parser itself but creates an ambiguity that prevents ocaml-tree-sitter from working.

This PR introduces a new name for the line continuations that appear explicitly in the shell_command rule, distinguishing it from the line continuations that can occur anywhere as declared by the extras field. This makes ocaml-tree-sitter happy, allowing the recovery of a fully typed tree.

https://github.com/returntocorp/ocaml-tree-sitter-core/pull/20 is related to this.

camdencheek commented 2 years ago

Hi @mjambon! Thanks for your work on semgrep.

The change looks good to me. The alias to keep from breaking consumers of the syntax tree is nice.

Thanks for the contribution!

camdencheek commented 2 years ago

I went ahead and released this as v0.1.1 so you'll have a release to build against.

I expect a new minor version in the next couple of weeks with heredoc and custom escape directive support, but I also can't guarantee there will be no breaking changes in order to implement that, since it'll require dropping down into C and rethinking some of how this does shell parsing. All that to say, now you have a version to pin against that won't include unexpected changes.

mjambon commented 2 years ago

Great, thank you!

Note that for shell parsing in semgrep, we call the tree-sitter-bash parser on the shell command extracted with the dockerfile parser. The dockerfile grammar could "piggyback" on the tree-sitter-bash implementation, i.e. use it but not extend it since the root rule is different. Personally, I would prefer the current solution because it would require semgrep to use the same exact version of tree-sitter-bash when parsing standalone bash and when parsing dockerfile, since we don't want to duplicate the code for integrating bash.

We'd need a separate parsing mode for Windows-docker though, since on Windows a different shell is assumed. I also just saw this note about parsing directives which probably is out of scope for tree-sitter-dockerfile.

camdencheek commented 2 years ago

The dockerfile grammar could "piggyback" on the tree-sitter-bash implementation, i.e. use it but not extend it since the root rule is different.

I don't plan to change this 🙂 I wrote this primarily for highlighting support in neovim, and that's exactly how that works as well (defers shell parsing to the tree-sitter-bash grammar). The thing that I might need to change is "what handles linebreaks".

Right now, the dockerfile grammar consumes escaped linebreaks, which works okay because the syntax for escaped shell linebreaks is the same as the syntax for escaped dockerfile linebreaks. However, for the heredoc syntax, there is no need to escape newlines between commands, which breaks some of the assumptions of this grammar.

But, in any case, the dockerfile grammar will always just parse as far as a "shell" node, and defer any additional parsing to another grammar.

I also just saw this note about parsing directives which probably is out of scope for tree-sitter-dockerfile.

I agree for the most part that directives are out of scope, but I do want to add support for the escape directive. It's pretty widely used for Windows, and currently, this parser breaks pretty dramatically if the escape character is changed.

This also plays into the linebreak stuff mentioned above