Support heredocs (closes #4)

tvrinssen commented 1 year ago

This PR adds support for heredocs (the <<EOF blocks) using an external scanner written C. I used C instead of C++ to avoid introducing new dependencies (like libstdcxx).

The scanner currently has two arbitrary limits that I've added to keep the memory management simple: It can only track up to 10 open heredocs at a time. I've never seen anyone use more than 3 so this seems more than enough and having a static size for the array means I don't have to resize at any point. The second limit stems from the fact that C doesn't have a proper string type (like std::string) and I didn't want to dynamically resize the memory so I just reserve 128 bytes for the heredoc marker (the string behind the << characters). This seems like a fair compromise that reserves enough space for any markers I've actually seen used (most are far shorter like EOT, EOF, SCRIPT, etc.) and wasting space with oversized allocations. Finally, even if I removed these limits, we'd still be limited to 1024 bytes for our internal state by tree-sitter (TREE_SITTER_SERIALIZATION_BUFFER_SIZE).

With that out of the way, here are the important implementation details:

The scanner introduces three new symbols: heredoc_marker, heredoc_content and heredoc_end. error_sentinel is only used to detect error states as recommended by tree-sitter's documentation.
heredoc_marker is roughly equivalent to the regex <<-?(?:"[^"]+"|'[^']+'|[^\s]+). However, it also stores any matched marker in the scanner state for end marker matching.
heredoc_content is equivalent to [^\n]*] but only matches if heredoc_end doesn't match.
heredoc_end only matches the currently active heredoc marker and removes it from the list of active markers once it matches.
The scanner generally operates in two modes: It's either looking for a marker (scan_marker) or it's parsing a heredoc (scan_content). The main entry point for the scanner is tree_sitter_dockerfile_external_scanner_scan which decides which mode to use based on the currently valid symbols. If tree-sitter is performing error recovery, we decide the mode based on whether we're currently inside a heredoc or not.

I've updated the grammar to add the heredoc_marker symbol where relevant and I've added a new heredoc_block rule to match the heredoc_content and heredoc_end symbols between instructions. To make sure that the scanner is working as intended, I've also added some heredoc examples to the test corpus. I wrote a few myself and the rest are taken from BuildKit's unit tests. Finally, I also added the new symbols to highlights.scm.

tree-sitter test passes and I've been this with Helix to edit many different Dockerfiles without issues. Even complex cases like the following work fine:

FROM busybox

RUN cat <<EOF1 | tr '[:upper:]' '[:lower:]' > ./out1; \
    cat <<EOF2 | tr '[:lower:]' '[:upper:]' > ./out2
hello WORLD
EOF1
HELLO world
EOF2

camdencheek commented 1 year ago

Very exciting, thanks for the PR! Will work through a review in the next couple days

mopp commented 11 months ago

@tvrinssen @camdencheek Thank you for your fixing! I faced the error and this PR looks to resolve it. I'm waiting to merge this PR ✨

Error in decoration provider treesitter/highlighter.win:
Error executing lua: ...1/.local/share/nvim/runtime/lua/vim/treesitter/query.lua:248: Query error at 9:4. Invalid node type "heredoc_end":
  (heredoc_end) @injection.language)

https://github.com/camdencheek/tree-sitter-dockerfile/issues/4

camdencheek commented 11 months ago

@tvrinssen do you plan to follow up on this PR? If not, I'm happy to make the required changes, just wanted to give you the chance to take a pass at it first.

natw commented 8 months ago

Is there anything someone else could do to get this moving again? I'm not confident in my ability to address your buffer/allocation related comments, but I would love to see this get merged in. :)

tvrinssen commented 6 months ago

Hey, uh, sorry for the massive delay. I ran into a conflict when I tried to move the heredoc_block inside the instruction node and couldn't figure out a solution. During christmas and new year, I completely forgot about this. Anyway, enough excuses, I've figured out a solution for the conflict by using a new token (heredoc_nl) instead of a plain \n.

Without the new token, you'd get something like this:

Unresolved conflict for symbol sequence:

  'run_instruction_token1'  shell_command  •  '
'  …

Possible interpretations:

  1:  (run_instruction  'run_instruction_token1'  shell_command  •  run_instruction_repeat2)
  2:  (run_instruction  'run_instruction_token1'  shell_command)  •  '
'  …

Possible resolutions:

  1:  Specify a left or right associativity in `run_instruction`
  2:  Add a conflict for these rules: `run_instruction`

tree-sitter is complaining that a line break at the end of a run_instruction could either branch off into a heredoc_block (since it's repeat($.heredoc_block) it's called run_instruction_repeat2 here) or end the run_instruction since all instructions are chained together with \n in the source_file rule. To avoid the conflict, I replaced \n in the heredoc_block rule with $.heredoc_nl and modified the scanner to only match that token if we're expecting a heredoc after the instruction. Since the external scanner has a higher priority than tree-sitter's internal scanner, this solves the conflict without adding too much complexity.

I've also run tree-sitter generate with tree-sitter 0.22.5 which updated the bindings in addition to the usual files. Feel free to replace those changes with the tree-sitter generate output from an older version if you don't want them.

camdencheek commented 6 months ago

Hey, uh, sorry for the massive delay.

No worries at all! Thanks for coming back to it.

The fixes look good to me -- I found a couple of small issues which I went ahead and fixed before merging. Thanks again for the contribution!

camdencheek / tree-sitter-dockerfile

Support heredocs (closes #4) #45