Closed tvrinssen closed 6 months ago
Very exciting, thanks for the PR! Will work through a review in the next couple days
@tvrinssen @camdencheek Thank you for your fixing! I faced the error and this PR looks to resolve it. I'm waiting to merge this PR ✨
Error in decoration provider treesitter/highlighter.win:
Error executing lua: ...1/.local/share/nvim/runtime/lua/vim/treesitter/query.lua:248: Query error at 9:4. Invalid node type "heredoc_end":
(heredoc_end) @injection.language)
https://github.com/camdencheek/tree-sitter-dockerfile/issues/4
@tvrinssen do you plan to follow up on this PR? If not, I'm happy to make the required changes, just wanted to give you the chance to take a pass at it first.
Is there anything someone else could do to get this moving again? I'm not confident in my ability to address your buffer/allocation related comments, but I would love to see this get merged in. :)
Hey, uh, sorry for the massive delay.
I ran into a conflict when I tried to move the heredoc_block
inside the instruction node and couldn't figure out a solution. During christmas and new year, I completely forgot about this.
Anyway, enough excuses, I've figured out a solution for the conflict by using a new token (heredoc_nl
) instead of a plain \n
.
Without the new token, you'd get something like this:
Unresolved conflict for symbol sequence:
'run_instruction_token1' shell_command • '
' …
Possible interpretations:
1: (run_instruction 'run_instruction_token1' shell_command • run_instruction_repeat2)
2: (run_instruction 'run_instruction_token1' shell_command) • '
' …
Possible resolutions:
1: Specify a left or right associativity in `run_instruction`
2: Add a conflict for these rules: `run_instruction`
tree-sitter is complaining that a line break at the end of a run_instruction
could either branch off into a heredoc_block
(since it's repeat($.heredoc_block)
it's called run_instruction_repeat2
here) or end the run_instruction
since all instructions are chained together with \n
in the source_file
rule.
To avoid the conflict, I replaced \n
in the heredoc_block
rule with $.heredoc_nl
and modified the scanner to only match that token if we're expecting a heredoc after the instruction. Since the external scanner has a higher priority than tree-sitter's internal scanner, this solves the conflict without adding too much complexity.
I've also run tree-sitter generate
with tree-sitter 0.22.5 which updated the bindings in addition to the usual files. Feel free to replace those changes with the tree-sitter generate
output from an older version if you don't want them.
Hey, uh, sorry for the massive delay.
No worries at all! Thanks for coming back to it.
The fixes look good to me -- I found a couple of small issues which I went ahead and fixed before merging. Thanks again for the contribution!
This PR adds support for heredocs (the
<<EOF
blocks) using an external scanner written C. I used C instead of C++ to avoid introducing new dependencies (like libstdcxx).The scanner currently has two arbitrary limits that I've added to keep the memory management simple: It can only track up to 10 open heredocs at a time. I've never seen anyone use more than 3 so this seems more than enough and having a static size for the array means I don't have to resize at any point. The second limit stems from the fact that C doesn't have a proper string type (like
std::string
) and I didn't want to dynamically resize the memory so I just reserve 128 bytes for the heredoc marker (the string behind the<<
characters). This seems like a fair compromise that reserves enough space for any markers I've actually seen used (most are far shorter likeEOT
,EOF
,SCRIPT
, etc.) and wasting space with oversized allocations. Finally, even if I removed these limits, we'd still be limited to 1024 bytes for our internal state by tree-sitter (TREE_SITTER_SERIALIZATION_BUFFER_SIZE
).With that out of the way, here are the important implementation details:
heredoc_marker
,heredoc_content
andheredoc_end
.error_sentinel
is only used to detect error states as recommended by tree-sitter's documentation.heredoc_marker
is roughly equivalent to the regex<<-?(?:"[^"]+"|'[^']+'|[^\s]+)
. However, it also stores any matched marker in the scanner state for end marker matching.heredoc_content
is equivalent to[^\n]*]
but only matches ifheredoc_end
doesn't match.heredoc_end
only matches the currently active heredoc marker and removes it from the list of active markers once it matches.scan_marker
) or it's parsing a heredoc (scan_content
). The main entry point for the scanner istree_sitter_dockerfile_external_scanner_scan
which decides which mode to use based on the currently valid symbols. If tree-sitter is performing error recovery, we decide the mode based on whether we're currently inside a heredoc or not.I've updated the grammar to add the
heredoc_marker
symbol where relevant and I've added a newheredoc_block
rule to match theheredoc_content
andheredoc_end
symbols between instructions. To make sure that the scanner is working as intended, I've also added some heredoc examples to the test corpus. I wrote a few myself and the rest are taken from BuildKit's unit tests. Finally, I also added the new symbols tohighlights.scm
.tree-sitter test
passes and I've been this with Helix to edit many differentDockerfile
s without issues. Even complex cases like the following work fine: