RenjiSann / tree-sitter-xml

XML Grammar for Tree-Sitter
MIT License

xml-stylesheet tag isn't highlighted and breaks everything that follows #1

Open kanashimia opened 7 months ago

kanashimia commented 7 months ago
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="test.xsl"?>
<doc>
  <title>Document Title</title>
  <chapter>
    <title>Chapter Title</title>
    <section>
      <title>Section Title</title>
      <para>This is a test.</para>
      <note>This is a note.</note>
    </section>
    <section>
      <title>Another Section Title</title>
      <para>This is <emph>another</emph> test.</para>
      <note>This is another note.</note>
    </section>
  </chapter>
</doc>

I ran into this in Helix, which uses this grammar as far as I can see.

datho7561 commented 7 months ago

I also just ran into this in Helix. As a workaround, you can set the language to HTML in Helix, but it would be nice to fix this grammar.

It seems that adding any <? ... ?> construct other than the <?xml ... ?> declaration at the beginning of the document breaks parsing.

datho7561 commented 7 months ago

Here's the problem: https://github.com/RenjiSann/tree-sitter-xml/blob/main/grammar.js#L79

The spec uses - to denote set difference: everything the left operand matches, except what the right operand matches. In this case, I think it means any string of length 0 or more, except strings that contain the substring ?>. The current grammar, however, parses - as a literal character. I can't think of a way to translate the spec into a regex that doesn't use lookahead. I'm guessing the best bet is to use a heuristic like [^?]* to match the processing instruction content.
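
For comparison, here is a rough sketch of one possible lookahead-free pattern for "any string that does not contain ?>", checked against the [^?]* heuristic. This is not taken from grammar.js: the names CONTENT, HEURISTIC and the two helper functions are made up for illustration, and Python's re module is used only as a quick way to try the idea. Whether the pattern fits tree-sitter's regex subset and tokenization is an assumption that has not been tested here.

```python
import re

# Hypothetical pattern (not from grammar.js): strings that never contain "?>".
#   [^?]       any character other than '?'
#   \?+[^?>]   a run of '?' followed by a character that is neither '?' nor '>'
#   \?*        an optional run of '?' at the very end of the content
CONTENT = re.compile(r"(?:[^?]|\?+[^?>])*\?*")

# The simpler heuristic from the comment above: no '?' allowed at all.
HEURISTIC = re.compile(r"[^?]*")


def accepted_by_content(text: str) -> bool:
    """True if the whole string matches the lookahead-free pattern."""
    return CONTENT.fullmatch(text) is not None


def accepted_by_heuristic(text: str) -> bool:
    """True if the whole string matches the [^?]* heuristic."""
    return HEURISTIC.fullmatch(text) is not None


if __name__ == "__main__":
    samples = [
        'type="text/xsl" href="test.xsl"',      # content from the issue
        'type="text/xsl" href="test.xsl?v=1"',  # a lone '?' inside the content
        'a ?> b',                               # contains the terminator
    ]
    for sample in samples:
        print(repr(sample), accepted_by_content(sample), accepted_by_heuristic(sample))
```

The difference shows up on content that has a ? in it, such as a query string in href: the heuristic rejects it, while the longer pattern accepts it and still rejects anything containing ?>. If something like this were used in the grammar, the closing delimiter would probably also have to tolerate content that ends in ?, which is not covered by this sketch.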