bonede / tree-sitter-ng

Next generation Tree Sitter Java binding.
MIT License

Unable to parse emoji characters in source code correctly. #36

Closed: anooakv closed this issue 5 days ago

anooakv commented 1 month ago

Hi, I am trying to parse a JSON string that contains an emoji character, as in the README example: https://github.com/bonede/tree-sitter-ng?tab=readme-ov-file#api-tour

json = "[1, 😥]"

When I extract the text from the matched TSNode, the result has 2 extra null values appended. To get the text, I used the method from https://github.com/bonede/tree-sitter-ng/issues/19

Code:

public static void main(String[] args) {
            String jsonSource = "[1, 😥]";
            TSParser parser = new TSParser();
            TSLanguage json = new TreeSitterJson();
            parser.setLanguage(json);
            parser.parseStringEncoding(null, jsonSource, TSInputEncoding.TSInputEncodingUTF8);
            TSTree tree = parser.parseString( null, jsonSource);
            // traverse the AST tree with DOM like APIs
            TSNode rootNode = tree.getRootNode();
            // or traverse the AST with a cursor
            TSTreeCursor rootCursor = new TSTreeCursor(rootNode);
            rootCursor.gotoFirstChild();
            // or query the AST with S-expression
            TSQuery query = new TSQuery(json, "((document) @root)");
            TSQueryCursor cursor = new TSQueryCursor();
            cursor.exec(query, rootNode);
            TSQueryMatch match = new TSQueryMatch();
            while(cursor.nextMatch(match)){
                // do something with the match
                TSQueryCapture[] captures = match.getCaptures();
                System.out.println("---------------------------------");
                for (TSQueryCapture capture : captures) {
                    TSNode node = capture.getNode();
                    byte[] bytes = jsonSource.getBytes(StandardCharsets.UTF_8);
                    System.out.println("Byte array size: "+bytes.length);
                    int startByte = node.getStartByte();
                    int endByte = node.getEndByte();
                    System.out.println("start: "+startByte+", end: "+endByte);
                    byte[] nodeBytes = Arrays.copyOfRange(bytes, startByte, endByte);
                    String text = new String(nodeBytes, StandardCharsets.UTF_8);
                    System.out.println("Node Text:\n " + text);
                }
            }
        }

Output:

Byte array size: 9
start: 0, end: 11
Node Text:
 [1, 😥]nullnull

There is also an encoding mismatch. The encoded byte array has 9 bytes, which is expected, but the TSNode reports startByte = 0 and endByte = 11. Those 2 extra bytes are what break the text extraction. Can you please help me parse and extract such code correctly?
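
The numbers above fit one possible explanation, which is an assumption and not something confirmed in this thread: if the string is measured in Java's modified UTF-8 (the encoding used by JNI string functions and by DataOutputStream.writeUTF), a supplementary character such as U+1F625 takes 6 bytes instead of 4, which gives 11 rather than 9. The sketch below uses only the JDK to compare both lengths, and to show that Arrays.copyOfRange zero-pads a slice that runs past the end of the array, which would account for the two trailing null characters. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8LengthDemo {
    /** Length of s in standard UTF-8 bytes. */
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Length of s in Java's modified UTF-8, measured via DataOutputStream.writeUTF. */
    public static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // writeUTF prepends a 2-byte length header that is not part of the text.
        return buffer.size() - 2;
    }

    /** Slices the UTF-8 bytes of source the same way the code in the issue does. */
    public static String slice(String source, int startByte, int endByte) {
        byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
        // copyOfRange pads with zero bytes when endByte exceeds bytes.length.
        byte[] nodeBytes = Arrays.copyOfRange(bytes, startByte, endByte);
        return new String(nodeBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Same string as in the issue; the emoji is built from its code point U+1F625.
        String json = "[1, " + new String(Character.toChars(0x1F625)) + "]";
        System.out.println("UTF-8 bytes: " + utf8Length(json));                   // 9
        System.out.println("Modified UTF-8 bytes: " + modifiedUtf8Length(json));  // 11
        System.out.println("Two trailing NULs: " + slice(json, 0, 11).endsWith("\0\0")); // true
    }
}
```

If this assumption holds, the offsets and the byte array simply come from two different encodings, so they only disagree once the source contains a character outside the Basic Multilingual Plane.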

sepatel commented 1 month ago

The encoded byte array has 9 bytes, which is expected, but the TSNode reports startByte = 0 and endByte = 11. Those 2 extra bytes are what break the text extraction. Can you please help me parse and extract such code correctly?

Have you tried upgrading to 0.22.6a? We had that problem until we upgraded. We did have to put in a hack to keep the 0.22.6 version from being used so that 0.22.6a would be used instead, but that fixed this exact same issue we had with Unicode characters.

anooakv commented 1 month ago

I upgraded tree-sitter to version 0.22.6a and am using the tree-sitter-json:0.21.0a grammar.

dependencies {
    // add tree sitter
    implementation 'io.github.bonede:tree-sitter:0.22.6a'
    implementation("io.github.bonede:tree-sitter-json:0.21.0a")
}

Because of tree-sitter-json:0.21.0a, Gradle also pulls in tree-sitter:0.22.6 rather than tree-sitter:0.22.6a, which creates a conflict between the two versions. The older version, 0.22.6, is declared in this POM file:

https://repo.maven.apache.org/maven2/io/github/bonede/tree-sitter-json/0.21.0a/tree-sitter-json-0.21.0a.pom

At runtime the older version, 0.22.6, is picked up, and the same error still occurs when parsing a string with emojis. Please help me resolve this issue.

bonede commented 1 month ago

Use a strict version (note the !! suffix):

    implementation 'io.github.bonede:tree-sitter:0.22.6a!!'

$ ./gradlew dependencies --configuration runtimeClasspath

> Task :dependencies

------------------------------------------------------------
Root project 'hello-world'
------------------------------------------------------------

runtimeClasspath - Runtime classpath of source set 'main'.
+--- io.github.bonede:tree-sitter:{strictly 0.22.6a} -> 0.22.6a
\--- io.github.bonede:tree-sitter-json:0.21.0a
     \--- io.github.bonede:tree-sitter:0.22.6 -> 0.22.6a

(*) - dependencies omitted (listed previously)

A web-based, searchable dependency report is available by adding the --scan option.

BUILD SUCCESSFUL in 2s
1 actionable task: 1 executed

shiomiyan commented 1 month ago

Probably the same issue (I am not familiar with the Java ecosystem, so please forgive me if I have made any mistakes).

In my case, I use a JavaScript parser (tree-sitter-javascript) to reproduce it. The following code reproduces the issue:

package com.example;

import static org.assertj.core.api.Assertions.assertThat;

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.junit.jupiter.api.Test;
import org.treesitter.TSInputEncoding;
import org.treesitter.TSLanguage;
import org.treesitter.TSNode;
import org.treesitter.TSParser;
import org.treesitter.TSTree;
import org.treesitter.TreeSitterJavascript;

public class ExampleTest {
    @Test
    void emojiTest() {
        TSParser parser = new TSParser();
        TSLanguage javascript = new TreeSitterJavascript();
        parser.setLanguage(javascript);

        String code = """
                // 😭
                foo();
                """;

        TSTree tree = parser.parseStringEncoding(null, code, TSInputEncoding.TSInputEncodingUTF8);
        TSNode rootNode = tree.getRootNode();

        TSNode commentNode = rootNode.getChild(0);
        int startByte = commentNode.getStartByte();
        int endByte = commentNode.getEndByte();

        byte[] codeBytes = code.getBytes(StandardCharsets.UTF_8);
        byte[] commentNodeBytes = Arrays.copyOfRange(codeBytes, startByte, endByte);
        String commentString = new String(commentNodeBytes, StandardCharsets.UTF_8);

        assertThat(commentString).isEqualTo("// 😭");
    }
}

The test result is as follows:

expected: 
  "// 😭"
 but was: 
  "// 😭
  f"
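
This result fits the same pattern, again as an assumption and not a confirmed diagnosis: if every character outside the Basic Multilingual Plane is counted as 6 bytes instead of 4, each offset after the emoji drifts by 2. The end of the comment would then move from byte 7 to byte 9, and the slice would pick up the newline and the f of foo. The sketch below uses only the JDK; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class OffsetDriftDemo {
    /** Assumed drift in bytes: 2 for every supplementary code point in text. */
    public static int drift(String text) {
        long supplementary = text.codePoints()
                .filter(Character::isSupplementaryCodePoint)
                .count();
        return (int) (2 * supplementary);
    }

    /** Decodes the UTF-8 bytes of source in the range [startByte, endByte). */
    public static String sliceUtf8(String source, int startByte, int endByte) {
        byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
        return new String(Arrays.copyOfRange(bytes, startByte, endByte), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Same source as the test above; the emoji is built from its code point U+1F62D.
        String comment = "// " + new String(Character.toChars(0x1F62D));
        String code = comment + "\nfoo();\n";

        int correctEnd = comment.getBytes(StandardCharsets.UTF_8).length; // 7
        int driftedEnd = correctEnd + drift(comment);                      // 9

        // The correct offset yields exactly the comment.
        System.out.println(sliceUtf8(code, 0, correctEnd).equals(comment));        // true
        // The drifted offset also captures the newline and the first letter of foo.
        System.out.println(sliceUtf8(code, 0, driftedEnd).equals(comment + "\nf")); // true
    }
}
```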

And my pom.xml is as follows:

(SNIP)
    <dependencies>
        <dependency>
            <groupId>io.github.bonede</groupId>
            <artifactId>tree-sitter-javascript</artifactId>
            <version>0.21.2</version>
            <exclusions>
                <exclusion>
                    <groupId>io.github.bonede</groupId>
                    <artifactId>tree-sitter</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>io.github.bonede</groupId>
            <artifactId>tree-sitter</artifactId>
            <version>0.22.6a</version>
        </dependency>
        (SNIP)
    </dependencies>
(SNIP)

Running ./mvnw dependency:tree in this case produces the following output:

▶ ./mvnw dependency:tree
[INFO] Scanning for projects...
[INFO] 
[INFO] --------------------------< com.example:demo >--------------------------
[INFO] Building demo 1.0-SNAPSHOT
[INFO]   from pom.xml
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- dependency:3.7.0:tree (default-cli) @ demo ---
[INFO] com.example:demo:jar:1.0-SNAPSHOT
[INFO] +- io.github.bonede:tree-sitter-javascript:jar:0.21.2:compile
[INFO] +- io.github.bonede:tree-sitter:jar:0.22.6a:compile
[INFO] +- org.junit.jupiter:junit-jupiter-api:jar:5.10.3:test
[INFO] |  +- org.opentest4j:opentest4j:jar:1.3.0:test
[INFO] |  +- org.junit.platform:junit-platform-commons:jar:1.10.3:test
[INFO] |  \- org.apiguardian:apiguardian-api:jar:1.1.2:test
[INFO] \- org.assertj:assertj-core:jar:3.26.3:test
[INFO]    \- net.bytebuddy:byte-buddy:jar:1.14.18:test
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  1.177 s
[INFO] Finished at: 2024-07-16T19:13:50+09:00
[INFO] ------------------------------------------------------------------------

shiomiyan commented 1 month ago

I used parseString instead of parseStringEncoding and the problem was solved. Sorry for the disturbance.

bonede commented 1 month ago

Please upgrade to 0.22.6.1. This should fix the versioning issue.