GumTreeDiff / gumtree

An awesome code differencing tool
https://github.com/GumTreeDiff/gumtree/wiki
GNU Lesser General Public License v3.0

Unicode Problems? #267

Open codinuum opened 2 years ago

codinuum commented 2 years ago

GumTree (2656040) failed to parse the following file: DeleteMessage.java. It seems that some malformed code in the above source caused the failure.

Error while running client 'parse'.
java.nio.charset.MalformedInputException: Input length = 1
    at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
    at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
    at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.base/java.io.InputStreamReader.read(InputStreamReader.java:181)
    at java.base/java.io.BufferedReader.read1(BufferedReader.java:210)
    at java.base/java.io.BufferedReader.read(BufferedReader.java:287)
    at java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.base/java.io.BufferedReader.read1(BufferedReader.java:212)
    at java.base/java.io.BufferedReader.read(BufferedReader.java:287)
    at java.base/java.io.Reader.read(Reader.java:229)
    at com.github.gumtreediff.gen.jdt.AbstractJdtTreeGenerator.readerToCharArray(AbstractJdtTreeGenerator.java:44)
    at com.github.gumtreediff.gen.jdt.AbstractJdtTreeGenerator.generate(AbstractJdtTreeGenerator.java:64)
    at com.github.gumtreediff.gen.TreeGenerator.generateTree(TreeGenerator.java:41)
    at com.github.gumtreediff.gen.TreeGenerator$ReaderConfigurator.reader(TreeGenerator.java:119)
    at com.github.gumtreediff.gen.TreeGenerator$ReaderConfigurator.file(TreeGenerator.java:90)
    at com.github.gumtreediff.gen.TreeGenerator$ReaderConfigurator.file(TreeGenerator.java:100)
    at com.github.gumtreediff.gen.TreeGenerators.getTree(TreeGenerators.java:58)
    at com.github.gumtreediff.gen.TreeGenerators.getTree(TreeGenerators.java:70)
    at com.github.gumtreediff.client.ParseClient.getTreeContext(ParseClient.java:63)
    at com.github.gumtreediff.client.ParseClient.run(ParseClient.java:54)
    at com.github.gumtreediff.client.Run.startClient(Run.java:94)
    at com.github.gumtreediff.client.Run.main(Run.java:128)
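
For readers trying to reproduce this kind of failure outside GumTree, here is a minimal sketch of how a MalformedInputException can arise: bytes written in a single-byte encoding are decoded strictly as UTF-8. The class name, the helper isValidUtf8, and the sample text are made up for illustration; this is not GumTree code, and it only assumes that the file contains a non-ASCII character that was not saved as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class MalformedInputDemo {
    // Hypothetical helper: returns true if the bytes decode cleanly as UTF-8.
    // A freshly created CharsetDecoder reports malformed input instead of replacing it.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException
            return false;
        }
    }

    public static void main(String[] args) {
        // Made-up source line with an accented character (\u00e9), encoded as ISO-8859-1
        String line = "// supprim\u00e9 par l'utilisateur";
        byte[] latin1 = line.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = line.getBytes(StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 bytes valid as UTF-8? " + isValidUtf8(latin1));
        System.out.println("UTF-8 bytes valid as UTF-8?      " + isValidUtf8(utf8));
    }
}
```

A check like this could be run over a batch of files beforehand to find the ones that will not decode as UTF-8.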

codinuum commented 2 years ago

A possible workaround is attached: gumtree-unicode-fix.patch.txt

jrfaller commented 2 years ago

Hi @codinuum! Thanks for reporting this, and for the tentative patch.

Just to be sure, this is what I get when running file on it:

file DeleteMessage.java
DeleteMessage.java: Java source, ISO-8859 text

Did you try parsing the source with this charset instead of UTF-8, and does it work? If so, it might be better to have an option for the charset in this case, no?
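
To make the idea of a charset option concrete, here is a rough sketch of reading a source file with a caller-supplied charset that defaults to UTF-8. Everything in it (the class, resolveCharset, readSource, and passing the charset as a second command-line argument) is hypothetical and is not an existing GumTree option or API. It also assumes ISO-8859-1 for a file that file reports as "ISO-8859 text", which may not hold for every such file.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CharsetOptionSketch {
    // Hypothetical helper: resolves an optional charset name, falling back to UTF-8
    static Charset resolveCharset(String name) {
        return (name == null || name.isEmpty())
                ? StandardCharsets.UTF_8
                : Charset.forName(name);
    }

    // Reads the whole file into a char array using the given charset.
    // Note: new String(bytes, charset) silently replaces malformed input
    // rather than throwing, unlike a strict decoder.
    static char[] readSource(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset).toCharArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: CharsetOptionSketch <file> [charset]");
            return;
        }
        Path file = Paths.get(args[0]);
        Charset charset = resolveCharset(args.length > 1 ? args[1] : null);
        char[] source = readSource(file, charset);
        System.out.println("Read " + source.length + " chars as " + charset);
    }
}
```

With this sketch, something like `java CharsetOptionSketch DeleteMessage.java ISO-8859-1` would read the file as ISO-8859-1, while omitting the second argument keeps the UTF-8 default.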

codinuum commented 2 years ago

I only tried UTF-8, for some batch jobs.