GumTreeDiff / gumtree

An awesome code differencing tool
https://github.com/GumTreeDiff/gumtree/wiki
GNU Lesser General Public License v3.0

python-treesitter-ng ignores differences in string literals #371

Closed hkandjimi closed 2 months ago

hkandjimi commented 2 months ago

Good day @jrfaller, I have been playing around with GumTreeDiff on Python files, and it seems that the python-treesitter-ng parser ignores differences in string literals, e.g. print('hellow') -> print('hello') is not reported. However, when I check with the default parser, it recognises the diff. I would also like to know whether there is a way to use python-treesitter (only) as the generator instead of python-treesitter-ng?
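To make the symptom concrete, here is a minimal sketch of why a change inside a string literal is a label-only change. It uses Python's standard `ast` module as a stand-in, not tree-sitter or GumTree, and the helper names `shape` and `labelled` are made up for illustration. The point is only that the two snippets have the same tree structure, so a differ that loses the literal's label has nothing left to report.

```python
import ast


def shape(node):
    """Nested (type, children) tuples: node types only, all labels dropped."""
    return (
        type(node).__name__,
        tuple(shape(child) for child in ast.iter_child_nodes(node)),
    )


def labelled(node):
    """Like shape(), but keep constant values and identifier names as labels."""
    if isinstance(node, ast.Constant):
        label = node.value
    else:
        label = getattr(node, "id", None)
    return (
        type(node).__name__,
        label,
        tuple(labelled(child) for child in ast.iter_child_nodes(node)),
    )


before = ast.parse("print('hellow')")
after = ast.parse("print('hello')")

# Without labels the trees are indistinguishable; with labels they differ.
same_shape = shape(before) == shape(after)
same_labels = labelled(before) == labelled(after)
print(same_shape, same_labels)  # prints "True False"
```

Whether the -ng generator actually dropped the label in this way is an assumption on my side; the sketch only shows that such a loss would produce exactly this behaviour.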

jrfaller commented 2 months ago

Hi @hkandjimi !

Thanks a lot for reporting that. There was a huge regression in the interpretation of the rule files that guide the cleaning of the CST generated by tree-sitter when we switched from tree-sitter's Python bindings to its Java bindings. It should be resolved in the latest commit, you can try it.

Also don't hesitate to stress test our Python tree-sitter backend because it's kind of new :-)

Cheers!

hkandjimi commented 2 months ago

Sweet, tested it and it works like a charm. Thanks @jrfaller! How does one become a contributor to the project? I would be interested in playing around with the Python actions, and I was wondering: if I made a change to the codebase, do I just push it, or does it need a review from someone?

jrfaller commented 2 months ago

Hi @hkandjimi! We are always open to good pull-requests ;-)

Don't hesitate to use the repo's discussions channel before doing any work, to ensure it's in the scope of what we want to achieve in the project. Regarding Python, the treesitter-ng backend should become the default one, since I want to eliminate as many non-Java dependencies as possible. The rewriting of the treesitter-ng CST into a GumTree AST is guided by this file: https://github.com/GumTreeDiff/gumtree/blob/main/gen.treesitter-ng/src/main/resources/rules.yml (section python), where we try to eliminate useless nodes (ignored), drop useless node labels (label_ignored), use node types that are more consistent with what is expected from a diff (aliased), and simplify some nodes by not traversing their children (flattened). Of course, the Python grammar is vast, so the current rules are probably not sufficient, and looking into them would be a good place to start!
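For anyone who wants a mental model of the four rule kinds before opening the file, here is a rough sketch in Python. It is not GumTree's code and does not reproduce the real rules.yml: the `Node` class, the `clean` function, and every entry in `RULES` are hypothetical, and the exact semantics are assumptions (for instance, that an ignored node is dropped together with its subtree).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    type: str
    label: str = ""
    children: list = field(default_factory=list)


# Hypothetical rules: the four keys mirror the names above, the entries are invented.
RULES = {
    "ignored": {"(", ")"},              # nodes removed from the tree
    "label_ignored": {"call"},          # nodes kept, label cleared
    "aliased": {"identifier": "name"},  # node type renamed
    "flattened": {"string"},            # node kept, children not traversed
}


def clean(node, rules=RULES):
    """Return a cleaned copy of node, or None if the node is ignored."""
    if node.type in rules["ignored"]:
        return None
    new_type = rules["aliased"].get(node.type, node.type)
    label = "" if node.type in rules["label_ignored"] else node.label
    if node.type in rules["flattened"]:
        # Keep the node and its label, but stop descending here.
        return Node(new_type, label)
    kept = [clean(child, rules) for child in node.children]
    return Node(new_type, label, [c for c in kept if c is not None])


# Hand-written stand-in for the CST of print('hello').
cst = Node("call", "print('hello')", [
    Node("identifier", "print"),
    Node("argument_list", "", [
        Node("("),
        Node("string", "'hello'", [
            Node("string_start", "'"),
            Node("string_content", "hello"),
            Node("string_end", "'"),
        ]),
        Node(")"),
    ]),
])

cleaned = clean(cst)
```

In this toy version the `string` node survives as a single leaf that still carries `'hello'` as its label, which is what a differ needs in order to notice a changed literal.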

Cheers!