Wilfred / difftastic

a structural diff that understands syntax 🟥🟩
https://difftastic.wilfred.me.uk/
MIT License
20.54k stars 331 forks source link

Significant whitespace changes (e.g. in Python) are ignored #587

Open lazy opened 11 months ago

lazy commented 11 months ago

image

I verified via --dump-syntax that ASTs are significantly different. Can compare them using pr -m -t -w 160 <(difft --dump-syntax a.py) <(difft --dump-syntax b.py)

Issue duplicated as text:

$ cat a.py
def test(a, b):
    if a:
        print(a)
        if b:
            print (b)
$ cat b.py
def test(a, b):
    if a:
        print(a)
    if b:
        print (b)
$ diff -u a.py b.py
--- a.py        2023-10-25 12:42:54.671789363 +0000
+++ b.py        2023-10-25 12:43:07.762393299 +0000
@@ -1,5 +1,5 @@
 def test(a, b):
     if a:
         print(a)
-        if b:
-            print (b)
+    if b:
+        print (b)
$ difft a.py b.py
b.py --- Python
No syntactic changes.

Using the latest x64 build from releases page in github.

$ difft --version
Difftastic 0.52.0 (b07e519 2023-10-08)
Wilfred commented 11 months ago

Oh my goodness, that's awful. Thanks for the report.

Wilfred commented 11 months ago

This is a display issue: the diffing logic finds changes, but since Python uses indentation for blocks it's not finding anything to display in this case.

$ DFT_LOG=trace difft a.py b.py                                          
 2023-10-25T15:51:54.861Z INFO  difft::options > CLI arguments: ["a.py", "b.py"]
 2023-10-25T15:51:54.875Z INFO  difft::diff::dijkstra > LHS nodes: 16 (1 toplevel), RHS nodes: 16 (1 toplevel)
 2023-10-25T15:51:54.875Z INFO  difft::diff::dijkstra > Saw 58 vertices (a Vertex is 80 bytes), arena consumed 7424 bytes, with 33 vertices left on heap.
 2023-10-25T15:51:54.875Z DEBUG difft::diff::dijkstra > Found a path of 9 with cost 709.
 2023-10-25T15:51:54.875Z DEBUG difft::diff::dijkstra > Initial 5 items on path: [
    "line:2  ...          line:2  ...          --- 100 EnterUnchangedDelimiter { depth_difference: 0 }",
    "line:2 if            line:2  ...          --- 300 EnterNovelDelimiterRHS",
Wilfred commented 10 months ago

The display logic is assuming that you can't have a structural change without having some novel tokens, which isn't true here.

Every token has a matching token on the other side, which is interpreted as no changes. See hunks::sorted_novel_positions.

Wilfred commented 10 months ago

E.g. here's how delimiters are handled in the analogous JS.

Screenshot from 2023-11-11 10-59-34

Wilfred commented 10 months ago

Should probably mark the first node in the changed subtree as novel when this occurs.

ztane commented 6 months ago

The Python standard library tokenizer module (which you can test with python3 -mtokenize program.py) does produce INDENT and DEDENT tokens; however the dedent tokens it produces have always zero-width lexed content, yes, but the tokenizer marks them found at the new indent depth; however the lexed text for indent token contains the total whitespace from column 1 to the beginning of indented block.

Smaug123 commented 2 months ago

Would you like a separate issue filed for the corresponding error in F#, which is also whitespace-sensitive? I couldn't find one from a quick search, but would like to note the following test case (translated from the Python original).

let test a b =
    if a then
        printfn "%b" a
        if b then
            printfn "%b" b

and

let test a b =
    if a then
        printfn "%b" a
    if b then
        printfn "%b" b

displays the same "No syntactic changes" (Difftastic 0.59.0 (built with rustc 1.78.0)).

Wilfred commented 2 months ago

Thanks for the sample code 🙂

No, fixing this will affect all languages that have significant whitespace, it doesn't need a separate issue.