Update jsdiff and simplify diff computation

tsmaeder commented 1 month ago

What it does

Fixes #13772

contributed on behalf of STMicroelectronics

How to test

Open electron.d.ts as described in the issue and verify the UI is not blocked for a significant time.
Edit a file and make sure the edits are represented properly in the "dirty diff annotation" you can invoke from the left editor gutter.

Follow-ups

I think the performance is currently good enough on relevant scenarios that we don't have to move the diff computation into a worker, but if we do, that would be a followup.

Review checklist

[x] As an author, I have thoroughly tested my changes and carefully followed the review guidelines

Reminder for reviewers

As a reviewer, I agree to behave in accordance with the review guidelines

tsmaeder commented 1 month ago

@pisv

pisv commented 4 weeks ago

@tsmaeder I have noticed that three of the existing tests in diff-computer.spec.ts now fail. (BTW, it looks like the tests for the scm package are not executed by yarn test, probably due to missing test scripts in scm/package.json.)

  33 passing
  3 failing
  1) dirty-diff-computer
       remove all lines:

      AssertionError: expected { changes: [ { …(2) } ] } to deeply equal { changes: [ { …(2) } ] }
      + expected - actual

       {
         "changes": [
           {
             "currentRange": {
      -        "end": 1
      +        "end": 0
               "start": 0
             }
             "previousRange": {
               "end": 10

  2) dirty-diff-computer
       add lines to empty file:

      AssertionError: expected { changes: [ { …(2) } ] } to deeply equal { changes: [ { …(2) } ] }
      + expected - actual

               "end": 3
               "start": 0
             }
             "previousRange": {
      -        "end": 1
      +        "end": 0
               "start": 0
             }
           }
         ]

  3) dirty-diff-computer
       multiple additions:

      AssertionError: expected { changes: [ { …(2) }, …(2) ] } to deeply equal { changes: [ { …(2) }, …(2) ] }
      + expected - actual

           }
           {
             "currentRange": {
               "end": 12
      -        "start": 11
      +        "start": 12
             }
             "previousRange": {
               "end": 10
               "start": 9

pisv commented 4 weeks ago

From the API point of view, I'm a bit concerned about the removal of readonly modifiers in LineRange. As described in https://github.com/eclipse-theia/theia/pull/13104#discussion_r1534048338, I'd favour a design where the entire DirtyDiff structure is effectively read-only for clients.

tsmaeder commented 3 weeks ago

From the API point of view, I'm a bit concerned about the removal of readonly modifiers in LineRange

It really does not matter, as the diffs are just descriptors into a structure: there is no way we can unsafely manipulate anything using that structure. This is typescript, not Java: typing does not add safety since you can cast anything to anything anyway. Do we want to tell clients that they may not manipulate the returned structure? No, because we don't care about that as we don't keep a reference anyway.

pisv commented 3 weeks ago

Yes, but DirtyDiff is used to notify clients about dirty diff update events, i.e. the same instance can be shared between multiple clients, and some of the clients can (and do) keep a reference to it... Not a big deal probably, but my concern is that the implementation seems to drive (and skew) the design in this case. For example, in the former design you could actually freeze the entire DirtyDiff structure if needed, which would no longer be possible in the new design... Just my 2c.

pisv commented 3 weeks ago

Also, I'm not sure about changing some of the existing tests to match the new implementation; in general, one would expect the other way round. At least, looking at the code it is not clear to me why the tests needed to be fixed. For example, in the third of the changed tests, the computed dirty diff was expected to contain two additions and one removal (as VS Code also computes in this case), but now it is expected to contain two additions and one modification. Am I missing something?

tsmaeder commented 3 weeks ago

@pisv the first two tests were simply wrong:

computeDirtyDiff(
            sequenceOfN(numberOfLines, () => 'TO-BE-REMOVED'),
            [''])

should create a "changed" delta with n lines removed and one (empty) line added. Same idea for the second case. One emtpy line is != no lines.

The last is is, if you ask me, an error in jsdiff: for the test case, it shows as the last two changes an "unchanged" of length one with as the last change a "added" of length one. Correclty, there should be an addition of length two, as we're adding two empty lines at the end. If I replace the emtpy lines with a non-empty string, the algorithm correctly outputs an addition of 2 elements at the end. As far as I can tell, the diff component still shows the correct thing when adding empty lines at the end of a file.

pisv commented 3 weeks ago

@tsmaeder When I comment out the following lines in the former implementation (current master), the diff-computer.spec.ts starts to fail on the very same three tests the new implementation failed (https://github.com/eclipse-theia/theia/pull/13787#issuecomment-2156611184):

https://github.com/eclipse-theia/theia/blob/d79b7101adef9db71e74c13744a3fde4f79a343c/packages/scm/src/browser/dirty-diff/diff-computer.ts#L41-L57

Notably, it is these lines that made the former implementation look more complex than it actually was, as noted in https://github.com/eclipse-theia/theia/pull/13104#discussion_r1534048338. And I guess these lines were there for a reason.

tsmaeder commented 3 weeks ago

@pisv actually, the last test is correct, too: the diff from jsdiff, looks like this:

[
  {
    "count": 5,
    "value": [
      "first line",
      "",
      "foo changed on master",
      "bar changed on master",
      ""
    ]
  },
  {
    "count": 1,
    "added": true,
    "value": [
      "NEW TEXT"
    ]
  },
  {
    "count": 3,
    "value": [
      "",
      "",
      ""
    ]
  },
  {
    "count": 1,
    "added": true,
    "value": [
      "last line"
    ]
  },
  {
    "count": 1,
    "value": [
      ""
    ]
  },
  {
    "count": 1,
    "removed": true,
    "value": [
      "last line"
    ]
  },
  {
    "count": 1,
    "added": true,
    "value": [
      ""
    ]
  }
]

This (correctly, IMO) translates into two inserts and a change with the indices mentioned above.

The fact that the tests did not fail before does not mean that the code was correct before: it just means that the tests expected the output of the actual implementation. Can you make an argument as to why the special cases were necessary before? The comments seem to be related to handling adds/removes at the beginning and end, which is not what is happening here: the sequence of edits is simply not what we would expect as humans.

pisv commented 3 weeks ago

Regarding the last test, here is what VS Code computes in this case:

This also is computed by Theia on the current master.

And here is what the new implementation computes:

As you can see, there is a difference. My main point, however, is that if you are really willing to remove this handling of the special cases in the code, there would be effectively no need to simplify the former implementation any further and no need to change the former design by removing the readonly modifiers in LineRange...

tsmaeder commented 3 weeks ago

@pisv what Theia does is correct as far as the underlying diff algorithm is concerned. If VS Code does something different, it's probably because it's not using jsdiff. "What VS Code" does is simply not an argument in this case. I still maintain the simplified behaviour is correct. The "special cases" are probably a workaround for the behaviour of jsdiff. Unfortunately, we won't ever be able to tell, because there is no reference to why this workaround exists and why we have never tried to get jsdiff to "fix" their implementation. That's why I insist on referencing relevant information (like issues) when the code does something which is unexpected and due to external differences. Anywas...I'm merging this one as a pure version update since we want it in the next IDE release.

pisv commented 3 weeks ago

@tsmaeder Thank you very much for updating jsdiff. This is really appreciated.

I also applaud you for taking your time with the other changes.

On the current master, after jsdiff update, the same tests still fail after removing the lines 41-57 in diff-computer.ts, which are responsible for handling the special cases.

If this were really a workaround for some jsdiff behavior, it can then be concluded that this behavior is still present in jsdiff; otherwise, the tests would not fail.

But it could also be supposed that it might have been an attempt to replicate some VS Code behavior in Theia.

Looking at the git history, the special cases in the code were there from the very beginning (https://github.com/eclipse-theia/theia/commit/8ab7d1d19e24e21b10ff19668a3b12aa9906d250#diff-1733db5110f7ae166c6649af4f89c6f2a17c25819ad5f74671c182642449f734).

It would be great if @vinokurig as the author of the original code could comment on the intent, but if that is not possible, we can only guess about the actual reason why these special cases are there.

However, objectively we have some very specific code that is covered with tests that are starting to fail when that code is removed.

At the risk of repeating myself, my main point is that if you really think that it would be safe to remove these special cases from the code, the current implementation would become:

    computeDirtyDiff(previous: ContentLinesArrayLike, current: ContentLinesArrayLike): DirtyDiff {
        const changes: Change[] = [];
        const diffResult = this.computeDiff(previous, current);
        let currentRevisionLine = -1;
        let previousRevisionLine = -1;
        for (let i = 0; i < diffResult.length; i++) {
            const change = diffResult[i];
            const next = diffResult[i + 1];
            if (change.added) {
                // case: addition
                changes.push({ previousRange: LineRange.createEmptyLineRange(previousRevisionLine + 1), currentRange: toLineRange(change) });
                currentRevisionLine += change.count!;
            } else if (change.removed && next && next.added) {
                // case: modification
                changes.push({ previousRange: toLineRange(change), currentRange: toLineRange(next) });
                currentRevisionLine += next.count!;
                previousRevisionLine += change.count!;
                i++; // consume next eagerly
            } else if (change.removed && !(next && next.added)) {
                // case: removal
                changes.push({ previousRange: toLineRange(change), currentRange: LineRange.createEmptyLineRange(currentRevisionLine + 1) });
                previousRevisionLine += change.count!;
            } else {
                // case: unchanged region
                currentRevisionLine += change.count!;
                previousRevisionLine += change.count!;
            }
        }
        return { changes };
    }

which (IMO) is no less simple or correct than the implementation you have proposed, but allows for the entire DirtyDiff structure to remain read-only for clients from the API point of view.

I shall probably refrain from further comments as I don't think I can really add anything more to this discussion. I will trust the committers doing the right thing. Thank you very much for your attention and patience.

eclipse-theia / theia