dolthub / dolt

Dolt – Git for Data
https://www.dolthub.com
Apache License 2.0
18k stars 516 forks source link

Fix issue where JSON diff would fail under specific circumstances. #8534

Closed nicktobey closed 3 weeks ago

nicktobey commented 3 weeks ago

If all of the following are true during a JSON diff operation:

  1. One document fits in a single chunk and the other doesn't
  2. The location of the chunk boundary in the larger document is also present in the smaller document. (It doesn't correspond to a location that was added or removed)
  3. The chunk boundary falls at the end of a value in the document (before the next key or comma/right brace/etc)

Then the differ would fail to advance the prolly tree cursor and would incorrectly see the larger document as corrupt.

This fixes that issue and improves the error messaging to make it seem less like the database is corrupt, since the error is much more likely to be a parsing bug like this one.

nicktobey commented 3 weeks ago

To explain why this happened and why this PR fixes the issue:

The JSON differ sits on top of a prolly tree differ, fetching blocks from the two prolly trees and scanning them while keeping track of the current location in both documents. The JSON scanner takes a reference to the underlying prolly tree cursor and advances it if it hits the end of a block: this is a code smell because it breaks encapsulation for the prolly tree differ, but it was the best way to keep both data structures in sync.

In the event that one or both documents fit entirely within a single block, we take a slightly different code path: we don't need to use the prolly tree differ anymore, so we create the JSON scanner without taking a reference to the prolly tree cursor.

The problem arose when one of the documents fit in a single block while the other one didn't. In that case, we still didn't need to use the prolly tree differ, so we would instantiate both scanners without taking a reference to the differ's cursors. However, if the JSON scanner hit a chunk boundary in the larger document while it wasn't in the middle of computing a diff (both documents were the same at the location where the chunk boundary occurred), we would need to fetch the next block of the larger document, and we would use the cursor originally created for the prolly tree differ to do so. Since the JSON scanner never took a reference to this cursor, it never advanced it, and the cursor still pointed to the beginning of the document. The scanner would subsequently fetch the first block in the document while it was expecting the next block and try to resume scanning from there: the scanner would quickly detect that it was reading unexpected values and would incorrectly assume that the underlying storage was corrupt.

The workaround in the PR is to make sure that if only one of the documents being diffed fits in a single chunk, we still create a JSON scanner for the other document that has access to the original underlying cursor and can advance it. I'm not 100% satisfied: a cleaner solution might be to avoid referencing that cursor at all if we're not using the differ, but that would make the JSON differ's already complicated state machine even more complicated.

Fixing this has me a bit worried that there might be other bugs in here. In particular I'm not convinced that we always do the correct thing when only conditions (1) and (2) above hold, but I'd need to look at it more and try to come up with an exapmle case.

coffeegoddd commented 3 weeks ago

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
ad631c6 ok 5937457
version total_tests
ad631c6 5937457
correctness_percentage
100.0