algesten / jsondiff

Structural JSON diff/patch tool.

Object in array might be diffed wrongly by JsonDiff at random #14

Closed: gewaski closed this issue 10 years ago

gewaski commented 11 years ago

Hi algesten, thanks for your jsondiff tool. I use it a lot and it works well, but some cases randomly produce wrong results. Could you have a look?

Json1: { "attributes": [ { "name": "k2", "value": "k2v2", "sequence": 1 }, { "name": "k6", "value": "k6v1", "sequence": 1 }, { "name": "k7", "value": "k7v1", "sequence": 1 }, { "name": "k5", "value": "k5v1", "sequence": 1 } ] }

Json2: { "attributes": [ { "name": "k1", "value": "k1v1", "sequence": 1 }, { "name": "k2", "value": "k2v1", "sequence": 1 }, { "name": "k3", "value": "k3v1", "sequence": 1 }, { "name": "k4", "value": "k4v1", "sequence": 1 }, { "name": "k5", "value": "k5v1", "sequence": 1 } ] }

Diff: { "attributes[+0]": { "name": "k1", "value": "k1v1", "sequence": 1 }, "~attributes[-1]": { "value": "k2v2" }, "~attributes[0]": { "name": "k6" }, "~attributes[1]": { "value": "k3v1" }, "~attributes[2]": { "name": "k4", "value": "k4v1" } }
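
For anyone trying to reproduce this, here is a minimal round-trip check. It is a sketch assuming the static JsonDiff.diff / JsonPatch.apply entry points from the README (with Gson or Jackson on the classpath); the Issue14Repro class name is just for illustration:

import foodev.jsondiff.JsonDiff;
import foodev.jsondiff.JsonPatch;

public class Issue14Repro {
    public static void main(String[] args) {
        String json1 = "{\"attributes\":[{\"name\":\"k2\",\"value\":\"k2v2\",\"sequence\":1},"
                + "{\"name\":\"k6\",\"value\":\"k6v1\",\"sequence\":1},"
                + "{\"name\":\"k7\",\"value\":\"k7v1\",\"sequence\":1},"
                + "{\"name\":\"k5\",\"value\":\"k5v1\",\"sequence\":1}]}";
        String json2 = "{\"attributes\":[{\"name\":\"k1\",\"value\":\"k1v1\",\"sequence\":1},"
                + "{\"name\":\"k2\",\"value\":\"k2v1\",\"sequence\":1},"
                + "{\"name\":\"k3\",\"value\":\"k3v1\",\"sequence\":1},"
                + "{\"name\":\"k4\",\"value\":\"k4v1\",\"sequence\":1},"
                + "{\"name\":\"k5\",\"value\":\"k5v1\",\"sequence\":1}]}";

        // Diff the two documents, then apply the diff back to json1.
        // Per this report, the round trip does not reproduce json2.
        String diff = JsonDiff.diff(json1, json2);
        String patched = JsonPatch.apply(json1, diff);
        System.out.println("diff:    " + diff);
        System.out.println("patched: " + patched);
    }
}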

algesten commented 11 years ago

It's tricky. Basically the adjustArrayIndexes() in combination with calcAdjustedIndex() does the wrong thing after the initial array object insertion which causes buildMutationList() to fail matching up insert nodes with delete nodes. Need to think about this one.

gewaski commented 11 years ago

Thanks for your quick response! I think JsonDiff could use a protection mechanism to ensure correctness:

  1. some quick way to validate that the generated diff, applied to src1, actually yields src2.
  2. if step 1 fails, fall back to generating another diff that is less compact but more stable. For example, in the case above, storing the whole "attributes" section as one diff might be a good "fallback" (see the sketch after this list).

Of course JsonDiff should try its best to stop before step 2, but the possible risk cannot be afforded in some production software, I think...
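
A rough sketch of what such a protection mechanism could look like, written as a hypothetical wrapper around the public API (not part of jsondiff; assumes modern Gson for tree comparison, and the "-key" delete syntax from the README, whose value is ignored on apply):

import java.util.Map;

import com.google.gson.JsonElement;
import com.google.gson.JsonNull;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

import foodev.jsondiff.JsonDiff;
import foodev.jsondiff.JsonPatch;

public class SafeDiff {
    public static String diff(String from, String to) {
        String diff = JsonDiff.diff(from, to);

        // Step 1: validate by round trip, comparing parsed trees so key
        // order and whitespace don't matter.
        JsonObject want = JsonParser.parseString(to).getAsJsonObject();
        JsonObject got = JsonParser.parseString(JsonPatch.apply(from, diff)).getAsJsonObject();
        if (want.equals(got)) {
            return diff;
        }

        // Step 2: fallback - replace each top-level key wholesale and delete
        // keys that only exist in the source. Bigger, but trivially correct.
        JsonObject src = JsonParser.parseString(from).getAsJsonObject();
        JsonObject fallback = new JsonObject();
        for (Map.Entry<String, JsonElement> e : want.entrySet()) {
            fallback.add(e.getKey(), e.getValue());                // plain key: set/replace
        }
        for (Map.Entry<String, JsonElement> e : src.entrySet()) {
            if (!want.has(e.getKey())) {
                fallback.add("-" + e.getKey(), JsonNull.INSTANCE); // "-key": delete
            }
        }
        return fallback.toString();
    }
}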

nachogmd commented 11 years ago

Hi, I think the problem is that the diff algorithm is trying to infer too much from the original json, specifically moving indexes of arrays that don't hold enough information to do it in a deterministic way. Actually, this test case fails not only for the diff but also when the "expected" patch is applied to the original json.

IMHO the second part of the test should also fail, since it seems to me there is an ambiguity about whether the SET operation should be applied to the result of the INSERT operation or to the original object.

Imagine an even worse scenario: arrays that have duplicate values. It could turn into a nightmare!

I don't see any other way to have deterministic behaviour than using SET operations on arrays iff sizes are equal (when generating automatic diffs, of course, not manual patches).

I understand this would have a great impact on the current code base, so I am willing to help if needed :)

algesten commented 11 years ago

Not sure I quite agree with that argument. We don't really care about duplicate values; we only need a set of operations that transforms array a -> array b. This is also how the underlying diff works.

The operation order is defined at https://github.com/algesten/jsondiff#instruction-order: INSERT happens after SET to ensure indexes don't move around.

Have a dig around in the code – I added you as a collaborator – and try to think of a way that we can make the transform work (rather than brute-force SET).

nachogmd commented 10 years ago

Hi there, I've recently been thinking about using jsondiff extensively as a reverse patch traceability system.

To be able to do that I must make sure that jsondiff behaves consistently in every scenario. I am hitting more and more NPEs in fixHalfDeletedArrs as the complexity of the json tree increases. This has made me think more deeply about this issue, and I believe the main problem is that it is not possible to use an arbitrary hashing function to determine whether an object has changed. This leads, IMHO, to a satisfying solution for any scenario by applying the following rule:

Only consider inserting an element on the target when the adjacent elements on target are exactly the same as on origin.

If the user provides a sophisticated hashing function that knows the semantics of the graphs being diffed, this rule can safely be relaxed to:

Only consider inserting an element on the target when the adjacent elements on target provide the same hash as on origin.

In the "crafted" failing example this assertion will produce an automatic diff which is much "uglier" than the proposed one but, IMO, it is a lot more than enough that the proposed patch produces the expected output when applied rather than producing the smallest possible automatic diff.

What are your thoughts?

algesten commented 10 years ago

I guess it comes down to what you mean by this:

> when the adjacent elements on target are exactly the same as on origin.

Or more specifically, what does "exactly the same" mean in terms of hashing? Are we talking a shallow or deep hash?

The principle of the diff is to "flatten" a (normalized) JSON object into something that can be compared. This is what findLeaves does. Conceptually I think of it like this:

Original structure:

 {a:[1,[2,3]]}

Flattened to (very free form):

  {a: [1, [2,3]]   // object key "a" 
  [1, [2,3]]       // begin array
  [0]: 1           // array index 0
  [1]: [2,3]       // array index 1, begin array
  [0]: 2           // array index 0
  [1]: 3           // array index 1

The incavadiff is a bog standard text diff algorithm, which we can use by having flat documents that can be thought of as "plaintext". In practice the flat documents are arrays and look like this:

[
  LEAF<{"a":[1,[2,3]]}_1454189616>,
  LEAF<[1,[2,3]]_1665342388>,
  [0,d0,i0,0]LEAF<1_86006477>, 
  [1,d0,i0,1]LEAF<[2,3]_139496081>,
  [0,d0,i0,0]LEAF<2_29411217>, 
  [1,d0,i0,1]LEAF<3_29411218>
]

If we have two such flattened documents a and b, and compare them, we get a diff that effectively is a bunch of operations to mutate a to b. This set of instructions is always correct. However the problems occur when we try to translate the incava instruction set to the JSON mutation syntax form.

This also means that when comparing documents a and b, we don't want to deep hash or consider array indexes. If we did, the incavadiff would pretty much always have wildly different flat documents to consider, which means the transformation instructions would be enormous (one design goal is obviously to keep the number of transformation instructions down).
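
A tiny illustration of the shallow-vs-deep distinction (simplified, not the actual findLeaves code): with a deep hash, a nested change would ripple up and alter the hash of every enclosing container leaf, so the two flat documents would differ almost everywhere.

import java.util.List;
import java.util.Map;
import java.util.Objects;

public class ShallowHash {

    // Containers contribute only a type marker to their own leaf hash, so an
    // edit deep inside an array changes just the scalar leaf it touches, not
    // every enclosing container leaf.
    static int shallowHash(Object node) {
        if (node instanceof Map)  return 1;   // all objects look alike
        if (node instanceof List) return 2;   // so do all arrays
        return Objects.hashCode(node);        // scalars hash by value
    }
}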

So, back again: what do you consider "exactly the same"?

nachogmd commented 10 years ago

Hi again, and thanks for such great insight into the real problem ;)

I've been working on a new algorithm which seems to work pretty well in all existing test cases. It basically makes Leaf hold a double structure: the original children and the "merged with target" children. I had a hard time handling orphan children correctly... All that's left to do is check for possible cancellations (obj nodes) and the patch is served :)

But now I have another suggestion for helping to create the patch and, more specifically, helping to apply it. Would you consider modifying the patch structure so that a node holds an array of the changes on that node instead of using an object? This way we don't have to rebuild indexes for the patch to order the instructions correctly, and even better, we don't have to order anything when applying the patch.

Thoughts?

nachogmd commented 10 years ago

Sorry, I should have provided an example...:

var orig = {a: [0, 1, 2, { me: "too" }]};
var patch = {
    a: {
        "~": [
                "+0":2
                "1":2
                "-2":0
                "~3": {me: 
                        {~ : [
                            "+too": "three"
                        ]}
        ]}
};

yields (simplifying):

a+0: [2,0,1,2, {me:"too"}]
a1:  [2,2,1,2, {me:"too"}]
a-2: [2,2,2, {me:"too"}]
~3:
    ~me: {me: {too: "three"}}
[2,2,2, {me: {too: "three"}}]

It can be seen as "operations to be done on this node".

algesten commented 10 years ago

I have no problem with the syntax changing, and what you're suggesting sounds like it could be made by just extending the current syntax.

Currently the following modifier is pointless:

"~key":    [ "replaced" ]  // whole array added/replaced (~ doesn't matter for whole array)

(it's the same as "key": ["replaced"])

So we could use that form to introduce your "ordered instruction set". Your new patch could look like this (not omitting stuff like you did ;) ):

var patch = {
    "~a": [
        {"+0": 2},
        {"1": 2},
        {"-2": 0},
        {"~3": {"~me": [
            {"+too": "three"}
        ]}}
    ]
};

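For what it's worth, a minimal sketch of why the ordered form pays off on the apply side (illustration only, not the jsondiff implementation; the Op and Instruction types are hypothetical): applying becomes a single in-order walk that mutates the live list, with no index rebuilding before or after.

import java.util.List;

public class OrderedApply {

    enum Op { INSERT, SET, DELETE }  // the "+i", "i" and "-i" modifiers

    record Instruction(Op op, int index, Object value) {}

    static void apply(List<Object> array, List<Instruction> instructions) {
        for (Instruction ins : instructions) {
            switch (ins.op()) {
                case INSERT -> array.add(ins.index(), ins.value());
                case SET    -> array.set(ins.index(), ins.value());
                case DELETE -> array.remove(ins.index());
                // a "~i" modifier would recurse into array.get(i) instead
            }
        }
    }
}
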
nachogmd commented 10 years ago

Yes indeed... After struggling with creating the patch I noticed what you said; my previous suggestion is impossible to achieve. I am pushing changes to the gwt branch so you can take a look and see for yourself ;)

algesten commented 10 years ago

Can't get the tests to run there.

$ mvn test
...
[WARN] Unable to create new cache log file /Users/martin/dev/jsondiff/${webAppDirectory}/../gwt-unitCache/gwt-unitCache-00000142786A308B.

and it creates some strange dirs:

LICENSE.txt          ${webAppDirectory}/  pom.xml              target/
README.md            gwt-unitCache/       src/                 war/

nachogmd commented 10 years ago

It seems to be a known issue:

http://code.google.com/p/google-web-toolkit/issues/detail?id=6443

Good news is that there is a (hopefully) simple workaround that I just pushed!

Try it now, please. I'm omitting the patch creation (it's not working) and leaving the printing of the resulting graph in my commit. There are also some duplicate instructions which should be taken care of, most probably coming from attaching orphan children. I am too tired now to keep checking, but would love a second opinion ;)

nachogmd commented 10 years ago

Hi again,

The new algorithm is working like a charm and the new patch syntax seems also nice to me. Changes I made:

  1. Created new classes for the ones that were inlined - it helps me give replacements in gwt for some limitations it has
  2. Diff tests don't check against a provided diff - too much work to change them all - but rather check that the patch against the original object produces the expected result
  3. Adapted JsonDiff and JsonPatch for these changes

Unfortunately the JsonPatchTest is rather obsolete at this point. I disabled it for the time being.

Have a nice review ;)

algesten commented 10 years ago

Hi there,

First let me just say I am very happy that you're moving this project forward and investing time in it. My comments below may sound negative, and I totally understand if you don't have time to address them.

I still haven't been able to run the tests. A design goal was always to be able to just git clone and mvn test. The first test hangs forever and then explodes.

Running foodev.jsondiff.JsonDiffGWTTest
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "btpool0-1"
Tests run: 61, Failures: 0, Errors: 61, Skipped: 0, Time elapsed: 107.836 sec <<< FAILURE!
testNestedChangeAddAfter(foodev.jsondiff.JsonDiffGWTTest)  Time elapsed: 107.675 sec  <<< ERROR!
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.ja

> Created new classes for the ones that were inlined - it helps me give replacements in gwt for some limitations it has

I don't mind having them in explicit files, but could you elaborate on those GWT limitations? Having duplicates in the resources catalogue seems very hard to maintain to me. Is there any way we can avoid this? It seems to mainly concern the clone methods.

> Diff tests don't check against a provided diff - too much work to change them all - but rather check that the patch against the original object produces the expected result

I'm against dropping the strictness of the old test suite. Those diffs were carefully checked (by hand) to ensure they made sense. I had several cases during development where the diffed patches would produce the correct end result, yet were large and silly. In Eclipse I enable "escape when pasting into string literal" and just quickly double check.

> Adapted JsonDiff and JsonPatch for these changes
>
> Unfortunately the JsonPatchTest is rather obsolete at this point. I disabled it for the time being.

JsonPatchTest should work. It's a sense check that the syntax on the README.md works as documented and expected. There are people using this package that don't diff to produce their patches, but rather just author the patch syntax directly.

nachogmd commented 10 years ago

Hi again,

You are not sounding negative at all. I totally agree with everything you said. My aim is to roll out a release, so these issues must be tackled. I've been thinking about a possible roadmap and here it is. If you agree, I think I'll find the time this weekend to get it going ;)

  1. Tag current master - rolling a release wouldn't be possible since tests are not working
  2. Make a maven multiproject and change the version number
  3. Separate gson from jackson
  4. Drop static usage and use prototypes
  5. Drop gwt support for the time being - there are some serious issues with hashing that must be thoroughly investigated
  6. Document the new patch syntax
  7. Fix original diff tests
  8. Fix original patch tests
  9. Hopefully, roll out a release :)

I think for point 6 you are better qualified than me and I can imagine you'll easily find time to do it ;)

Some help with number 8 would also be appreciated.

I am starting point 1 and opening a new branch to show you this work in progress!

Best regards,

nachogmd commented 10 years ago

What would you suggest as the new version number? It's not going to be backwards compatible, so I suggest jumping to 2.0.

nachogmd commented 10 years ago

Things were a lot easier than I thought. All tests are now passing!

Take a look at the changes I had to make: check out branch 2.0. It's pretty easy to follow just by comparing the tests.

I must say that this syntax seems more natural to me, especially in complicated trees.

Closing this issue as it is now solved ;)