flipkart-incubator / zjsonpatch

This is an implementation of RFC 6902 JSON Patch written in Java
Apache License 2.0
524 stars 150 forks source link

Memory issue with 0.4.0+ #77

Closed ov7a closed 6 years ago

ov7a commented 6 years ago

It seems that In newer versions (0.4.0+) JsonDiff consumes significantly more memory. Most probably this is related to #60 and getLCS method.

For version 0.4.4:

An exception or error caused a run to abort: Java heap space 
java.lang.OutOfMemoryError: Java heap space
    at com.flipkart.zjsonpatch.InternalUtils.longestCommonSubsequence(InternalUtils.java:31)
    at com.flipkart.zjsonpatch.JsonDiff.getLCS(JsonDiff.java:446)
    at com.flipkart.zjsonpatch.JsonDiff.compareArray(JsonDiff.java:337)
    at com.flipkart.zjsonpatch.JsonDiff.generateDiffs(JsonDiff.java:324)
    at com.flipkart.zjsonpatch.JsonDiff.compareObjects(JsonDiff.java:425)
    at com.flipkart.zjsonpatch.JsonDiff.generateDiffs(JsonDiff.java:327)
    at com.flipkart.zjsonpatch.JsonDiff.asJson(JsonDiff.java:47)
    at com.flipkart.zjsonpatch.JsonDiff.asJson(JsonDiff.java:38)

I'm comparing two quite large jsons, around 700 Kb each, with around 300Mb of memory. It is caused by comparing two arrays with around 40 000 of strings each.

Version 0.3.10 works just fine.

LouizFC commented 6 years ago

If I am correct 0.4.0 is the update that removed Apache Commons.

The way @vishwakarma implemented the current algorithm uses an array for collecting the diffs and then produces the result.

If I am not wrong, the LCS algorithm from Apache uses a kind of cursor called "Snake" that accumulates the result just by transversing the two lists without creating extra objects, just adding a counter.

I will take a look in the 0.3.10 and the current version using a profiler. Do you have some dummy data so I can replicate it?

ov7a commented 6 years ago

I've made a MWE in gist.

vishwakarma commented 6 years ago

@ov7a Agreed, it can because of the nature of LCS algorithm ( internally used in case of arrays ) due to its runtime & spacetime complexity. I will work on it with your example as benchmark to improve it. @LouizFC , kindly let me know if you need any assistance here on this, meanwhile i will be also working on it and will get in touch with for the alternates. @ov7a Thank you for bringing this issue.

LouizFC commented 6 years ago

@vishwakarma I did not have the time for profiling it yet, but I checked the Apache implementation (more importantly, this) and it uses two arrays as "buffers" instead of a matrix (our implementation). Maybe thats the cause? I will investigate it further.

I had just a quick look, so this is more a guess than anything else. I will try to take a serious look at this on the weekend.

LouizFC commented 6 years ago

The problem is more severe than I though.

But first, a brief disclaimer: I rarely use profilers. In this case I used VirtualVM. So if the results are a bit off I beg your pardon.

0.3.10

@vishwakarma What are your thoughts. Should we try to reintroduce the Apache Commons dependency or solve it with our own code?

In my opinion, we should reintroduce Apache Collections for now. I was digging into Apache Collections and the way they produce their EditScript is really fascinating, we could learn a lot from there and apply the same techniques in the rest of our library in the future.

Edit: Fixed Markdown for images.

Edit2: I tested some more and the 10 MB difference from 0.3.10 and 0.4.4 (with Apache) is due to the better string concatenation and the precompiled Regexes that were introduced

vishwakarma commented 6 years ago

I agree, let's add Apache library for now to unblock the users and fix the reported issue. That will buy us some time to look into Apache implementation and fix with our own code. Please add the apache library back, I will review and test it. Thanks

On Mon, Aug 20, 2018, 9:07 AM Luiz Felipe da Cruz Oliveira < notifications@github.com> wrote:

The problem is more severe than I though.

But first, a brief disclaimer: I rarely use profilers. In this case I used VirtualVM. So if the results are a bit off I beg your pardon. 0.3.10

0.4.4

@vishwakarma https://github.com/vishwakarma What are your thoughts. Should we try to reintroduce the Apache Commons dependency or solve it with our own code?

In my opinion, we should reintroduce Apache Collections for now. I was digging into Apache Collections and the way they produce their EditScript is really fascinating, we could learn a lot from there and apply the same techniques in the rest of our library in the future.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/flipkart-incubator/zjsonpatch/issues/77#issuecomment-414188768, or mute the thread https://github.com/notifications/unsubscribe-auth/AB-YqbXl34RjjuWbDGUDTgJBsSxPfZaJks5uSi7ogaJpZM4V95xU .

vishwakarma commented 6 years ago

Fixed by PR #78