Max call stack size exceeded on large sets

lencioni commented 8 years ago

Hi! Thanks for releasing adiff! I am using it to do some image diffing over in Happo, and have noticed that on some large data sets we are exceeding the max call stack size. It looks like this is coming from within adiff itself.

https://github.com/Galooshi/happo/issues/154

My hunch is that it is related to the recursive nature of the lcs function, but I haven't spent much time with the adiff code so I could be way off. Any thoughts?

cc @trotzig

dominictarr commented 8 years ago

Oh, this is surely not the right algorithm to use for images. It's intended for diffing text. You can model text as a list, and represent diffs as insertions or deletes - but that doesn't really make sense for images.

I think for images, I'd just do something like subtract one from another, pixel by pixel, but I bet there is probably modules for that. Really, i'd suggest that happo detects whether a file is an image and then uses a different algorithm for that file.

lencioni commented 8 years ago

We've found that the LCS diff algorithm works well for our images. We have each row of pixels as an array of integers representing the colors. We then convert each row to a string and feed it through the diff, which works quite well. This allows us to see rows that were added, changed, or removed, which really helps keep the signal high in the diffs.

dominictarr commented 8 years ago

Sure, but how often are you gonna take an image and insert or delete a row of pixels? 99.99% of the time you are gonna overwrite pixels instead, that means you could diff an image with a much more efficient algorithm. Sure, LCS will still find that, it just won't do it very efficiently.

I would be happy to take a PR that optimizes this module, such as making it iterative instead of recursive, but it's probably gonna be less efficient than a dedicated image differ.

Hmm, I wonder if you could handle a special case where it would detect if all updates where overwrites cheaply, and then check whether that also gave the correct Longest Common Subsequence?

lencioni commented 8 years ago

We are using this to test UI elements. In practice, adding or deleting rows happens pretty frequently. E.g. if you tweak some margins or paddings, or add some content to the top of a component.

I would be happy to take a PR that optimizes this module, such as making it iterative instead of recursive, but it's probably gonna be less efficient than a dedicated image differ.

Agreed! I'd like to see how close we can get with a general purpose tool before going down that route. Thanks for being open to PRs to optimize. This isn't currently our highest priority, so we likely won't be sending anything your way immediately, but I anticipate that once we work through our list of things we want to do we'll start digging into optimizing over here.

Hmm, I wonder if you could handle a special case where it would detect if all updates where overwrites cheaply, and then check whether that also gave the correct Longest Common Subsequence?

Possibly. I haven't spent much time reading the code here or digging into the specifics of the algorithm. If you have any cool ideas that you want to try out, I'd be happy to test them locally on our stuff and report back perf numbers.

dominictarr commented 8 years ago

oh, hang on, I think I just got some missing context - are you using this to diff screenshots of web pages? because then diffing pixel rows totally makes sense.

lencioni commented 8 years ago

Yeah, potentially pages, but typically smaller bits of pages, like components. Of course, you could have a component that composes a bunch of components into a page which you could run through Happo, but that probably won't be the majority use case. :)

lencioni commented 8 years ago

Also, more context if you care: we write the screenshots into canvas elements, read the image data (which comes out as a Uint8ClampedArray where each element is a red, green, blue, or alpha channel 0-255). We then convert this into a 2D array of Uint8ClampedArrays so it is broken up into rows. Before running it through adiff, we JSON.stringify each row array to help optimize, and then send the arrays of strings through adiff. When then use the info from adiff and the pixels from the before/after images to draw a diff image.

dominictarr commented 8 years ago

Thats interesting. one approach might be that you don't really need a the optimal longest common substring, a substring that is pretty long would be good enough.

maybe you could apply some heuristic, and still get a good result, like simply limiting the maximum recursion depth?

lencioni commented 8 years ago

Hi! I just wanted to close the loop on this. We ended up rolling our own, iterative LCS implementation.

https://github.com/Galooshi/happo/pull/158/commits/999cc2bde7e30f3864589851b6f313f875fe7a52

I think we are going to continue down that path and try to further optimize it for our use-case (e.g. limiting scanning to +/- 100 lines), so it is unlikely that we'll be submitting PRs here. Feel free to close this issue out or whatever.

Thanks for adiff! It helped us get to where we are now. Have a great day!

dominictarr commented 8 years ago

looks like good stuff! you should publish it as a module! happy to take a PR that is just a link to it.

lencioni commented 8 years ago

Thanks! That's likely what we will do, but we want to make sure it is stable before extracting it.

dominictarr commented 8 years ago

okay, great!

dominictarr / adiff

Max call stack size exceeded on large sets #7