chbrown / rfc6902

Complete implementation of RFC6902 in TypeScript
https://chbrown.github.io/rfc6902/
322 stars 39 forks source link

limit depth of generated patches #29

Closed diachedelic closed 6 years ago

diachedelic commented 6 years ago

I am creating patches to represent changes made to ~100KB GeoJSON objects. I tried every single module I could find which generates rfc6902-compliant patches, but only chbrown/rfc6902 gave sane results for my use case (due to the recursion in diffAny()). However, I need a way to limit the depth of the patches.

For example, suppose I had a GeoJSON LineString, which stores the coordinates of each point (lat/lng) in its own array:

{
  geometry: {
    type: 'LineString',
    coordinates: [
      [0, 0],
      [1, 1],
      [2, 2],
      [3, 3],
    ],
  },
}

Now I modify it by changing a few coordinates:

{
  geometry: {
    type: 'LineString',
    coordinates: [
      [0, 0],
      [2, 1],
      [1, 3],
      [4, 3],
    ],
  },
}

I run createPatch, and I get this:

[ { op: 'replace', path: '/geometry/coordinates/1/0', value: 2 },
  { op: 'replace', path: '/geometry/coordinates/2/0', value: 1 },
  { op: 'replace', path: '/geometry/coordinates/2/1', value: 3 },
  { op: 'replace', path: '/geometry/coordinates/3/0', value: 4 } ]

At this small scale, there is no problem. But when these LineStrings contain hundreds of coordinates, each coordinate must be diffed against each other, which is expensive, and also creates large patches (as there are many operations and each has a long pointer). In a real world benchmark, it took over a second to create a patch, and 6 seconds to apply!

I do however think I have figured out a solution, which involves adding a depth parameter to createPatch(), preventing excessively detailed patches. If I set this hypothetical depth parameter, I get this patch instead:

[ { op: 'replace',
    path: '/geometry/coordinates',
    value: [       
      [0, 0],
      [2, 1],
      [1, 3],
      [4, 3],
    ] } ]

Not only is this patch more succinct, but it better reflects the reality of how I'm altering LineStrings - I am not modifying individual points as such, I'm changing a line in its entirety. In a real world scenario, this more than halves the size of the patch and computes near-instantly.

I am planning to implement this on my fork (it's not a very big change), and create a test case.

chbrown commented 6 years ago

First off, glad to hear rfc6902 is smarter than the competition šŸ˜„ (at least on your data)

I understand your problem, and agree that the patches you speculate are nicer and semantically saner, but I'm worried that a single maximum depth parameter is simplistic, and not the right solution.

For example: (if I recall correctly) GeoJSON can have nested structures, where shapes contain shapes which contains shapes, etc. What you might want in that case is for a "coordinate" tuple to be treated as an atomic entity, no matter how deep it occurs. If rfc6902 can handle that somehow (see solution 2. below), that seems preferable.

I see two potential solutions with a broader vision:

  1. Refactor cost calculations in diffArrays and diffObjects (and aggregate rootward/higher up the tree structure) to prefer fewer replace operations despite a larger value payload.
    • This would be more automatic, no additional parameter required. But it might be too smart for its own good, incurring (a lot of) additional complexity while not necessarily handling every user's use-case.
  2. Implement an additional parameter that allows the user to pass in custom diffMyTypes functions that short-circuit the usual recursion on certain conditions. In your example, you could check the that last token in the provided Pointer is "coordinates" and intercept the default diffArrays handling, always returning a replace operation if the two given children of coordinates are not equal.
    • I'm not exactly sure how this would work out, but I think it's more doable and in the end more flexible than solution 1.
    • It would also handle cases like #15.
diachedelic commented 6 years ago

I like the idea of providing a function to check the recursion, I have already run into a situation where the depth param truncated a part of my patch I hadn't anticipated. I suppose the diffMyTypes function should just take a single param, ptr?

chbrown commented 6 years ago

Hmm, I haven't thought it through all the way; my first sense was that you'd pass it the input and output objects, too, just like all the other diff* functions. If you just gave it a single argument, ptr: Pointer, then I guess it would have to return another function, which in turn returned Operation[] like the existing diff* functions? But then what's the point of the indirection? I might be convinced to use such an abstraction, but don't immediately see the benefit...

diachedelic commented 6 years ago

I don't understand why this function would be returning Operation[] - would it function as some sort of replacement/intercept for diffAny()? I do understand it would be nice to return something more interesting than a boolean.

diachedelic commented 6 years ago

Maybe i get it...if I wanted to restrict the depth of my diff, I would write something like this:

import {diffValues, diffAny} from 'rfc6902/diff'

createPatch(input, output, function(input, output, ptr) {
  if (ptr.tokens.length >= 4) {
    return diffValues(input, output, ptr)
  }
  return diffAny(input, output, ptr)
})
chbrown commented 6 years ago

Yeah, exactly! Or maybe createPatch would defer to diffAny if the given custom diff* parameter returned anything besides an Array (e.g. undefined), so that you (the user) wouldn't have to worry about importing and calling it (and you'd have to pass in the custom function, which would be annoying if you were using => function syntax)

As another example, here's how you might stop without recursing deeply into "coordinates":

import {diffValues} from 'rfc6902/diff'

createPatch(input, output, (input, output, ptr) => {
  if (ptr.tokens[ptr.tokens.length - 1] == 'coordinates') {
    return diffValues(input, output, ptr)
  }
})
diachedelic commented 6 years ago

Looks good to me - what would this function be called? diffMyType?

chbrown commented 6 years ago

Good question. ~diffOverride~? ~diffDispatch~? diffCustom?

Since it's a function argument, the API is purely positional, so the name isn't terribly important ā€” it can change later without breaking the API.

Do you want to take a crack at it and update PR #30? This solution should look quite similar, but with a little extra logic needed in createPatch. The default value should be a function that returns undefined. And as I remarked at https://github.com/chbrown/rfc6902/pull/30#issuecomment-397038544:

Also, this new parameter should have defaults for all exported functions, to maintain API compatibility.

diachedelic commented 6 years ago

How about diffNode? Seeing as it is used to diff every single node in the object tree, and its default value is a noop, which is still executed on every node (to no effect).

chbrown commented 6 years ago

Hrrmm. By that reasoning, which is sound, that's what I should have named diffAny. For parity, maybe preDiffAny? tryDiffAny? optionalDiffAny?

I also have a weak preference to avoid the name "node" in my Node.js projects, to make full text search a little less ambiguous.

diachedelic commented 6 years ago

tryDiffAny makes the most sense internally, but the least sense for a casual user looking at createPatch.

diachedelic commented 6 years ago

Also, internally, tryDiffAny doesn't makes sense, as by default it is diffAny

chbrown commented 6 years ago

Yeah I didn't consider using diffAny as the default for this new parameter, nor wrapping it in a closure in createPatch (which seems quite elegant!), and I'm trying to think if there'd be any conflicts / loopholes / edge-cases in not knowing whether diffNode is a custom diff function or the built-in diffAny (e.g. infinite loops, or failing to percolate the custom diff function throughout the tree traversal).

Now that I've seen your implementation, I'm of the opinion that simply diff might be the most appropriate name for this new argument.

diachedelic commented 6 years ago

@chbrown I've updated my pull request, please let me know if there's anything else you'd like changing (or tests added?)

chbrown commented 6 years ago

Ok, I merged it with some changes: primarily naming , and dropping 176c48c ā€” which I'll handle separately. Thanks!

diachedelic commented 6 years ago

Cheers, thanks for the quick publish