Issues with Array#union and Array#intersect

andrewplummer / Sugar

A Javascript library for working with native objects.

https://sugarjs.com/

MIT License

4.53k stars 306 forks source link

Issues with Array#union and Array#intersect #120

Closed andrewplummer closed 12 years ago

andrewplummer commented 12 years ago

Moving this discussion over from twitter raised by @ermouth, original example is here: http://ermouth.com/fastArray/

First I want to say thank you for raising this. There are a few different issues here and I want to address them separately. First, here is the detailed result of performing a union operation on your test page:

// Array 1
[
    { eccbc87e4b5ce2fe28308fd9f2a7baf3: 3 },
    6,
    3,
    true,
    3,
    "a87ff679a2f3e71d9181a67b7542122c",
    { e4da3b7fbbce2345d7772b0674a318d5: 5 },
    /6/g,
    /0/g,
    function(){},
    "8f14e45fceea167a5a36dedd4bea2543",
    4,
    8,
    "1679091c5a880faf6fb5e6087eb1b2dc",
    true,
    [5, "e4da3b7fbbce2345d7772b0674a318d5"]
  ];

// Array 2
[
    false,
    3,
    7,
    8,
    function() {},
    /1/g,
    [6, "1679091c5a880faf6fb5e6087eb1b2dc"],
    2,
    7,
    function() {},
    [8, "c9f0f895fb98ab9159f51fd0297e236d"],
    function() {},
    5,
    1,
    "cfcd208495d565ef66e7dff9f98764da",
    /6/g
  ]

//  Union of array 1 and array 2 (your script)
[
    { eccbc87e4b5ce2fe28308fd9f2a7baf3: 3 },
    6,
    3,
    true,
    "a87ff679a2f3e71d9181a67b7542122c",
    { e4da3b7fbbce2345d7772b0674a318d5: 5 },
    /6/g,
    /0/g,
    function() {},
    "8f14e45fceea167a5a36dedd4bea2543",
    4,
    8,
    "1679091c5a880faf6fb5e6087eb1b2dc",
    [5, "e4da3b7fbbce2345d7772b0674a318d5"],
    false,
    7,
    function() {},
    /1/g,
    [6, "1679091c5a880faf6fb5e6087eb1b2dc"],
    2,
    [8, "c9f0f895fb98ab9159f51fd0297e236d"],
    5,
    1,
    "cfcd208495d565ef66e7dff9f98764da",
  ]

// Union of array 1 and array 2 (Sugar)
[
    { eccbc87e4b5ce2fe28308fd9f2a7baf3: 3 },
    6,
    3,
    true,
    "a87ff679a2f3e71d9181a67b7542122c",
    { e4da3b7fbbce2345d7772b0674a318d5: 5},
    /0/g,
    function() {},
    "8f14e45fceea167a5a36dedd4bea2543",
    4,
    8,
    "1679091c5a880faf6fb5e6087eb1b2dc",
    [5, "e4da3b7fbbce2345d7772b0674a318d5"],
    false,
    7,
    function() {},
    2,
    function() {},
    "c9f0f895fb98ab9159f51fd0297e236d",
    function() {},
    5,
    1,
    "cfcd208495d565ef66e7dff9f98764da",
  ]

First, as you pointed out there are a couple of errors in the Sugar version.

[8, "c9f0f895fb98ab9159f51fd0297e236d"] this element was found incorrectly in the resulting array separately, ie. it was brought into the result as separate elements instead of a nested array as it properly should have been.
Regexes found in the original arrays are not in the resulting array.

However, your script also had an error:

Functions found in the original array are not in the resulting array.

1 was due to a bug where Array#union was incorrectly flattening arrays being passed to it. I have added a unit test for this and fixed the problem.

2 was due to another, deeper issue with Array#unique, which I won't go into here but had farther reaching implications that could effect other methods by incorrectly interpreting regexes and functions found in arrays instead of performing a direct equal operation on them. It was a very good opportunity for cleanup, so I've added some tests here and fixed this as well.

This will fix any errors in these methods and will go out in the next release.

Now on to the biggest issue, which is the performance. Over twitter I had said that "this won't work for objects in arrays"... by that what I meant is that "this won't work for objects within arrays without some kind of an object hashing algorithm". And, in fact that is exactly what you have done for your test.

Now, I am aware of these methods of alleviating issues for O(n²) problems by leveraging hashes. In fact, Array#union and Array#intersect aren't the only ones that could benefit from this but in fact Array#unique most especially and others as well. However the problem is the cost involved.

First, this technique requires hashing objects based on their content, which does involve a performance cost. Whether or not that performance cost would outweigh the benefit gained is a question that can only be answered by benchmarking not only large arrays, but also various combinations of large objects and large arrays. I would argue that having large and complex JSON data structures is a much more common use-case than having massive arrays (at least on the scale of 10s of thousands and higher) and here is where I think the cost would start to come into play.

However, there is a bigger problem and that is the cost of introducing an external hashing algorithm, which is exactly what you've done in your example. I'm mostly worried about the added bulk to the codebase but also the problem of an external dependency that would have to be proven to work right. Lastly I would also be worried about this dependency should I decide to split up the library further, any methods that require unique would also require this dependency as well.

Finally there is the issue of hashing functions, which should never be considered "equal" unless they are equal by reference, and this is where the errors in your script are encountered.

So in summary, if all things were equal I would love to be able to do O(n log n) and gain that much more speed on array methods like this, but I don't think it should be done by adding a new large dependency, one that may actually hinder performance on large objects, and has a clear weakness when considering functions as well.

I'm interested in more feedback and if there's another angle or another way in which this could be implemented, I'm open to it.

In the meantime your test has already improved Sugar in the errors you've uncovered so thank you much for that!

ermouth commented 12 years ago

About hashing:

I can implement it without hashing -- but the code will be little more complicated. We just need to create separate subobjects for functions, regexps, numbers, strings and so on and write pairs "value":%value% into appropriate subobject, not in common hash-array in the manner of "md5(value)":value.
But I think it'll be a good approach to add md5 into Sugar ))))

About errors in my script. Please, explain why do you think union must save all instances of textually equivalent functions as separate instances.

You do assume child arrays and objects equivalent by their values are the same and remove duplicates -- but they are not equal by reference.

Why the same behaviour applied to functions seem erroneous to you?

andrewplummer commented 12 years ago

I'm not sure what you mean by "subobjects" but it sounds like we're moving toward a JSON.stringify. This is what I was thinking about the moment you mentioned this method, and the reason it will be so tricky, as JSON.stringify would have to be shimmed as well as it does not exist in all browsers. Other issues will exist as well as you mentioned in the other ticket.
As a matter of fact there is another open ticket that you already commented on. Right now keeping Sugar from becoming too bloated is the name of the game. There are many areas where I already want to start trimming. Now that Sugar can be split down and extended I'm giving more attention to the possibility of adding md5, but this is as an extension. Adding it for fundamental array methods like these is something I would like to avoid. If it could be done at a BARE minimum then maybe but the lib you included in your test page was far too much overhead with far to little payoff.
Basically it comes down to user expectation. In fact, I will grant you that objects themselves can be argued not to be equal except by reference, but this is an exception we can make for such a common use pattern of object literals. Note here that this is a very big issue that touches upon the fundamental problem of JS using objects as hashes, ie. hashes can define a == operator but that doesn't mean that all objects should.

In any case this is completely untrue for functions. Functions can never be considered "equal" in any meaningful sense as they are not objects that hold data but instead procedures that do something. Actually, let me rephrase that. Although I could see someone forcing a definition of "equality" on a "textually equivalent" function, that's not the problem. The problem is turning that into an assumption that becomes an operation in a completely different context (like unique or union). For example, saying that two functions are "texually equivalent", and therefore can be uniqued inside an array is making a massive assumption, as the same "texually equivalent" function could be called twice to change the state of an application entirely. Array#unique is intending something entirely different. It is intending to remove duplicate values from an array. This is nonsensical in the case of functions because functions are not values, they return values. They are not data.

Ruby (and I'm assuming other languages as well) agree with me here:

lambda { 'x' } == lambda { 'x' }   ->  false

Note that this is despite == operators working on data containers like arrays and objects.

So the bottom line however methods like unique and union work, it is a requirement that they do not treat lambda functions as equal, except by reference.

andrewplummer commented 12 years ago

Another point I'd like to make on this is that the definition "textually equivalent" is flawed here as functions can be called with entirely different contexts (arguments and object binding). And in most cases this will completely change the outcome, whether that be changing state or returning a different value.

ermouth commented 12 years ago

Ok. But [1]==[1] => false and {a:1}=={a:1} => false, really. Despite of this fact your implementation do assume they are equal. And my do.

Applying your logic proposed for functions, these arrays and object are to be assumed as different. M?

About context.

The only additional property of textually equivalent function definitions is this. Context, right. But arguments is an array, initialized only on function call. So it's a context for function running, but not for us.

So, what about this. In fact, when we union or intersect arrays, we are to construct new array (result) whos branches are pushed in or assigned. But -- surpise -- either array.push(functionReference) and array[ind] = functionReference destroys old this.

So if you copy steps below to console, you can see that context of b[0] function pushed into a[] array changed from b[] to a[].

var a=[function(){return this;},"a"]; var b=[function(){return this;},"b"]; 
a.push(b[0]);
a[2]();

Assigment behaves the same way. So context is dropped anyway -- and textually equal functions are the same and should be merged.

I mean textually equal functions being called directly from sources a[] and b[] are able to return different outputs. But if we call these functions by traversing them from result array, output always be the same. Ain't it a good reason to merge them?

Certainly, there is a trick to save context -- but this case we need a proxy closure between caller and source function (proxy can remember old this). I think it's complex to implement idea -- cause result array will contain functions, which behavior be equal to source ones, but their String() representation will be artificial.

What do you think?

And -- olala -- md5-free 7x faster version of arr. JSON still in use.

ermouth commented 12 years ago

Oh, I'm sorry, I just understood my mistake. What a stupid I was!

var a = [(function(){var x = 0; return function(){return x};})()]; 
var b = [function(){return x}];

String(b[0])==String(a[0]) => true
But 
a[0]() => 0
b[0]() => throws ReferenceError

I'll check it out.

ermouth commented 12 years ago

Ok, now no dependency from json or md5, functions handled correctly -- assumed equal only if f1===f2 => true.

http://ermouth.com/fastArray/i2.html

andrewplummer commented 12 years ago

(equality) No, if we are talking about Ruby (as in my example) they are equal. I understand that in Javascript they are not, but for the purposes of Sugar, this is the requirement:

[1] == [1]     -> true
{a:1} == {a:1} -> true
fn1 == fn1     -> true
eval('var fn2 = ' + fn1); // "textually equivalent"
fn1 == fn2     -> false

This should hold true for all methods in the library with the notable exception of Array#indexOf, Array#lastIndexOf, etc. shims where ES5 spec explicitly requires a strict equality === operator.

(context) Sorry I didn't mean "context" in the "Javascripty" sense of the this object. What I meant was that functions are called with different arguments and sometimes with a different this object. So they can have a different "context" in the general meaning of the word. Maintaining the this object while manipulating the functions isn't the point here -- it shouldn't matter as both this object and the arguments only change when the function is called. The point is that when they are called they may be called with a different this and different arguments. So they in fact are not equal, even if they are "textually equivalent". Again, the point is that they are procedures, not values. This is why they should never be considered equal.
(about your code) First of all I want to note that the fact that you now have different hashes for different types is good, in fact it's vital. The previous code would have failed everywhere if this change hadn't happened.

Second, it appears that you're now doing a stripped down version of JSON.stringify. Part of me wants to say that if we are willing to go this far then Sugar should in fact shim it and go all the way (not something I'd like to do), or require it as a dependency (also not optimal). However, we have come this far so I want to keep going with this technique and see where it can take us.

It also occurs to me that considering you have now split the hash object into types, the difference between objects and arrays now becomes irrelevant, so the same stringify method can be applied. For example let's say the stringify method was simply key/value pairs separated by a comma, with the keys and values separated by =>:

{ foo: 'bar' } would be foo=>bar
['a','b','c']  would be 0=>a,1=>b,2=>c
{ 0: 'a', 1: 'b', 2: 'c' } would likewise be  0=>a,1=>b,2=>c

The bottom 2 would be identical but it wouldn't matter as they are separated by type. This would afford a lot of cleanup and still keep contents unique.

Of course the biggest problem is what to do about functions nested in objects. In this case (and only this case) I think it may be acceptable to use a simple [[toString]] call on the functions and consider them "textually equivalent", within the context of an object. Alternatively, simply having a function nested in an object could immediately render the object unique. I'm more for the 2nd option as it is more consistent and functions are not valid JSON anyway. The point with operating on objects in the end is that JSON data structures can be properly handled.

There are a number of convoluted constructs here but I think there is a possibility that this may lead to a faster solution. However considering all the steps needed to take to be robust, I'm dubious of the speed increase now. However it is worth a shot. I think the next step is to write more complex unit tests and see if this code can pass, then benchmark the hell out of it and see if the increase is actually worth the added complexity in code.

It's not a yes or no answer... if the speed benefits are dramatic and we can get this code down to something more manageable, I do think there's a chance it can work....

andrewplummer commented 12 years ago

I wanted to write a quick update to tell you where I'm at with this. I've run some benchmarks of my own and here are the results:

union of two one-dimensional arrays of length 3 each, 100,000 iterations
- Ermouth: avg 1082ms
- Sugar: avg 7975ms
union of two one-dimensional arrays of length 1,000 each, 1 iteration
- Ermouth: avg 4ms
- Sugar: avg 5411ms
union of two arrays with very large JSON objects inside, one of length 1 and one of length 2 (many properties, deeply nested, all 3 are separate references with 2 matching and 1 slightly different, production a union of 2 elements in the array), 50 iterations
- Ermouth: avg 4852ms
- Sugar: avg 2554ms

So, as expected your code beats Sugar for arrays with greater length (O(n²) problem). For very large arrays the difference is dramatic. However, also as expected, Sugar is faster for larger objects, as it takes time to stringify each and every property as well as hold references to the functions. Presumably this difference is greater on more complex objects, although not as dramatically as the O(n²) issue.

Note also that I've added a lot more unit tests and there are addition points of failure for your code such as union of [null], and [null], which incorrectly produces [null,null] (Sugar fails on this too, FYI).

But wait! Before we continue I like where this is headed, and it occurred to me that this overhead of attempting to produce a JSON-like string is completely unnecessary, as long as the contents of the object are properly unique by type. So I have written a greatly simplified function that not only is much less code, but also runs faster as well. Here is the code:

var fnCounter;

function newArrayUnion(a, b) {
  return newArrayUnique(a.concat(b));
}

function newArrayUnique(arr) {
  var o = {}, result = [];
  fnCounter = 0;
  arrayEach(arr, function(el) {
    var stringified = stringify(el),
        existing = o[stringified];
    if(!existing) {
      o[stringified] = true;
      result.push(el);
    }
  })
  return result;
}

// Sugar will handle this
function arrayEach(arr, fn) {
  for(var i = 0; i < arr.length; i++) {
    fn.call(arr, arr[i], i);
  }
}

function stringify(thing) {
  var type = typeof thing, value = String(thing);
  if(type == 'function') {
    value = fnCounter += 1;
  } else if(value.substr(0,15) === '[object Object]') {
    value = '';
    var keys = Object.keys(thing).sort();
    for(var i = 0; i < keys.length; i++) {
      value += keys[i] + stringify(thing[keys[i]]);
    }
  }
  return type + value;
}

This code simply includes the primitive type in the resulting string to ensure that properties like 1 and "1" etc. are unique, anything that cannot be stringified to a value is treated as an object and recursively iterates its properties over the stringify function. Additionally, it will simply increment a counter in the case of a function. So here is how this version performs:

union of two one-dimensional arrays of length 3 each, 100,000 iterations
- Ermouth: avg 1082ms
- Sugar: avg 7975ms
- New: avg 792ms
union of two one-dimensional arrays of length 1,000 each, 1 iteration
- Ermouth: avg 4ms
- Sugar: avg 5411ms
- New: avg 3ms
union of two arrays with very large JSON objects inside, one of length 1 and one of length 2 (many properties, deeply nested, all 3 are separate references with 2 matching and 1 slightly different, production a union of 2 elements in the array), 50 iterations
- Ermouth: avg 4852ms
- Sugar: avg 2554ms
- New: avg 1027ms

So this version actually beats all the others! This is good news. If you're sharp eyed you will notice, however that functions are treated as always unique, even if they are === by reference. I have written a version that does make sure functions which are equal by reference are properly merged, and it has almost no effect on performance here.

However after thinking about it more, it seems to me that the reasons I gave you for "textually equivalent" functions not being equal hold true for functions equivalent by reference as well. There should not be any difference between the two. So this version treats ALL functions as unique entities whenever they are encountered. This seems OK to me.

The important thing that I want to preserve is that functions can be found in arrays by reference... take for example this code:

var fn1 = function(){};
[fn1].indexOf(fn1) -> 0

This is not only expected behavior but also ES5 functionality so this cannot change and in addition my requirement is that all array functions that find and return objects need to follow this pattern. However, it now seems to me that the context of "finding an object in an array" and that of operations like Array#union and Array#unique are different, and the latter infers object equality for which I think it is safe to exclude functions. This is easier to explain also, ie. "functions are always considered unique" etc. Of course I want to think through the ramifications of this decision and I'm not firm on it yet.

If it is important that functions by reference are treated as equal, though, the new function can always be rewritten to do this. However this unfortunately raises the question of what is to be done about functions nested inside objects, which it would be nicer to avoid.

In either case, this new method is the fastest so far and it also passes all the unit tests, which the others don't, so I think we are on the right track!

ermouth commented 12 years ago

Ok, thank you.

There is no problem with functions nested into objects -- they are already handled correctly, not by textual representation.

my stringify() "hash" them, so different functions will be represented as different "hashes", equal (by ===, not by text) -- with equal "hash". This -- and only this -- part of implementation is O(n log n), so if you make test only of functions, you'll gain slightly weaker result than usual.

You may test it, it really works.

When serialaizing arrays, indexes are not inserted, so [1,2,"3"] -- so we need no optimization here.

There are two problems for now:

when serializing I use ᴥ character (Unicode 1D25) as delimiter, not quotations. I made this to avoid escaping in object string (it will slow down significantly). So [1,2,'3"3', function (){}] will serialize in [ᴥnumberᴥ1ᴥ,ᴥnumberᴥ2ᴥ,ᴥstringᴥ3ᴥ"3,ᴥfunctionᴥ1ᴥ]. So if ᴥ is in source strings we may have some collisions in hashes. For now I think this fact can be ignored.
I break initial order -- result is grouped by type and sorted.

I'll fix it. Speed of serialization will also be corrected.

andrewplummer commented 12 years ago

Ok... well the point for me is that I think the delimiters aren't required at all. I've already discovered an issue with my implementation but I will work on it as well. Keep in mind that in the end, complexity and bulk of code will also make a big difference about whether I will change this or not.

ermouth commented 12 years ago

http://ermouth.com/fastArray/i4.html

Now 2x-3x faster on serializing, so speed on small sets of curly json nodes are near to be equal to Sugar 1.2. Also I've incorporated huge 11Mb json test of 32x39 -- perfomances are nearly the same. In fact, your implemetation ~1.2x faster on Chrome, but equal or slower (or even significantly slower) on all other browsers.

Node order is preserved (except functions -- they go last). Without preserving order it'll be faster 1.2-1.3 times in general.

andrewplummer commented 12 years ago

Awesome! Sounds like you've been busy... I have too. In fact, I've gone and rewritten Array#unique to entirely use a new stringify method... I think we're converging on the same technique as I'm noticing you stringifying the variable types as I am doing. Anyway have a look:

http://sugarjs.com/union/

This page has the new code. Array#unique is entirely rewritten and Array#union, Array#intersect, and in fact even more methods now use it. I'm loading your latest code in for comparison as well but if you use the box at the top you can load a any new version in you want to try it out as long as you overwrite the global arr.union it should properly run both the benchmarks and the tests on that code....

As you can see there are a TON of unit tests that I have to get passing, so there is a lot of code that is handling more cases... however it looks like the benchmarks comparisons are quite close... this is probably about as fast as we're going to be able to get it I think!

ermouth commented 12 years ago

Great!

In fact I've included node type into hashkey (same manner as you do) only to prevent order -- so version, where different types are stored in different subbranches of hash array is litlle bit faster -- as I mentioned above. But I decided that in practical cases order is much more important than 15% speed up )

By the way, I've made up small optimizations over last version -- so now it's in any case faster than Sugar 1.2 even on short but spriggy arrays.

Also I tried to incorporate your new edge-version of Sugar into my testbench (http://ermouth.com/fastArray/i4-1.html), and... Yes, new Sugar shows tremendous results on long arrays. But on small arrays of branchy jsons new Sugar lost about 40% of speed if compared with 1.2.

I've spent about three hours last night to deeply optimize serialization. The main points are:

quick return from recursion if we have string (i use direct typeof check, anything else slows iterators down)
checking if subbranch is array, and them if it's object is quicker, than reverse sequence (I really don't know how this magic works, mb some hidden runtime optimizations take place).
hashing both arrays into single object is mentionably faster, than hashing #1 into object1, then #2 into object2 and then iterating through both of them merging keys
checkings like null==o[i] are slightly faster than o[i]==null (I can also hardly understand it)
checkings like null==o[i] are 10-20% faster than o.hasOwnProperty(i)

Now I give up, I'm even not a programmer to compete further )) Hope you'll mention me in your memoires and was glad to help a little.

Thanks for Sugar, it's great!

andrewplummer commented 12 years ago

I've actually taken this one step farther... now the stringify method drives my object equality as well (Object.equal). I wasn't sure if I could do pull this off but the unit tests are showing green (object equality is the largest number of unit tests I have) and it's looking possible. There is no speed benefit in the case of equality (previous code was fast enough), but the code is much cleaner. As an example testing equality on regexes, dates, and numbers is far simplified by stringifying and comparing the result, so there is no need for handling these special cases. It makes a lot of sense for these 2 areas to be going through the same code, as unique is essentially doing the equivalent of a deep equal operation, any anything equal in a union operation should also be equal in an equal operation. With this modification I'm also able to make Sugar equal "egal", which I'm glad about. It was nearly "egal" before except for a couple very minor edge cases I couldn't work out... now it should be fully "egal".

With this in mind, however, I am now including the internal [[toString]] class name in my stringification. This does have a minor impact on speed, but it's necessary to handle edge cases with egal equality. I tried going down a few paths of avoiding it, but it's MUCH simpler than making assumptions about methods on primitives that could be modified or nonexistent, so I think I am willing to sacrifice a small amount of performance here to keep out further potential edge cases. I also am including typeof type for other reasons (here's an example "hello" vs. new String('hello')) ... This is making the code more and more obvious and understandable as well though... Between the internal class name and typeof, 99% of edge cases are caught.

To respond to your points:

Good call! I changed the code to return if type string up front BEFORE making the toString call, and I think I regained the 40% you were talking about...
Well 2 things to note here. Since you are using Sugar for the type checking (except for primitive types), each call you are making is a call to [[toString]]. So if you have a date, for example, it will be the slowest as it is the last check. It should be faster I think to manually do a single call to [[toString]] then check against like this: klass === "[object RegExp]', etc. Also note thatisObjectis doing more than a simple check against[object Object]`, so it is inherently slower.
I see... in my version there is only ever a single object used... in fact this is because I'm doing a straight concat on the arrays, then a single unique operation. This is for union. intersect also uses only one hash, but it needs to step through one array entirely first as it's slightly different and more complex than a unique operation.
Makes no sense to me either :)
Hmm, that's interesting, although I don't see much speed change on my side. However I'm wary of null==o[i] because it will not properly catch properties purposely set to null and undefined. I suppose it's not an issue here though as it's all stringified anyway. I'm using i in o in my script.

Anyway I've sent up a new version so try it! It's still the edge script so URL should be the same. I think the big JSON should be faster... I will definitely give you a mention in the code and a shoutout when this is released!

andrewplummer commented 12 years ago

Cool this is now a part of 1.2.4.. @ermouth thanks again for all the help!