DusanKasan / Knapsack

Collection pipeline library for PHP
http://dusankasan.github.io/Knapsack/
MIT License
536 stars 56 forks source link

Improve performance on distinct #68

Closed Ekman closed 4 years ago

Ekman commented 4 years ago

The difference in performance on distinct vs array_unique is so large that we've decided to refactor some of our code to use the latter instead.

I wanted to see if I could improve the performance on distinct. I copied a one of the benchmarks in tests/performance and modified it to compare distinct vs array_unique instead.

I ran the benchmark using the current solution and got this result:

+-------------------------------------------------------------------------------------------+-----------------------+---------------------------+----------------------+
| operation details                                                                         | native execution time | collection execution time | difference (percent) |
+-------------------------------------------------------------------------------------------+-----------------------+---------------------------+----------------------+
| array_unique vs Collection::distinct on 10000 integers (addition)                           | 0.00062398910522461s  | 0.17144465446472s         | 27475%             |
| array_unique vs Collection::distinct on 10000 strings (concatenation)                       | 0.00031847953796387s  | 1.1402836799622s          | 358039%            |
| array_unique vs Collection::distinct on 10000 md5 invocations                               | 0.00044071674346924s  | 1.3995371103287s          | 317559%            |
| array_unique vs Collection::distinct for 10000 integers n, counting sum(0, n) the naive way | 0.00051562786102295s  | 0.14881670475006s         | 28861%             |
+-------------------------------------------------------------------------------------------+-----------------------+---------------------------+----------------------+

I then changed the distinct implementation. I don't know anything about the underlying implementation of array or in_array in PHP, I just know PHP arrays are also key/value maps. Key/value maps tend to have very fast lookup in programming languages. According to this then in_array have a time complexity of O(n) while isset have a time complexity of "close to O(1)". By switching to isset we'll get a huge time complexity improvement.

I ran the benchmark again and got this result:

+---------------------------------------------------------------------------------------------+-----------------------+---------------------------+----------------------+
| operation details                                                                           | native execution time | collection execution time | difference (percent) |
+---------------------------------------------------------------------------------------------+-----------------------+---------------------------+----------------------+
| array_unique vs Collection::distinct on 10000 integers (addition)                           | 0.00050990581512451s  | 0.050813627243042s        | 9965%                |
| array_unique vs Collection::distinct on 10000 strings (concatenation)                       | 0.00026485919952393s  | 0.050021052360535s        | 18885%               |
| array_unique vs Collection::distinct on 10000 md5 invocations                               | 0.00030789375305176s  | 0.053486967086792s        | 17371%               |
| array_unique vs Collection::distinct for 10000 integers n, counting sum(0, n) the naive way | 0.00073227882385254s  | 0.065831804275513s        | 8989%                |
+---------------------------------------------------------------------------------------------+-----------------------+---------------------------+----------------------+

That's a huge improvement! Unfortunately, still not nearly as good as array_unique. Is there any other way this can be done? Let me know if you have any questions or comments.

Ekman commented 4 years ago

I had some colleagues point out that this would be a breaking change since in_array can take an object, while isset can't.

An alternative solution would be to do something like this:

function distinct($collection)
{
    $generatorFactory = function () use ($collection) {
        $distinctValues = [];
        $distinctObjects = new \SplObjectStorage();

        foreach ($collection as $key => $value) {
            if (is_object($value)) {
                if (!$distinctObjects->contains($value)) {
                    $distinctObjects->attach($value);
                    yield $key => $value;
                }
            } else {
                if (!isset($distinctValues[$value])) {
                    $distinctValues[$value] = true;
                    yield $key => $value;
                }
            }
        }
    };

    return new Collection($generatorFactory);
}

Let me know what you think.

jdreesen commented 4 years ago

It's not that easy, because in_array cannot only handle objects, but even arrays (see: https://www.php.net/manual/en/function.in-array.php#example-6324), etc. (probably everything that can be an array element), while isset can only handle elements that are valid as array keys (basically ints and strings)...

Ekman commented 4 years ago

@jdreesen That's a fair point.

I propose one of these solutions:

  1. Use isset and introduce a backwards incompatible change
  2. Use SplObjectStorage for checking distinct objects. Use an array to check distinct primitive values
  3. Create a wrapper for primitive values so that everything can be stored in a SplObjectStorage
  4. Keep using in_array

It would be a shame to keep the solution as-is. Decreasing the time complexity from O(n^2) to O(n) is a huge improvement, even if it requires some weird code. This is truly the example where the benefits of optimized code outweighs readable code. :)

What do you guys think?