CartoDB / cartodb-postgresql

PostgreSQL extension for CartoDB
BSD 3-Clause "New" or "Revised" License

Improvements in `CDB_JenksBins` #340

Closed · Algunenano closed this 6 years ago

Algunenano commented 6 years ago

Testing that the update process leaves a clean slate:

test_overviews=# WITH data_small AS (
    SELECT Array[0.99, 1.0, 10.01, 10.01, 10.01, 10.01]::numeric[] AS s
)
SELECT unnest(CDB_JenksBins(s, 4)) FROM data_small;
 unnest 
--------
   0.99
  10.01
  10.01
  10.01
(4 rows)

test_overviews=# ALTER EXTENSION cartodb UPDATE TO "0.24.0";
ALTER EXTENSION
test_overviews=# WITH data_small AS (
    SELECT Array[0.99, 1.0, 10.01, 10.01, 10.01, 10.01]::numeric[] AS s
)
SELECT unnest(CDB_JenksBins(s, 4)) FROM data_small;
 unnest 
--------
   0.99
    1.0
  10.01
(3 rows)
test_overviews=# \df cdb_jenksbins*
                                                              List of functions
 Schema  |          Name          | Result data type |                                         Argument data types                                                         | Type
---------+------------------------+------------------+----------------------------------------------------------------------------------------------------------------------+------
 cartodb | cdb_jenksbins          | numeric[]        | in_array numeric[], breaks integer, iterations integer DEFAULT 0, invert boolean DEFAULT false                        | func
 cartodb | cdb_jenksbinsiteration | numeric[]        | in_matrix numeric[], breaks integer, classes integer[], invert boolean, sdam numeric, max_search integer DEFAULT 50   | func
(2 rows)

Can you guys have a look? If you're busy I'll reassign it to the RT developer. Any ideas for extra testing are welcome.

andy-esch commented 6 years ago

🎉 I'll take a look but won't be able to until later today

andy-esch commented 6 years ago

The test outputs look reasonable (except for my one comment). Besides the nicely clustered test data, have you compared the outputs to what's found in other implementations of Jenks?
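
For such a comparison, one off-the-shelf reference is the jenkspy package; a minimal check on the same test data might look like this (jenkspy isn't mentioned in the thread, and its output will depend on the version installed):

>>> import jenkspy
>>> jenkspy.jenks_breaks([0.99, 1.0, 10.01, 10.01, 10.01, 10.01], 4)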

andy-esch commented 6 years ago

By the way, scikit-learn's k-means implementation produces the same grouping for the test data, but it doesn't apply the set (deduplication) operation to its outputs:

>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> kmeans = KMeans(n_clusters=4)
>>> X = np.array([0.99, 1.0, 10.01, 10.01, 10.01, 10.01]).reshape((-1, 1))
>>> km_out = kmeans.fit(X)
>>> km_out.cluster_centers_

    array([[10.01],
           [ 1.  ],
           [ 0.99],
           [10.01]])
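
Collapsing those duplicate centers the way CDB_JenksBins deduplicates its breaks is a one-liner on top of that output (np.unique also sorts, which matches the ordering of the breaks):

>>> np.unique(km_out.cluster_centers_)
array([ 0.99,  1.  , 10.01])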
Algunenano commented 6 years ago

> The test outputs look reasonable (except for my one comment). Besides the nicely clustered test data, have you compared the outputs to what's found in other implementations of Jenks?

I haven't found any two implementations that produce the same output, so it's difficult to compare against anything. Jenks is highly dependent on where you start (here it used to be quantiles; now it's a division that gives each class an equal number of distinct values) and on what you do when you find a minimum (here we randomize another start N times and finally select the best of them), so for any non-trivial case the results differ quite a bit. Besides that, I've seen some implementations that discard duplicates or include NULLs, so the results diverge even more.
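
A rough Python sketch of that randomized-restart selection (an illustration only, with hypothetical names; the extension's real logic lives in PL/pgSQL):

import random

def sdcm(values, breaks):
    # Sum of squared deviations from each class mean: the quantity a
    # Jenks-style optimizer minimizes (lower is better).
    total, start = 0.0, 0
    for end in breaks + [len(values)]:
        cls = values[start:end]
        mean = sum(cls) / len(cls)
        total += sum((v - mean) ** 2 for v in cls)
        start = end
    return total

def best_of_n_starts(values, n_classes, n_starts, optimize):
    # Run a local optimizer from several random initial partitions and
    # keep whichever result has the lowest SDCM ("select the best of them").
    # `optimize` is a placeholder for the local optimization pass.
    values = sorted(values)
    best, best_cost = None, float('inf')
    for _ in range(n_starts):
        breaks = sorted(random.sample(range(1, len(values)), n_classes - 1))
        breaks = optimize(values, breaks)
        cost = sdcm(values, breaks)
        if cost < best_cost:
            best, best_cost = breaks, cost
    return best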

Algunenano commented 6 years ago

I've done some tests with my local maps and, once I get to datasets whose values are spread very far apart (for example, area sizes around SE Australia), it takes too long to calculate anything. The same happens when there are too many distinct values (e.g. 1 million rows where you ask for Jenks over cartodb_id).

An example (table with 28k values, ~6k unique):

Select count(*) from mb_2016_sa where areasqkm16 IS NOT NULL;
 count 
-------
 28205
(1 row)

Select count(*) from (Select distinct(areasqkm16) from mb_2016_sa) _a;
 count 
-------
  5796
(1 row)

The query (I think it's partially generated by Builder and the rest by Windshaft), executed with this PR:

SELECT
    min(areasqkm16) min_val,
    max(areasqkm16) max_val,
    avg(areasqkm16) avg_val,
    CDB_JenksBins(array_agg(areasqkm16::numeric), 5) AS jenks
FROM (SELECT * FROM mb_2016_sa) _table_sql
WHERE areasqkm16 IS NOT NULL
  AND areasqkm16 != 'infinity'::float
  AND areasqkm16 != '-infinity'::float
  AND areasqkm16 != 'NaN'::float;
 min_val |   max_val   |     avg_val      |                      jenks                       
---------+-------------+------------------+--------------------------------------------------
       0 | 102697.3966 | 34.8971780889912 | {0.9414,10.3084,225.5936,22695.2478,102697.3966}
(1 row)

Time: 44871.570 ms (00:44.872)

To improve this I've decided to modify the algorithm so that the number of values moved in each iteration grows exponentially until it finds a worse value, at which point it goes back and tries moving a smaller number of values. For example, instead of moving 1, 1, 1, … values at a time, it moves 1, 2, 4, 8, … while the fit keeps improving, then backs off once it worsens. A sketch follows.
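
A toy sketch of that grow/backoff schedule on a single integer parameter (hypothetical code, just to illustrate the idea; the real change operates on how many values move between Jenks classes):

def exponential_descent(cost, x, lo, hi):
    # Move x with an exponentially growing step while cost(x) improves;
    # once a step makes things worse (or leaves [lo, hi]), halve the step
    # and retry, stopping when the step reaches zero.
    best = cost(x)
    for direction in (+1, -1):
        step = 1
        while step >= 1:
            nxt = x + direction * step
            if lo <= nxt <= hi and cost(nxt) < best:
                x, best = nxt, cost(nxt)
                step *= 2   # keep doubling while we improve
            else:
                step //= 2  # worse value: back off to a smaller step
    return x, best

With a roughly convex cost this reaches a minimum in O(log n) evaluations instead of one move at a time, which is where the speedup comes from; the price is that a big jump can leap over a narrow local minimum, as noted below.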

After those changes, the same query gives:

 min_val |   max_val   |     avg_val      |                      jenks                       
---------+-------------+------------------+--------------------------------------------------
       0 | 102697.3966 | 34.8971780889912 | {0.814,17.7499,2198.8101,28127.5279,102697.3966}
(1 row)

Time: 934.776 ms

I attribute the different result to jumping over local minima (of which there are a lot in this method), but I think the speedup is worth it, even more so considering that we have already heavily changed the output in this PR. On other datasets the result doesn't change.

I experimented with only doing this when the max and min groups stayed the same between iterations, but that meant a lot of resets, so it was still too slow. I also tried growing the step as (n) -> (n+1) instead of (n) -> (2*n), but that was slow too.

I'll add another commit to the PR with that change.