Closed · Algunenano closed this 6 years ago
🎉 I'll take a look but won't be able to until later today
The test outputs look reasonable (except for my one comment). Besides the nicely clustered test data, have you compared the outputs to what's found in other implementations of Jenks?
By the way, scikit-learn's k-means implementation matches the test data, but it doesn't deduplicate (apply the set operation to) its outputs:
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> kmeans = KMeans(n_clusters=4)
>>> X = np.array([0.99, 1.0, 10.01, 10.01, 10.01, 10.01]).reshape((-1, 1))
>>> km_out = kmeans.fit(X)
>>> km_out.cluster_centers_
array([[10.01],
       [ 1.  ],
       [ 0.99],
       [10.01]])
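To get deduplicated, sorted centers (closer to what CDB_JenksBins returns) you can pass them through np.unique, which flattens, dedupes and sorts; with the centers above that should give:
>>> np.unique(km_out.cluster_centers_)
array([ 0.99,  1.  , 10.01])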
I haven't found any two implementations that produce the same output, so it's difficult to compare against anything. Jenks is highly dependent on where you start (here it used to be quantiles, and now it's a division with an equal number of distinct values per class) and on what you do when you find a minimum (here we randomize another start N times and finally select the best of them), so for any non-trivial case the results differ quite a bit.
Besides that, I've seen some implementations that discard duplicates or include NULLs, so the results are even more different.
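One way to sanity-check two implementations despite their different break points would be to compare the quantity Jenks tries to minimize (the within-class sum of squared deviations) on the same data; lower is better. A rough, untested Python sketch, with made-up data and breaks (within_class_ssd is just an illustrative helper, not part of this PR):

import numpy as np

def within_class_ssd(values, breaks):
    """Sum of squared deviations from the class mean, summed over all classes.

    breaks are the upper bounds of each class; the last one is max(values).
    """
    values = np.sort(np.asarray(values, dtype=float))
    total = 0.0
    lower = -np.inf
    for upper in breaks:
        members = values[(values > lower) & (values <= upper)]
        if members.size:
            total += ((members - members.mean()) ** 2).sum()
        lower = upper
    return total

data = [0.1, 0.5, 0.9, 10.0, 10.2, 95.0, 102697.0]

# Two different starting points can converge to different breaks;
# the classification with the lower SSD is the better one.
print(within_class_ssd(data, [0.9, 10.2, 102697.0]))
print(within_class_ssd(data, [0.5, 95.0, 102697.0]))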
I've done some tests with my local maps and, once I get to datasets whose values differ wildly from each other (for example, area sizes around SE Australia), it takes too long to calculate anything. This also happens when you have too many distinct values (e.g. 1 million rows and you ask for Jenks of cartodb_id).
An example (table with 28k values, ~6k unique):
Select count(*) from mb_2016_sa where areasqkm16 IS NOT NULL;
count
-------
28205
(1 row)
Select count(*) from (Select distinct(areasqkm16) from mb_2016_sa) _a;
count
-------
5796
(1 row)
Query (I think it's partially generated by Builder and the rest is Windshaft) executed with this PR:
SELECT
min(areasqkm16) min_val,
max(areasqkm16) max_val,
avg(areasqkm16) avg_val,
CDB_JenksBins(array_agg(areasqkm16::numeric), 5) as jenks
FROM (SELECT * FROM mb_2016_sa) _table_sql WHERE areasqkm16 IS NOT NULL
AND
areasqkm16 != 'infinity'::float
AND
areasqkm16 != '-infinity'::float
AND
areasqkm16 != 'NaN'::float;
min_val | max_val | avg_val | jenks
---------+-------------+------------------+--------------------------------------------------
0 | 102697.3966 | 34.8971780889912 | {0.9414,10.3084,225.5936,22695.2478,102697.3966}
(1 row)
Time: 44871.570 ms (00:44.872)
To improve this I've decided to modify the algorithm so that the number of values moved in each iteration grows exponentially until it finds a worse value, at which point it goes back and tries moving fewer values. For example (a rough sketch of this search is included further below):
n-1: Move 2^(n-1) elements. The result is better.
n: Move 2^n elements (e.g. 4096). The result is worse.
n+1: Go back to the status of n-1 and move ~(elements moved)/8 elements (512 in this case). The result is better than n-1.
n+2: Move 2*(elements moved in n+1) elements (1024 in this case). And so on.
After those changes:
min_val | max_val | avg_val | jenks
---------+-------------+------------------+--------------------------------------------------
0 | 102697.3966 | 34.8971780889912 | {0.814,17.7499,2198.8101,28127.5279,102697.3966}
(1 row)
Time: 934.776 ms
I blame the different result on jumping over local minima (of which there are a lot with this method), but I think the speedup is worth it, even more so considering that we have already heavily changed the output in this PR. On other datasets the result doesn't change.
I've experimented with only doing this when the max and min groups were the same between iterations, but that meant a lot of resets, so it was still too slow. I've also tried (n) -> (n+1) instead of (n) -> (2*n), but that was slow too.
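For reference, a rough, untested Python sketch of the doubling/back-off search described above; move, cost and exponential_step_search are hypothetical stand-ins, since the real logic lives inside CDB_JenksBinsIteration in PL/pgSQL:

def exponential_step_search(state, cost, move, max_iters=10000):
    """Repeatedly move k elements, doubling k while the result improves.

    move(state, k) must return (new_state, new_cost) for moving k elements;
    lower cost is better (e.g. the within-class sum of squared deviations).
    """
    k = 1
    for _ in range(max_iters):
        new_state, new_cost = move(state, k)
        if new_cost < cost:
            # Improvement: keep the move and try twice as many elements next time.
            state, cost = new_state, new_cost
            k *= 2
        elif k > 1:
            # Worse result: throw the move away and retry with ~k/8 elements.
            k = max(1, k // 8)
        else:
            # Even moving a single element makes it worse: we're at a (local) minimum.
            break
    return state, cost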
I'll add another commit to the PR with that change.
CDB_JenksBinsIteration has changed its signature.
Testing that the update process leaves a clean slate:
Can you guys have a look? If you are busy I'll reassign it to the RT developer. Any ideas to add extra testing are welcome.