dfinity / motoko-base

The Motoko base library
Apache License 2.0
483 stars 99 forks source link

[Milestone-3] Serokell: New OrderedSet.mo & OrderedMap fixup #662

Closed GoPavel closed 1 week ago

GoPavel commented 1 month ago

This is an MR for the 3rd Milestone of the Serokell's grant about improving Motoko's base library.

The main goal of the PR is to introduce a new functional implementation of the set data structure to the' base' library. Also, it brings a few changes to the new functional map that was added in #664 , #654 .

General changes:

Functional Map changes:

New functionality:

Optimizations:

Fixup:

NEW functional Set:

The new data structure implements an ordered set interface using Red-Black trees as well as the new functional map from the 1-2 Milestones.

API implemented:

Maintainance support:

Applied optimizations:

Rejected optimizations:

Final benchmark results:

Collection benchmarks

binary_size generate max mem batch_get 50 batch_put 50 batch_remove 50 upgrade
orderedset+100 218_168 186_441 37_916 53_044 121_237 127_460 346_108
trieset+100 211_245 574_022 47_652 131_218 288_429 268_499 729_696
orderedset+1000 218_168 2_561_296 520_364 69_883 158_349 170_418 3_186_579
trieset+1000 211_245 7_374_045 633_440 162_806 383_594 375_264 9_178_466
orderedset+10000 218_168 40_015_301 320_532 84_660 192_931 215_592 31_522_120
trieset+10000 211_245 105_695_670 682_792 192_931 457_923 462_594 129_453_045
orderedset+100000 218_168 476_278_087 3_200_532 98_553 230_123 258_372 409_032_232
trieset+100000 211_245 1_234_038_235 6_826_516 222_247 560_440 549_813 1_525_692_388
orderedset+1000000 218_168 5_514_198_432 32_000_532 115_836 268_236 306_896 4_090_302_778
trieset+1000000 211_245 13_990_048_548 68_228_312 252_211 650_405 642_099 17_455_845_492

set API

size intersect union diff equals isSubset
orderedset+100 100 146_264 157_544 215_871 28_117 27_726
trieset+100 100 352_496 411_306 350_935 201_896 201_456
orderedset+1000 1000 162_428 194_198 286_747 242_329 241_938
trieset+1000 1000 731_650 1_079_906 912_629 2_589_090 4_023_673
orderedset+10000 10000 177_080 231_070 345_529 2_383_587 2_383_591
trieset+10000 10000 3_986_854 21_412_306 5_984_106 46_174_710 31_885_381
orderedset+100000 100000 190_727 267_008 402_081 91_300_348 91_300_393
trieset+100000 100000 178_863_894 209_889_623 199_028_396 521_399_350 521_399_346
orderedset+1000000 1000000 205_022 304_937 464_859 912_901_595 912_901_558
trieset+1000000 1000000 1_782_977_198 2_092_850_787 1_984_818_266 5_813_335_155 5_813_335_151

new set API

size foldLeft foldRight mapfilter map
orderedset 100 16_487 16_463 88_028 224_597
orderedset 1000 133_685 131_953 1_526_510 4_035_782
orderedset 10000 1_305_120 1_287_495 28_455_361 51_527_733
orderedset 100000 13_041_665 12_849_418 344_132_505 630_692_463
orderedset 1000000 130_428_573 803_454_777 4_019_592_041 7_453_944_902
crusso commented 2 weeks ago

One quick observation. subset and equality seem very expensive: I wonder if that's because the isSubsetHelper still uses an iterator, that allocates. Maybe just use recursion again?

Perhaps there's even faster way but I don't know enough about red-black trees to suggest one. If the representation is canonical, maybe you can exploit that. But I'm not sure it is or depends on the order of insertions/deletions. Do you know?

GoPavel commented 2 weeks ago

One quick observation. subset and equality seem very expensive: I wonder if that's because the isSubsetHelper still uses an iterator, that allocates. Maybe just use recursion again?

Perhaps there's even faster way but I don't know enough about red-black trees to suggest one. If the representation is canonical, maybe you can exploit that. But I'm not sure it is or depends on the order of insertions/deletions. Do you know?

It's not canonical, so we cannot have just a linear one-to-one comparison. I've experimented with a couple of alternatives I think it works pretty well now. I've managed to use the order of nodes in the tree to optimize it a bit.

I have updated the benchmark results in the description, for the comparison of the version before and after the last commit see https://github.com/serokell/motoko-base/pull/37 .

crusso commented 2 weeks ago

It's not canonical, so we cannot have just a linear one-to-one comparison. I've experimented with a couple of alternatives I think it works pretty well now. I've managed to use the order of nodes in the tree to optimize it a bit.

I have updated the benchmark results in the description, for the comparison of the version before and after the last commit see serokell#37 .

That's much better indeed! Thank you!

GoPavel commented 2 weeks ago

Also, I've optimized intersect for Set and fixed doc comments: https://github.com/serokell/motoko-base/pull/39

crusso commented 1 week ago

Squashed and fast-forward merged with commit 1961fab3ba2d1459ea6bae29c1c997328e03328f

(see https://dfinity.slack.com/archives/D07MAEML9RD/p1731527989433379)