GoPavel commented 1 month ago

This is an MR for the 3rd Milestone of the Serokell's grant about improving Motoko's base library.

The main goal of the PR is to introduce a new functional implementation of the set data structure to the' base' library. Also, it brings a few changes to the new functional map that was added in #664 , #654 .

General changes:

rename PersistentOrderedMap to OrderedMap (same for the OrderedSet)
improve docs

Functional Map changes:

New functionality:

add any/all functions
add contains function
add minEntry/maxEntry

Optimizations:

Store size in the Map, benchmark results

Fixup:

add entriesRev(), remove iter()

NEW functional Set:

The new data structure implements an ordered set interface using Red-Black trees as well as the new functional map from the 1-2 Milestones.

API implemented:

Basic operations (based on the map): put, delete, contains, fromIter, etc
Maps and folds: map, mapFilter, foldLeft, foldRight
Set operations: union , intersect, diff, isSubset, equal
Additional operations (as for the OrderedMap): min/max, all/some

Maintainance support:

Unit, property tests
Documentation

Applied optimizations:

Same optimizations that were useful for the functional map:
- inline node color
- float-out exceeded matching in iteration
- map/filterMap through foldLeft
- direct recursion in foldLeft
- Benchmark results for all four optimizations together
- store size in the root of the tree, benchmark results
- Pattern matching order optimization, benchmark results
- Other optimizations:
- Inline code of OrderedMap instead of sharing it, benchmark results
- intersect optimization: use order of output values to build the resulting tree faster, see https://github.com/serokell/motoko-base/pull/39
- isSubset, equal optimization: use early exit and use order of subtrees to reduce intermediate tree height, see https://github.com/serokell/motoko-base/pull/37

Rejected optimizations:

Nipkow's implementation of set operation [Tobias Nipkow's "Functional Data Structures and Algorithms", 117]. Initially, we were planning to use an implementation of set operations (intersect, union, diff) from Nipkow's book. However, the experiment shows that naive implementation with a simple size heuristic performs better. The benchmark results are comparing 3 versions:
- persistentset_baseline -- original implementation that uses Nipkow's algorithms. However, the black height is calculated before each set operation (the book assumes it's stored).
- persistentset_bh -- the same as the baseline but the black height is stored in each node.
- persistentset -- naive implementation that looks up in a smaller set and modifies a bigger one (it gives us O(min(n,m)log((max(n,m)) which is very close to Nipkow's version). Sizes of sets are also stored but only in the root. The last one outperforms others and keeps a tree slim in terms of byte size. Thus, we have picked it.

Final benchmark results:

Collection benchmarks

	binary_size	generate	max mem	batch_get 50	batch_put 50	batch_remove 50	upgrade
orderedset+100	218_168	186_441	37_916	53_044	121_237	127_460	346_108
trieset+100	211_245	574_022	47_652	131_218	288_429	268_499	729_696
orderedset+1000	218_168	2_561_296	520_364	69_883	158_349	170_418	3_186_579
trieset+1000	211_245	7_374_045	633_440	162_806	383_594	375_264	9_178_466
orderedset+10000	218_168	40_015_301	320_532	84_660	192_931	215_592	31_522_120
trieset+10000	211_245	105_695_670	682_792	192_931	457_923	462_594	129_453_045
orderedset+100000	218_168	476_278_087	3_200_532	98_553	230_123	258_372	409_032_232
trieset+100000	211_245	1_234_038_235	6_826_516	222_247	560_440	549_813	1_525_692_388
orderedset+1000000	218_168	5_514_198_432	32_000_532	115_836	268_236	306_896	4_090_302_778
trieset+1000000	211_245	13_990_048_548	68_228_312	252_211	650_405	642_099	17_455_845_492

set API

	size	intersect	union	diff	equals	isSubset
orderedset+100	100	146_264	157_544	215_871	28_117	27_726
trieset+100	100	352_496	411_306	350_935	201_896	201_456
orderedset+1000	1000	162_428	194_198	286_747	242_329	241_938
trieset+1000	1000	731_650	1_079_906	912_629	2_589_090	4_023_673
orderedset+10000	10000	177_080	231_070	345_529	2_383_587	2_383_591
trieset+10000	10000	3_986_854	21_412_306	5_984_106	46_174_710	31_885_381
orderedset+100000	100000	190_727	267_008	402_081	91_300_348	91_300_393
trieset+100000	100000	178_863_894	209_889_623	199_028_396	521_399_350	521_399_346
orderedset+1000000	1000000	205_022	304_937	464_859	912_901_595	912_901_558
trieset+1000000	1000000	1_782_977_198	2_092_850_787	1_984_818_266	5_813_335_155	5_813_335_151

new set API

	size	foldLeft	foldRight	mapfilter	map
orderedset	100	16_487	16_463	88_028	224_597
orderedset	1000	133_685	131_953	1_526_510	4_035_782
orderedset	10000	1_305_120	1_287_495	28_455_361	51_527_733
orderedset	100000	13_041_665	12_849_418	344_132_505	630_692_463
orderedset	1000000	130_428_573	803_454_777	4_019_592_041	7_453_944_902

crusso commented 2 weeks ago

One quick observation. subset and equality seem very expensive: I wonder if that's because the isSubsetHelper still uses an iterator, that allocates. Maybe just use recursion again?

Perhaps there's even faster way but I don't know enough about red-black trees to suggest one. If the representation is canonical, maybe you can exploit that. But I'm not sure it is or depends on the order of insertions/deletions. Do you know?

GoPavel commented 2 weeks ago

One quick observation. subset and equality seem very expensive: I wonder if that's because the isSubsetHelper still uses an iterator, that allocates. Maybe just use recursion again?

Perhaps there's even faster way but I don't know enough about red-black trees to suggest one. If the representation is canonical, maybe you can exploit that. But I'm not sure it is or depends on the order of insertions/deletions. Do you know?

It's not canonical, so we cannot have just a linear one-to-one comparison. I've experimented with a couple of alternatives I think it works pretty well now. I've managed to use the order of nodes in the tree to optimize it a bit.

I have updated the benchmark results in the description, for the comparison of the version before and after the last commit see https://github.com/serokell/motoko-base/pull/37 .

crusso commented 2 weeks ago

It's not canonical, so we cannot have just a linear one-to-one comparison. I've experimented with a couple of alternatives I think it works pretty well now. I've managed to use the order of nodes in the tree to optimize it a bit.

I have updated the benchmark results in the description, for the comparison of the version before and after the last commit see serokell#37 .

That's much better indeed! Thank you!