Optimize tree construction

igarnier commented 2 years ago

This PR optimizes tree construction. More precisely, it optimizes the process by which an optimal split of the dataset is computed. In master, this split is performed as follows:

create a list of pairs of non-constant features together with the set of values associated to that feature
for each pair feature, value, partition the dataset
find partition that minimizes cost

In this PR, we proceed as follows:

for each non-constant feature, split the dataset into buckets v1, b1; ...; vn, bn indexed by values sorted in increasing order
fold over partitions and find the one minimizing the cost, the fold is easily computed as the sequence b1, b2 @ ... @ bn; b1 @ b2; b3 @ ... @ bn etc (using rev_append instead of @)

It's probably possible to optimize even further using a data structure with O(1) concatenation instead of lists (eg https://gitlab.inria.fr/fpottier/sek)

UnixJunkie commented 2 years ago

I am not a fan of the '|>' operators (I have been living without it for years).

If the Sek data structure makes things even faster, then I am OK we start to use it.

UnixJunkie commented 2 years ago

You can kill the dead code. Note that you don't need to do any change to this PR. I'll merge it as is then maybe just do some syntactic changes.

UnixJunkie commented 2 years ago

this shaves a few more seconds in the classifier case

UnixJunkie / orf

Optimize tree construction #18