CitrineInformatics / lolo

A random forest
Apache License 2.0

Splittable random numbers for reproducible training #259

Open bfolie opened 2 years ago

bfolie commented 2 years ago

Bagger and MultiTaskBagger both train the individual models in parallel. Because the order of training is uncontrolled, Lolo random forests are inherently non-reproducible, even if the bagging and the RNGs for the base learners are identical.

There are ways of guaranteeing reproducibility across multiple threads, and we should make use of them. Two references: SplittableRandom in Java, and a discussion in the context of numpy.
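To make the idea concrete, here is a minimal sketch using `java.util.SplittableRandom` from the Java standard library. It is not Lolo's `Bagger` code: the class `ReproducibleBagging` and the methods `trainOne` and `trainAll` are hypothetical names, and `trainOne` is only a stand-in for training a base learner. The point it illustrates is that all generators are split off sequentially before any parallel work starts, so model `i` always sees the same random stream regardless of thread scheduling.

```java
import java.util.Arrays;
import java.util.SplittableRandom;
import java.util.stream.IntStream;

public class ReproducibleBagging {

    // Hypothetical stand-in for training one base learner: it only consumes
    // random numbers from its own generator and returns a summary value.
    static double trainOne(SplittableRandom rng) {
        double sum = 0.0;
        for (int i = 0; i < 1000; i++) {
            sum += rng.nextDouble();
        }
        return sum;
    }

    static double[] trainAll(long seed, int numModels) {
        SplittableRandom root = new SplittableRandom(seed);

        // Split on a single thread, in a fixed order, so that the generator
        // assigned to model i depends only on the seed and on i.
        SplittableRandom[] rngs = new SplittableRandom[numModels];
        for (int i = 0; i < numModels; i++) {
            rngs[i] = root.split();
        }

        // Train in parallel. Each task touches only its own generator, and
        // toArray() on an ordered stream keeps results in index order.
        return IntStream.range(0, numModels)
                .parallel()
                .mapToDouble(i -> trainOne(rngs[i]))
                .toArray();
    }

    public static void main(String[] args) {
        double[] first = trainAll(42L, 8);
        double[] second = trainAll(42L, 8);
        System.out.println("identical across runs: " + Arrays.equals(first, second));
    }
}
```

The design choice that matters is doing the splits up front rather than inside the parallel tasks: if each task called `root.split()` itself, the assignment of streams to models would again depend on scheduling.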

iterateccvoelker commented 2 years ago

Hi, how is it going? Is there any update on this issue? Thank you in advance for a brief message! Best, Christoph

bfolie commented 2 years ago

Thanks for asking, @BAMcvoelker. To be honest, we hadn't thought about it in a while, but after seeing your comment we realized we have all of the tools and just need to thread them through.

We open-sourced our splittable random number library, which means it's available to pull into Lolo. I will pull it in soon and use it to make bagged training reproducible.

iterateccvoelker commented 2 years ago

Thank you so much @bfolie for the update and for picking up the topic again. I look forward to the update!