forestry-labs / Rforestry

https://forestry-labs.github.io/Rforestry/
34 stars 9 forks source link

`seed` same in different objects #120

Open petrovicboban opened 1 year ago

petrovicboban commented 1 year ago

Let's have

form random_forestry import RandomForest

rf_1 = RandomForest()
rf_2 = RandomForest()

Since default value for seed is random number, we expect:

assert rf_1.seed != rf_2.seed

but that's not what's happening.

although rf_1 and rf_2 are really different objects:

assert id(rf_1) != id(rf_2)

their attributes which have default value set are not:

assert id(rf_1.seed) == id(rf_2.seed)

Python does copy on write here, so if value is changed later, it will create new seed object then:

rf_1.seed = 5
assert id(rf_1.seed) != id(rf_2.seed)

The implication is that if you create different RandomForest objects with seed not explicitly set, seed will be random but all objects will have the same value. Is this acceptable to you? @edwardwliu @theo-s

edwardwliu commented 1 year ago

It sounds like two RandomForest objects will have different seeds when instantiated, but the objects are considered equal since the default value of seed=randrange() is the same. Is this still an issue if we the class explicitly assigns the seed in an initialization or post initialization step (even when the user doesn't provide one)?

petrovicboban commented 1 year ago

Two RandomForest objects will have the same seeds when instantiated. Although, it you re-run script, seed value will be different, but again, the same for both objects. I don't think we can do it in __init__ and still pass all tests in test_sklearn_compatible_estimator. The only way to do it is to do it at the start of fit like we do for some other parameters: leave the parameter in __init__ but set default value to None, and then, in fit, check if it's None, and set random value if it is.

petrovicboban commented 1 year ago

Actually, we can't do it. We'll have to move seed to fit parameters.