dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml

sending no seed to mlcontext does not actually randomize stuff #7259

Open superichmann opened 1 month ago

superichmann commented 1 month ago


Describe the bug
Initializing an MLContext without any seed (new MLContext()) and training on the same data does not actually result in different models from Regression.Trainers.FastForest() or Regression.Trainers.LightGbm().

To Reproduce
Steps to reproduce the behavior:

  1. Initialize an MLContext without a seed.
  2. Train a model on data with FastForest.
  3. Initialize a different MLContext without a seed.
  4. Train a model on the exact same data with FastForest.
  5. Compare the predictions of both models (in my case, I compared 23,645 predictions).

Repeat the process with LightGbm. A C# sketch of these steps is shown below.
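For reference, here is a minimal C# sketch of the steps above. The data class, synthetic data, and column names are illustrative assumptions, not taken from the original report (which compared 23,645 predictions on its own dataset):

using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

class Row
{
    [VectorType(10)]
    public float[] Features;
    public float Label;
}

class Program
{
    static float[] TrainAndPredict(MLContext ml, Row[] data)
    {
        var trainData = ml.Data.LoadFromEnumerable(data);
        var model = ml.Regression.Trainers
            .FastForest(labelColumnName: "Label", featureColumnName: "Features")
            .Fit(trainData);
        // Score the same rows and read the predictions back out.
        return model.Transform(trainData).GetColumn<float>("Score").ToArray();
    }

    static void Main()
    {
        // Fixed synthetic data so both runs train on the exact same rows.
        var rng = new Random(0);
        var data = Enumerable.Range(0, 1000).Select(i => new Row
        {
            Features = Enumerable.Range(0, 10).Select(j => (float)rng.NextDouble()).ToArray(),
            Label = (float)rng.NextDouble()
        }).ToArray();

        // Steps 1-4: two contexts created without a seed, trained on the same data.
        var pred1 = TrainAndPredict(new MLContext(), data);
        var pred2 = TrainAndPredict(new MLContext(), data);

        // Step 5: compare predictions. Per this report they come out identical,
        // even though no seed was supplied to either context.
        Console.WriteLine(pred1.SequenceEqual(pred2));
    }
}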

Expected behavior
As I see it, the first FastForest's predictions should differ from the second FastForest's predictions; the same goes for LightGbm.

Even the slightest change in the randomness of the bootstrapped dataset selection should produce different results.

It seems that FastForest and LightGbm under ML.NET are not so random after all. :{

Further Research (please read this too)
I played with LightGBM in Python and was able to introduce randomness into it with these two params: 'feature_fraction': 0.2, 'seed': rand_num. Removing either one also removes the randomness from the results; see the code:

import lightgbm as lgb
import pandas as pd
import numpy as np

# Synthetic regression data: 1000 training rows, 10 held-out rows, 10 features.
np.random.seed(42)
num_train_samples = 1000
num_test_samples = 10
num_features = 10
X = np.random.rand(num_train_samples + num_test_samples, num_features)
y = np.random.uniform(0, 2, num_train_samples + num_test_samples)
y = y + np.random.normal(0, 0.2, num_train_samples + num_test_samples)
X_train, X_test = X[:num_train_samples], X[num_train_samples:]
y_train, y_test = y[:num_train_samples], y[num_train_samples:]

# Identical params except for the seed; feature_fraction < 1.0 is what lets
# the seed actually influence training.
params1 = {
    'objective': 'regression',
    'verbose': -1,
    'feature_fraction': 0.2,
    'seed': 42
}
params2 = {
    'objective': 'regression',
    'verbose': -1,
    'feature_fraction': 0.2,
    'seed': 43
}

# Train two boosters on the exact same data, differing only in their seed.
model1 = lgb.train(params1, lgb.Dataset(X_train, y_train), num_boost_round=1000)
model2 = lgb.train(params2, lgb.Dataset(X_train, y_train), num_boost_round=1000)

# With feature_fraction set, the two models give different predictions.
y_pred1 = model1.predict(X_test)
y_pred2 = model2.predict(X_test)
results = pd.DataFrame({'true_value': y_test, 'model1_pred': y_pred1, 'model2_pred': y_pred2})
print(results)

Workarounds
For LightGbm, a workaround is to set FeatureFraction in the trainer options; for FastForest, passing Seed as part of FastForestRegressionTrainer.Options works around the issue. A sketch of both follows.
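For concreteness, a minimal C# sketch of both workarounds. The column names and seed values are illustrative, and it assumes FeatureFraction is reached through the LightGbm Booster options; treat this as a sketch rather than the report's exact code:

using System;
using Microsoft.ML;
using Microsoft.ML.Trainers.FastTree;
using Microsoft.ML.Trainers.LightGbm;

var ml = new MLContext();

// FastForest: supply a varying Seed through the trainer options.
var fastForest = ml.Regression.Trainers.FastForest(new FastForestRegressionTrainer.Options
{
    LabelColumnName = "Label",          // illustrative column names
    FeatureColumnName = "Features",
    Seed = new Random().Next()          // vary this yourself to get different forests
});

// LightGbm: a varying Seed only takes effect once FeatureFraction < 1.0 is
// also set, mirroring the Python experiment above.
var lightGbm = ml.Regression.Trainers.LightGbm(new LightGbmRegressionTrainer.Options
{
    LabelColumnName = "Label",
    FeatureColumnName = "Features",
    Seed = new Random().Next(),
    Booster = new GradientBooster.Options { FeatureFraction = 0.2 }
});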