dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems

random seed in parameters #948

Open silil opened 4 weeks ago

silil commented 4 weeks ago

Unfortunately, the random seed wasn't being set for any of the algorithms used in Triage. This leads to non-reproducible results, mainly in the generated outreach lists.

This bug was identified while comparing the outputs Triage generated on the JoCo and DSaPP sides.

This pull request adds the random seed to the hyperparameters of each model configured in the YAML file, placing it in the parameters and unique_parameters dictionaries once the random seed has been generated. The changes affect the script src/triage/component/catwalk/model_trainers.py.

The tests associated with this change can be found in this notebook in the DoJo repository, where 3 different models were trained both inside and outside of Triage using the same hyperparameters that Triage stored in the trained model, and the resulting scores were compared to verify that they are the same in both cases.
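
In spirit, the check is a toy version of the following sketch (this is not the DoJo notebook itself; the data and learner are illustrative): train the same learner twice with identical hyperparameters and random_state, and confirm the scores match.

```python
# Toy illustration of the reproducibility check (not the DoJo notebook):
# identical data, hyperparameters, and random_state must yield identical scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
params = {"n_estimators": 100, "max_depth": 5, "random_state": 1234}

scores_a = RandomForestClassifier(**params).fit(X, y).predict_proba(X)[:, 1]
scores_b = RandomForestClassifier(**params).fit(X, y).predict_proba(X)[:, 1]
assert np.allclose(scores_a, scores_b)  # reproducible once the seed is fixed
```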

nanounanue commented 3 weeks ago

Hi @silil !

I will review this on my flight to Chile this Sunday. It seems correct at first glance, but I want to double-check some related code/design decisions we made at DSaPP a while ago regarding this parameter. If I recall correctly, we had a discussion about this behavior. Let me review it, and I will add comments to this PR accordingly.

silil commented 2 weeks ago

@KasunAmare do you have any comments?

silil commented 2 weeks ago

@nanounanue Did you have a chance to look at the changes?

nanounanue commented 2 weeks ago

I am on it. But still untangling it.

My concern is that at some point random_seed was added (#671) (including the instructions in experiment.yaml), but that caused the experiment_hash to change when the seed wasn't set (if I remember correctly, issues #726 and #661), and that behaviour was fixed in #799. Sometime later @shaycrk patched a problem with existing models (they didn't use the original random seed), which was fixed in #848. And there are methods like https://github.com/dssg/triage/blob/18bbbe49d69b5c905547b6b301bd0c4d3c35ad65/src/triage/component/catwalk/utils.py#L253 (added in #848).

And then there was an issue where, when no random_seed is specified, Python's random generator didn't work correctly in multithreaded code (maybe I am misusing that word). But I am not able to find that discussion :(

So, I am still checking all of those behaviours.

kasunamare commented 2 weeks ago

Hey @nanounanue, from the code and the chat I had with @silil, I feel like the only change in this PR is making sure that we use the specified random seed when we initialize the model object. Currently, the random seeds that are generated at the experiment level are not being passed to the constructor of the model class, and this PR forces that by making the seed part of the parameters. This means that if the random seed changes (either intentionally or by not setting one in the experiment config), it will generate a separate model_hash (and id), and will result in us having multiple instances of the same model_group_id for the same train_end_time. Reading your notes on PR #799, I feel like that's desired behavior?

Am I missing something?

silil commented 2 weeks ago

Thanks @nanounanue.

I didn't change how Triage generates the random seed. This update ensures that the generated random seed is consistently used as part of the hyperparameters when initializing any predictive model (sklearn, xgboost, or lightgbm). Specifically, it now explicitly assigns the random_seed to the random_state parameter. Previously, we didn't specify any value for random_state, which led to non-reproducible results even when using the same matrices and configurations initially used for training a model.

For example, a config file with random_seed: 23895478 will generate a random seed value of 1684690576 for the model. The latter will be saved in the triage_metadata.models table under the random_seed column (and in some other tables), while the former will be saved in the random_seed column of the triage_metadata.triage_runs table. However, the generated model random_seed wasn't being used when initializing the model, so sklearn, xgboost, or lightgbm would assign a different random seed that is not persisted in our pickle object. As a result, the exact results can't be reproduced.
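
Conceptually, the change boils down to something like the following simplified sketch (not the exact model_trainers.py code; the function name and parameter dict are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

def instantiate_learner(params, model_random_seed):
    # Illustrative only: merge the Triage-generated model seed into the
    # learner's parameters so it is actually used at fit time.
    params = dict(params)
    params.setdefault("random_state", model_random_seed)
    return RandomForestClassifier(**params)

clf = instantiate_learner({"n_estimators": 100, "max_depth": 5},
                          model_random_seed=1684690576)
```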

I hope this clarifies things. Let me know if you'd like to go over this in a meeting.

nanounanue commented 6 days ago

Thank you @silil and @kasunamare.

Just one more question, when you mention unreproducible (is this a word?) results ... do you mean the predicted entities list?

If you prefer, we can have a video call to look at this behaviour.

silil commented 4 days ago

@nanounanue @kasunamare:

Following our conversation with @rayidghani last Friday, I made changes to the code. The random seed is now set only at the moment of instantiation, right before fitting the model (see the sketch below). In this way:

  1. The hyperparameters stored in the DB don't include the random seed.
  2. The model hash is not affected by the random seed.
  3. The random seed is included in the model pickle.
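
A simplified sketch of what this looks like, with toy data and a toy classifier (the real code lives in model_trainers.py):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)  # toy data for the sketch

hyperparameters = {"n_estimators": 100, "max_depth": 5}  # what gets stored in the DB and hashed
model_random_seed = 1684690576                           # persisted separately, not part of the hash

# The seed is injected only at instantiation time, right before fit ...
clf = RandomForestClassifier(random_state=model_random_seed, **hyperparameters)
clf.fit(X, y)

# ... so it never touches the stored hyperparameters, but it does travel
# inside the pickled model.
blob = pickle.dumps(clf)
```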

I tested these changes (with @kasunamare) and the results are as follows:

When using a different experiment random seed in the YAML file but keeping the same model and hyperparameters, we observe for each time split:

In the DB: [screenshot from 2024-09-09 10-28-14]

In the pickle: [screenshot from 2024-09-09 10-26-00]

Is that the expected behavior? I assumed changing the random seed in the experiment would result in different model random seeds. What do you think @nanounanue @kasunamare @rayidghani?

nanounanue commented 3 days ago

Regarding the model ID and model hash, I believe that's the desired behavior (or at least, that was the consensus). The rationale is that the random_seed of the model is not a hyperparameter, so it shouldn't change the identifier (i.e., the model’s behavior should be resilient to the seed). However, this doesn't mean we want several models with the same model ID. Again, the predictions shouldn't be heavily affected by the seed, meaning the model should be relatively "stable."

That said, I’m not sure what @rayidghani thinks about this.

Regarding the model random_state, it's generated by the following code: https://github.com/dssg/triage/blob/18bbbe49d69b5c905547b6b301bd0c4d3c35ad65/src/triage/component/catwalk/model_trainers.py#L444

I will check if that's overwriting your code.

nanounanue commented 3 days ago

If the list is very different between runs, maybe the problem is the seed in PostgreSQL (the ranking and all of that).
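
If score ties are broken with random() on the database side, the list is only reproducible when the session's seed is pinned first. A generic sketch of that idea (table, columns, and connection string are made up; this is not necessarily how Triage's ranking works):

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql:///triage_db")  # hypothetical connection string

with engine.connect() as conn:
    conn.execute(text("SELECT setseed(0.42)"))  # pin the session's random() stream
    rows = conn.execute(text(
        "SELECT entity_id FROM predictions "
        "ORDER BY score DESC, random() "  # random tie-breaking
        "LIMIT 100"
    )).fetchall()
```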

silil commented 1 day ago

The lists generated on both sides (DSaPP and JoCo) weren't that different, but we were expecting the same list (we know we have the same data in the DB on both sides) because we were using the same random seed. That's how we found out we weren't using our random seed in the model.

kasunamare commented 1 day ago

Going by the screenshot @silil has posted, it looks like changing the experiment_random_seed doesn't result in a new set of model_random_seed values? I'm guessing that's not our desired behavior?

My understanding of the desired behavior for changing the random seed at the experiment level (sketched in code below):

  1. It generates a series of new model (or model group) level random seeds.
  2. These new model/model group level seeds should not create a new model group (model_group_id), as the seed is not a hyperparameter.
  3. However, they should create new instances of the model group (models) for the same train_end_times (new model_id and model_hash).
  4. This results in multiple instances of the model group for the same train_end_times. If the model is relatively stable, the variance in performance across these models should not be high.

Is this correct @nanounanue? If so, looks like Step 1. is not working?
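
To make the intended behavior concrete, a conceptual sketch of steps 1-3 (this is not Triage's actual hashing code; the hash functions and values are illustrative):

```python
import hashlib
import random

def model_group_hash(params):
    # model_group_id depends only on the model type and hyperparameters (no seed)
    return hashlib.md5(repr(sorted(params.items())).encode()).hexdigest()

def model_hash(params, seed):
    # model_id / model_hash additionally depends on the model-level seed
    return hashlib.md5((repr(sorted(params.items())) + str(seed)).encode()).hexdigest()

experiment_seed = 23895478
rng = random.Random(experiment_seed)                          # step 1: a new experiment seed ...
model_seeds = [rng.randint(0, 2**31 - 1) for _ in range(3)]   # ... yields new model-level seeds

params = {"n_estimators": 100, "max_depth": 5}
print(model_group_hash(params))                        # step 2: unchanged across seeds
print([model_hash(params, s) for s in model_seeds])    # step 3: a new hash per seed
```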

nanounanue commented 9 hours ago

Yes, totally agree.

It shouldn't be a new model group, but maybe a new model_id. So if we change the random seed, we now have several points in each as_of_date per model_group_id. So nothing is broken (apparently) in training, but it will be broken in aequitas, postmodeling (maybe in how the evaluation is calculated? the stochastic one), and plotting, right?