dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems
Other
182 stars 61 forks source link

Reuse Random Seeds from Existing Models #848

Closed shaycrk closed 3 years ago

shaycrk commented 3 years ago

WIP -- still needs testing and documentation, but would appreciate feedback on the approach here.

Because model-level random seeds are generated by random draws (seeded by the experiment-level random seed), any changes in the grid configuration will result in new seeds being generated even for model configurations that have already been run, causing new models to be trained even when this is not intended.

To address this, this PR looks for existing models matching the current parameters (model group + train matrix metadata) where the experiment-level random seed matches the current experiment, then re-uses that model's random seed (choosing the model with the most recent run_time if more than one is found).

shaycrk commented 3 years ago

Also, one related question: given that we exclude the random seed from the experiment_hash but store the seed in triage_metadata.experiments (for which the experiment_hash is a primary key), what is the intended behavior when an experiment is re-run with only the seed changed? Should the random seed be stored in experiment_runs instead?

shaycrk commented 3 years ago

Did a little testing and added a unit test for this functionality, so I think this is generally working. Switching the PR out of draft mode.

shaycrk commented 3 years ago

Thanks @thcrock -- any chance you have thoughts on the other question: given that we exclude the random seed from the experiment_hash but store the seed in triage_metadata.experiments (for which the experiment_hash is a primary key), what is the intended behavior when an experiment is re-run with only the seed changed? Should the random seed be stored in experiment_runs instead?

thcrock commented 3 years ago

I think experiment_runs makes sense.

shaycrk commented 3 years ago

It looks like the random seed was being stored to both triage_metadata.experiments and triage_metadata.experiment_runs, but with no clear intended behavior for when an experiment config is re-run with a different seed, so pushing an update to remove that column from the experiments table and just rely on experiment_runs for that info.

In doing so, I also ran into an issue where alembic was trying to drop all the partitioned predictions table, so added an exclusion for those tables in future runs based on the discussion here and this gist

One question for review:

thcrock commented 3 years ago

I think just getting the schema right on downgrade is sufficient. It's not just behavior that we are changing, it's behavior we never had a use for in the first place. So to me this makes sense to not be as strict about reproducing the exact state of the database on downgrade.

shaycrk commented 3 years ago

Thanks @thcrock -- that makes sense to me! I think I have everything in good shape now and the tests all working, so please take another look when you have a chance.