IBM / unitxt

🦄 Unitxt: a python library for getting data fired up and set for training and evaluation
https://www.unitxt.ai
Apache License 2.0

Seed control in unitxt #549

Open yoavkatz opened 9 months ago

yoavkatz commented 9 months ago

Today, unitxt uses a default seed (42) for all datasets, and there is currently no way to change it. Changing the seed could affect the dataset significantly because of the random choices made along the way, so it should be controllable.

I initially thought it should be a parameter of the standard recipe, so that it is explicit (and also to ensure that HF caching works correctly).

However, if we load multiple recipes and each one sets its own seed, the seeds will collide.

@elronbandel @matanor - what do you think? I don't see a good solution.

matanor commented 9 months ago

I think it would be nice if we could have params set on the recipe (meaning, passed between operators along the pipeline) that the individual operators could then access. Then you could set a sub_seed there, and operators could use it to create their random generators based on the per-pipeline sub_seed and the global seed. Maybe that could be done by adding a dict of params to StreamingOperator?
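To make the idea concrete, here is a minimal sketch of deriving a random generator from the global seed plus a per-pipeline sub_seed. It is not unitxt code: `make_generator`, `GLOBAL_SEED`, and the example sub_seed strings are hypothetical names used only for illustration, and hashing the two values together is just one possible way to combine them.

```python
import hashlib
import random

# The default seed mentioned in this issue; the constant name is hypothetical.
GLOBAL_SEED = 42


def make_generator(sub_seed: str, global_seed: int = GLOBAL_SEED) -> random.Random:
    """Return a private generator derived from the global seed and a sub_seed.

    Hypothetical helper, not part of unitxt. Hashing the pair gives each
    pipeline its own reproducible stream without touching the module-level
    `random` state that other pipelines may share.
    """
    key = f"{global_seed}/{sub_seed}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer seed.
    return random.Random(int.from_bytes(digest[:8], "big"))


# Two pipelines with different sub_seeds draw from independent streams,
# while the same sub_seed always reproduces the same stream.
rng_a = make_generator("recipe_a")
rng_b = make_generator("recipe_b")
sample_a = [rng_a.randint(0, 99) for _ in range(5)]
sample_b = [rng_b.randint(0, 99) for _ in range(5)]
```

Because each operator would hold its own `random.Random` instance, loading several recipes in the same process would not let one recipe's seed overwrite another's.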

Without that, a potential solution is setting the sub_seed on the instances, like we do with other pipeline-level params (e.g. the names of the metrics). That's IMO not a very nice long-term solution, but maybe it's OK, not sure.
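A rough sketch of what the instance-level variant could look like, assuming instances are plain dicts flowing through generator-based steps. The function names, the `sub_seed` and `choices` field names, and the shuffling operator are all hypothetical, chosen only to illustrate the idea:

```python
import random
from typing import Any, Dict, Iterable, Iterator

Instance = Dict[str, Any]


def stamp_sub_seed(instances: Iterable[Instance], sub_seed: str) -> Iterator[Instance]:
    """Hypothetical first step of a pipeline: copy each instance and attach the sub_seed."""
    for instance in instances:
        yield {**instance, "sub_seed": sub_seed}


def shuffle_choices(instances: Iterable[Instance], global_seed: int = 42) -> Iterator[Instance]:
    """Hypothetical operator that reads the sub_seed back from each instance.

    The generator is seeded from the global seed, the instance's sub_seed,
    and the instance index, so the result is reproducible per pipeline and
    different instances do not all get the same permutation.
    """
    for index, instance in enumerate(instances):
        rng = random.Random(f"{global_seed}/{instance['sub_seed']}/{index}")
        choices = list(instance["choices"])  # copy, so the input is not mutated
        rng.shuffle(choices)
        yield {**instance, "choices": choices}


data = [{"choices": ["a", "b", "c", "d"]} for _ in range(3)]
shuffled = list(shuffle_choices(stamp_sub_seed(data, "recipe_a")))
```

The downside mentioned above shows up here too: every instance carries a field that is really a property of the whole pipeline.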

What do you think?