Open collinmccarthy opened 2 years ago
Apologies, I was inspecting the model weights before `create_ddp_model()`, which correctly syncs the parameters and buffers with rank 0. That means `default_setup()` works and I do not need a separate call to `seed_all_rng()`. I do, however, still need to pass the seed to the `TrainingSampler` for deterministic runs to work (in addition to setting the cuDNN deterministic flag to True).
If `_train_loader_from_config()` and `build_detection_train_loader()` simply took an optional `seed` argument and passed it to the samplers, I could just pass in my manual seed and that would solve the remaining issue.
Hello,
This is related to issues #2121 and #2615, but neither of these addressed my concerns.
When using an explicit seed with DDP, e.g. `cfg.SEED=0`, I expected the following:

Instead, what I see is:
I think issue (1) above is particularly concerning, and (2) does impact deterministic behavior (which wasn't discussed in issue #2121). Issue (3) is fine as long as the seed already takes the rank into account.
My workaround is as follows:

1. Call `seed_all_rng(cfg.SEED)` before creating the trainer
2. Call `seed_all_rng(cfg.SEED + rank)`
3. Call `build_detection_train_loader()` explicitly

Am I missing something about the underlying issues or the current workflow for setting the seed? Is there an easier or better workaround than what I'm proposing? Is this going to be "fixed" in the future, or is this documented, expected behavior that users should be aware of?
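As a rough sketch of the seeding order in the workaround above, the following standard-library simulation seeds with the shared value before "building the model" and with `seed + rank` afterwards. The names `seed_all`, `init_weights`, and `simulate_rank` are hypothetical stand-ins, not detectron2 functions, and it assumes the rank-specific call happens after the model is created.

```python
import random


def seed_all(seed):
    """Hypothetical stand-in for seed_all_rng(); seeds only Python's RNG."""
    random.seed(seed)


def init_weights(n):
    """Stand-in for model construction: draws n pseudo-random 'weights'."""
    return [random.random() for _ in range(n)]


def simulate_rank(base_seed, rank):
    """Run the two seeding steps for one simulated rank."""
    seed_all(base_seed)          # shared seed before building the model
    weights = init_weights(4)
    seed_all(base_seed + rank)   # rank-specific seed afterwards (assumed order)
    augmentation_draws = [random.random() for _ in range(4)]
    return weights, augmentation_draws


w0, a0 = simulate_rank(0, rank=0)
w1, a1 = simulate_rank(0, rank=1)
```

In this simulation both ranks end up with identical initial weights (`w0 == w1`), while their later random draws differ (`a0 != a1`), which is the behavior the first two steps are meant to produce.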
Thank you, -Collin
Environment:
Instructions To Reproduce the Issue:
Any project that uses `default_setup` and an explicit seed will reproduce these issues. Example: DeepLab
Observed by setting a breakpoint and inspecting weights on rank 0 and rank 1.