Closed — rushi-the-neural-arch closed this issue 3 years ago
Ugh... this is a series of typos in the quick start documentation on my end. `DistributedOptions` is an Enum of acceptable values, so you have to call `distributed=DistributedOptions.ddp.value` to get the representative value (in this case the return is the string `"ddp"`), which is why passing in the string `"ddp"` works.
Updating the docs to reflect this mistake in my local fix branch
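For reference, the Enum-vs-string mismatch can be sketched like this (a minimal sketch — the member value shown for `DistributedOptions.ddp` is an assumption, not copied from Stoke's source):

```python
from enum import Enum

# Hypothetical stand-in for Stoke's DistributedOptions enum
class DistributedOptions(Enum):
    ddp = "ddp"

# The enum member itself does not compare equal to the plain string,
# so an internal check against "ddp" fails and DDP stays False:
assert DistributedOptions.ddp != "ddp"

# .value returns the underlying string, which does match:
assert DistributedOptions.ddp.value == "ddp"
```

This is the standard `enum.Enum` behavior in Python: members are singletons distinct from their values, so config code comparing against raw strings needs `.value` (or the enum should subclass `str` to compare transparently).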
Yes this fixes the issue!
Describe the bug
Setting the `distributed=DistributedOptions.ddp` argument in the Stoke class as per the documentation doesn't seem to set the DDP parameter to true — the error states `DDP (currently: False)`. However, I could find a workaround for this bug by setting `distributed="ddp"`.
Error msg:
ValueError: Stoke -- Fairscale extensions (currently: oss: True, sddp: True) requires CUDA (currently: True), GPU (currently: True), DDP (currently: False) and NCCL (currently: True)
To Reproduce
The sample script is posted here - Stoke-DDP
Just change the `distributed="ddp"` parameter to `distributed=DistributedOptions.ddp` in the Stoke class arguments to reproduce the bug:
`python -m torch.distributed.launch Stoke-DDP.py --projectName "PyTorch-4K-2X" --batchSize 20 --nEpochs 2 --lr 1e-3 --threads 8`
Expected behavior
Initialise the DDP-based training.
Screenshots/Code Snippets
Environment: