fidelity / stoke

A lightweight wrapper for PyTorch that provides a simple declarative API for context switching between devices, distributed modes, mixed-precision, and PyTorch extensions.
https://fidelity.github.io/stoke/
Apache License 2.0

ValueError: Stoke -- Fairscale extensions (currently: oss: True, sddp: True) requires CUDA (currently: True), GPU (currently: True), DDP (currently: False) and NCCL (currently: True) #26

Closed rushi-the-neural-arch closed 2 years ago

rushi-the-neural-arch commented 2 years ago

Describe the bug

Setting the distributed=DistributedOptions.ddp argument in the Stoke class as per the documentation doesn't seem to enable DDP: the error still reports DDP (currently: False). However, I found a workaround for this bug by setting distributed="ddp" instead.

Error msg - ValueError: Stoke -- Fairscale extensions (currently: oss: True, sddp: True) requires CUDA (currently: True), GPU (currently: True), DDP (currently: False) and NCCL (currently: True)

To Reproduce

The sample script is posted here - Stoke-DDP

Just change the distributed="ddp" parameter to distributed=DistributedOptions.ddp in the Stoke class arguments to reproduce the bug.

python -m torch.distributed.launch Stoke-DDP.py --projectName "PyTorch-4K-2X" --batchSize 20 --nEpochs 2 --lr 1e-3 --threads 8

Expected behavior

Initialise the DDP based training

Screenshots/Code Snippets


    stoke_model = Stoke(
        model=model,
        verbose=False,  # verbose just prints out stuff, throws an error somewhere so disabled it
        optimizer=optimizer,
        loss=loss,
        batch_size_per_device=opt.batchSize,
        gpu=True,
        fp16=None,  # FP16Options.amp
        distributed=DistributedOptions.ddp,  # "ddp" works as a workaround
        fairscale_oss=True,
        fairscale_sddp=True,
        grad_accum_steps=4,
        grad_clip=opt.grad_clip,
        configs=[amp_config, ddp_config, oss_config]
    )



ncilfone commented 2 years ago

Ugh... this is a series of mistypes in the quick start documentation on my end...

DistributedOptions is an Enum of acceptable values, so you have to pass distributed=DistributedOptions.ddp.value to get the underlying value (in this case the string "ddp"), which is why passing the string "ddp" directly works.

Updating the docs to reflect this mistake in my local fix branch
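To illustrate the maintainer's point, here is a minimal sketch of why the Enum member and its value are not interchangeable. The DistributedOptions class below is a hypothetical stand-in mirroring the behavior described in this thread, not the actual definition from stoke:

```python
from enum import Enum

# Hypothetical stand-in for stoke's DistributedOptions, assuming plain
# (non-str-mixin) Enum members whose values are backend-name strings.
class DistributedOptions(Enum):
    ddp = "ddp"
    horovod = "horovod"

# A plain Enum member does NOT compare equal to its underlying string,
# so a library check like `if distributed == "ddp"` fails for the member...
assert DistributedOptions.ddp != "ddp"

# ...but .value unwraps the member to the string the check expects.
assert DistributedOptions.ddp.value == "ddp"
```

This is standard Python Enum behavior: members compare by identity, so code that compares against raw strings only matches when the caller passes the `.value` (or the string itself).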

rushi-the-neural-arch commented 2 years ago

Yes this fixes the issue!