fidelity / stoke

A lightweight wrapper for PyTorch that provides a simple declarative API for context switching between devices, distributed modes, mixed-precision, and PyTorch extensions.
https://fidelity.github.io/stoke/
Apache License 2.0

DDP training gets terminated in the middle of the training because of some SIGKILL received by a PID (forked child process) #31

Closed rushi-the-neural-arch closed 2 years ago

rushi-the-neural-arch commented 2 years ago

Describe the bug

Traceback: Signal 9 (SIGKILL) received by PID 10398

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 10401 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 10398) of binary: /anaconda/envs/py37_default/bin/python
Traceback (most recent call last):
  File "/anaconda/envs/py37_default/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda/envs/py37_default/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/..../torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/.../torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/.../torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/.../torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/.../torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jupyte.../torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

To Reproduce

Just a normal Stoke training script:


  stoke_model = Stoke(
      model=model,
      verbose=True,
      optimizer=optimizer,
      loss=loss,
      batch_size_per_device=opt.batchSize,
      gpu=True,
      fp16=None,
      distributed=DistributedOptions.ddp.value,
      fairscale_oss=True,
      fairscale_sddp=True,
      grad_accum_steps=1,
      configs=[amp_config, ddp_config, oss_config],
      grad_clip=ClipGradNormConfig(max_norm=opt.grad_clip, norm_type=2.0),
  )
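
For reference, the amp_config, ddp_config, and oss_config objects passed above are built from Stoke's config classes. A rough sketch of how that part looks (the parameter values and the opt.local_rank argument here are placeholders for illustration; the real configuration is in the gist linked below):

# Placeholder values -- see the linked gist for the actual configuration
from stoke import AMPConfig, DDPConfig, OSSConfig

# GradScaler-style mixed-precision settings (not active in this run since fp16=None)
amp_config = AMPConfig(init_scale=2.0 ** 14)

# DDP settings; torch.distributed.launch passes --local_rank to each process
ddp_config = DDPConfig(local_rank=opt.local_rank)

# Fairscale optimizer-state-sharding (OSS) settings, left at defaults
oss_config = OSSConfig()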

def train(train_dataloader, stoke_model: Stoke, scheduler1, scheduler2, epoch: int):

    example_ct = 0  # number of examples seen
    batch_ct = 0
    sum_loss = 0

    stoke_model.print_on_devices(f"Starting Epoch {epoch + 1}")
    stoke_model.model_access.train()

    for idx, (inputs, targets) in enumerate(train_dataloader):

        # call the model through the stoke object interface
        outputs = stoke_model.model(inputs)
        train_loss = stoke_model.loss(outputs, targets)

        stoke_model.print_ema_loss(prepend_msg=f"Step {idx+1} -- EMA Loss")

        # Call backward through the stoke object interface
        stoke_model.backward(loss=train_loss)

        # Call step through the stoke object interface
        stoke_model.step()
        scheduler1.step()
        scheduler2.step()

        sum_loss += train_loss

        example_ct += len(inputs)
        batch_ct += 1

        # Report metrics every 50th batch
        if ((batch_ct + 1) % 50) == 0:
            train_log(train_loss, example_ct, epoch)
            #print(train_loss,  example_ct, epoch)

    avg_loss = sum_loss / len(train_dataloader)

    return avg_loss

for epoch in tqdm(range(epochs), leave=True):

    train_loss = train(train_dataloader, stoke_model, scheduler1, scheduler2, epoch)
    val_loss = validate(val_dataloader, stoke_model, epoch)
    save_checkpoint(stoke_model, epoch, train_loss, val_loss)

The actual script is posted here - https://gist.github.com/rushi-the-neural-arch/bee47ba87e5ddabf0cb47def9bc0b013

  1. Ran the script as - env CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 Stoke-DDP.py --projectName "Stoke-4K-2X-DDP" --batchSize 18 --nEpochs 2 --lr 1e-3 --weight_decay 1e-4 --grad_clip 0.1

  2. Error produced is - WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 10401 closing signal SIGTERM

Expected behavior

Okay, so I know this issue is more of a PyTorch DDP concern than a Stoke issue. Many users seem to face this problem, and there doesn't appear to be any definitive solution apart from downgrading the torch version. For example, in a workaround posted just 2 days ago - https://github.com/pytorch/pytorch/issues/67538 - the user downgraded torch from 1.10 to 1.8, which solved this particular issue. But since Stoke requires torch >= 1.8.1, I guess that would not be possible for us. Torch 1.10 was only recently rolled out, so they might not have fixed this on their end yet, but do you happen to know of any alternative approach/solution for this?

To give you a bit more context: I first trained a very lightweight sample network and could train it easily even with larger batch sizes. After a few experiments, I switched to a heavier network (~4.5M parameters), and that is when this error started occurring. Initially I thought it might be due to more load being exerted on the RAM, so I decreased the batch size to 1, removed the gradient accumulation step, and played around with the num_workers parameter, but none of that solved the error. In fact, what I have noticed is that the error occurs mid-training, exactly after 125 steps, which seems weird since nothing in the code does anything special after 125 steps or any other specific step count.

EDIT - I tried FP16 training and the error still persists, but now it occurs after 145 steps.
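
Side note for anyone debugging something similar: a quick check to confirm whether GPU memory is creeping up step by step would be to log the allocated CUDA memory every few steps inside the training loop. This is just a sketch of the idea (it is not in the gist), reusing the batch_ct counter and Stoke's print_on_devices helper from the script above:

import torch

# e.g. right after stoke_model.step() in the training loop
if (batch_ct + 1) % 50 == 0 and torch.cuda.is_available():
    allocated_mb = torch.cuda.memory_allocated() / 1024 ** 2
    reserved_mb = torch.cuda.memory_reserved() / 1024 ** 2
    # If "allocated" grows steadily across steps, something is holding on to
    # references (for example, an un-detached loss tensor in a running sum)
    stoke_model.print_on_devices(
        f"Step {batch_ct + 1}: allocated={allocated_mb:.1f} MB, reserved={reserved_mb:.1f} MB"
    )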

Screenshots/Code Snippets

(screenshot of the failure attached in the original issue)

Environment:

ncilfone commented 2 years ago

@rushi-the-neural-arch :wave:

Hmmmm... return code -9 means it was most likely a SIGKILL, probably due to an OOM kill from the OS or from CUDA (sometimes these errors throw nothing helpful).

I think you are making a mistake in the summed loss (not detaching the tensor from the graph). See the added comments below on part of your code:


def train(train_dataloader, stoke_model: Stoke, scheduler1, scheduler2, epoch: int):

    for idx, (inputs, targets) in enumerate(train_dataloader):

        # call the model through the stoke object interface
        outputs = stoke_model.model(inputs)

        # This is the loss tensor(s)... remember here that this tensor is still attached to the compute graph and Stoke handles it no differently than base torch
        train_loss = stoke_model.loss(outputs, targets)

        ### This is just an ema of the step loss -- shouldn't ever get reset by Stoke
        stoke_model.print_ema_loss(prepend_msg=f"Step {idx+1} -- EMA Loss")

        # Call backward through the stoke object interface
        stoke_model.backward(loss=train_loss)

        # Call step through the stoke object interface
        stoke_model.step()
        scheduler1.step()
        scheduler2.step()

        ### I think this is where you have a mistake that would lead to an OOM SIGKILL -- typically you call .detach() on the tensor to remove it from the graph when you want to create a running sum of the loss
        # sum_loss += train_loss
        ### Swap with this handy stoke function that will sync and detach across all devices
        ### https://fidelity.github.io/stoke/reference/stoke/stoke/#detach_and_sync_loss
        sum_loss += stoke_model.detach_and_sync_loss(loss=train_loss)

    avg_loss = sum_loss / len(train_dataloader)

    return avg_loss
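
As a standalone illustration of why the detach matters (a toy example, unrelated to your actual model): accumulating an un-detached loss keeps a reference to each step's autograd graph, so the running sum itself becomes a growing graph over the epoch.

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

sum_loss = 0.0
for _ in range(100):
    out = model(torch.randn(4, 10))
    loss = out.pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # BAD: `loss` still has a grad_fn, so `sum_loss` keeps every step's graph alive
    # sum_loss += loss

    # GOOD: keep only the value -- plain torch uses .detach() (or .item()),
    # Stoke wraps this as detach_and_sync_loss(loss=loss) to also sync across devices
    sum_loss += loss.detach()

avg_loss = sum_loss / 100
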
rushi-the-neural-arch commented 2 years ago

@ncilfone Heyy there, apologies for the delay! Thank you so much! This was precisely the mistake I was making (a silly one, I forgot to detach). However, the unhelpful error message led me down a spiral!

Closing this issue as it is solved!