🐞 Describe the Bug
Sequence-data-parallel appears to be currently broken. Example (job 31bd5fee-99aa-4db6-9aed-7014cc5124aa):
```
2024-11-19 22:21:07,580 [Rank 00] Traceback (most recent call last):
  File "/app/fast_llm/tools/cli.py", line 29, in fast_llm
    Runnable.parse_and_run(unparsed)
  File "/app/fast_llm/engine/config_utils/runnable.py", line 36, in parse_and_run
    runnable()
  File "/app/fast_llm/engine/training/config.py", line 373, in runnable
    trainer.run()
  File "/app/fast_llm/engine/training/trainer.py", line 141, in run
    self._run_training()
  File "/app/fast_llm/engine/training/trainer.py", line 155, in _run_training
    done, metrics = self._train()
  File "/app/fast_llm/engine/training/trainer.py", line 210, in _train
    reduced_losses, update_successful, train_metrics = self._runner.run_step(
  File "/app/fast_llm/engine/schedule/runner.py", line 194, in run_step
    self._train_step(context, step)
  File "/app/fast_llm/engine/schedule/runner.py", line 302, in _train_step
    output = self._backward(context, step)
  File "/app/fast_llm/engine/schedule/runner.py", line 407, in _backward
    input_grad = self._stages[step.stage].backward(
  File "/app/fast_llm/engine/multi_stage/stage.py", line 114, in backward
    output.backward(output_grad)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 520, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 288, in backward
    _engine_run_backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py", line 767, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 305, in apply
    return user_fn(self, *args)
  File "/app/fast_llm/layers/transformer/attention.py", line 37, in backward
    grad = y.grad + grad_output
TypeError: unsupported operand type(s) for +: 'NoneType' and 'Tensor'
```
This traceback comes from the summing of KV gradients across the sequence-data-parallel splits; something must have broken at some point. Also, I don't see any SDP coverage in the tests, which is bad.
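For context, here is a minimal sketch of the failing pattern and one possible guard (hypothetical names, not the actual Fast-LLM code). In a custom autograd `backward`, the KV gradient accumulated on a saved tensor is added to the incoming `grad_output`; if nothing ever wrote to `y.grad`, it is `None` and the addition raises exactly this `TypeError`. Treating a missing gradient as zero would avoid the crash, assuming a zero KV gradient is the correct semantics for a split that contributed nothing:

```python
import torch

# Hypothetical reduction of the failing line in attention.py:
#     grad = y.grad + grad_output
# If no backward pass ever populated `y.grad`, it is None and the `+` raises
#     TypeError: unsupported operand type(s) for +: 'NoneType' and 'Tensor'
def accumulate_kv_grad(y: torch.Tensor, grad_output: torch.Tensor) -> torch.Tensor:
    # Assumption: a missing gradient on `y` means this SDP split contributed
    # no KV gradient, so it can be treated as zero.
    return grad_output if y.grad is None else y.grad + grad_output
```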
🔄 Steps to Reproduce
Run anything with `sequence-data-parallel > 1`.
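A self-contained illustration of the underlying failure mode, independent of Fast-LLM (plain PyTorch, made-up tensors): a tensor's `.grad` stays `None` until a backward pass actually writes to it, so adding it to a real gradient reproduces the same `TypeError`:

```python
import torch

y = torch.ones(2, 2, requires_grad=True)
grad_output = torch.ones(2, 2)

assert y.grad is None        # no backward has run, so .grad was never populated
grad = y.grad + grad_output  # TypeError: unsupported operand type(s) for +: 'NoneType' and 'Tensor'
```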
🎯 Expected Behavior
Not crashing.