🐞 Describe the Bug
Sequence-data-parallel appears to be currently broken. Example (job 31bd5fee-99aa-4db6-9aed-7014cc5124aa):
```
2024-11-19 22:21:07,580 [Rank 00] Traceback (most recent call last):
  File "/app/fast_llm/tools/cli.py", line 29, in fast_llm
    Runnable.parse_and_run(unparsed)
  File "/app/fast_llm/engine/config_utils/runnable.py", line 36, in parse_and_run
    runnable()
  File "/app/fast_llm/engine/training/config.py", line 373, in runnable
    trainer.run()
  File "/app/fast_llm/engine/training/trainer.py", line 141, in run
    self._run_training()
  File "/app/fast_llm/engine/training/trainer.py", line 155, in _run_training
    done, metrics = self._train()
  File "/app/fast_llm/engine/training/trainer.py", line 210, in _train
    reduced_losses, update_successful, train_metrics = self._runner.run_step(
  File "/app/fast_llm/engine/schedule/runner.py", line 194, in run_step
    self._train_step(context, step)
  File "/app/fast_llm/engine/schedule/runner.py", line 302, in _train_step
    output = self._backward(context, step)
  File "/app/fast_llm/engine/schedule/runner.py", line 407, in _backward
    input_grad = self._stages[step.stage].backward(
  File "/app/fast_llm/engine/multi_stage/stage.py", line 114, in backward
    output.backward(output_grad)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 520, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 288, in backward
    _engine_run_backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py", line 767, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 305, in apply
    return user_fn(self, *args)
  File "/app/fast_llm/layers/transformer/attention.py", line 37, in backward
    grad = y.grad + grad_output
TypeError: unsupported operand type(s) for +: 'NoneType' and 'Tensor'
```
This traceback comes from the summing of KV gradients across the sequence-data-parallel splits; something must have broken at some point. Also, I don't see any SDP coverage in the tests, which is bad.
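For context, here is a minimal sketch of the failing pattern and one possible guard (hypothetical names, not the actual Fast-LLM code). In a custom autograd `backward`, the KV gradient accumulated on a saved tensor is added to the incoming `grad_output`; if nothing ever wrote to `y.grad`, it is `None` and the addition raises exactly this `TypeError`. Treating a missing gradient as zero would avoid the crash, assuming a zero KV gradient is the correct semantics for a split that contributed nothing:

```python
import torch

# Hypothetical reduction of the failing line in attention.py:
#     grad = y.grad + grad_output
# If no backward pass ever populated `y.grad`, it is None and the `+` raises
#     TypeError: unsupported operand type(s) for +: 'NoneType' and 'Tensor'
def accumulate_kv_grad(y: torch.Tensor, grad_output: torch.Tensor) -> torch.Tensor:
    # Assumption: a missing gradient on `y` means this SDP split contributed
    # no KV gradient, so it can be treated as zero.
    return grad_output if y.grad is None else y.grad + grad_output
```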
🔄 Steps to Reproduce
Run anything with `sequence-data-parallel > 1`.
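A self-contained illustration of the underlying failure mode, independent of Fast-LLM (plain PyTorch, made-up tensors): a tensor's `.grad` stays `None` until a backward pass actually writes to it, so adding it to a real gradient reproduces the same `TypeError`:

```python
import torch

y = torch.ones(2, 2, requires_grad=True)
grad_output = torch.ones(2, 2)

assert y.grad is None        # no backward has run, so .grad was never populated
grad = y.grad + grad_output  # TypeError: unsupported operand type(s) for +: 'NoneType' and 'Tensor'
```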
🎯 Expected Behavior
Not crashing.