Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)` #1296

Open wprazuch opened 1 month ago

wprazuch commented 1 month ago

When running tiny-llama-1.1b in Thunder we get an error:

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

🐛 Bug

Full traceback:

0: Error: [rank0]: Traceback (most recent call last):
0: [rank0]:   File "/workspace/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 949, in <module>
0: [rank0]:     CLI(benchmark_main)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/jsonargparse/_cli.py", line 96, in CLI
0: [rank0]:     return _run_component(components, init)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/jsonargparse/_cli.py", line 204, in _run_component
0: [rank0]:     return component(**cfg)
0: [rank0]:   File "/workspace/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 861, in benchmark_main
0: [rank0]:     benchmark.train()
0: [rank0]:   File "/workspace/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 748, in train
0: [rank0]:     loss.backward()
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 624, in backward
0: [rank0]:     torch.autograd.backward(
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line [347](https://gitlab-master.nvidia.com/dl/jet/ci/-/jobs/115874010#L347), in backward
0: [rank0]:     _engine_run_backward(
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
0: [rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 307, in apply
0: [rank0]:     return user_fn(self, *args)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 600, in wrapper
0: [rank0]:     outputs = fn(ctx, *args)
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/executors/torch_autograd.py", line 96, in backward
0: [rank0]:     grads = ctx.compiled_backward([saved_tensors_list, ctx.saved_other], args)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
0: [rank0]:     return func(*args, **kwargs)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
0: [rank0]:     return func(*args, **kwargs)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
0: [rank0]:     return func(*args, **kwargs)
0: [rank0]:   File "thunder.backward_fn_177", line 117, in backward_fn
0: [rank0]: RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

To Reproduce

Please use 1 node with 8 GPUs and the image "INTERNAL_IMAGE:pjnl-20241011".

Training script:

python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
  --model_name tiny-llama-1.1b \
  --distributed_mode fsdp \
  --shard_mode zero3 \
  --compile thunder \
  --checkpoint_activations False \
  --low_precision_mode none \
  --micro_batch_size 18

Expected behavior

We should not see this error.

Environment

system.device_product_name DGXH100
system.gpu_driver_version 535.129.03
libraries.cuda 12.6.2.004
libraries.pip.lightning 2.4.0.dev20240728
libraries.pip.lightning-thunder 0.2.0.dev0
libraries.pip.lightning-utilities 0.11.7
libraries.pip.litgpt 0.4.11
libraries.pip.nvfuser 0.2.15+gitf3a2087
libraries.pip.pytorch-lightning 2.4.0
libraries.pip.torch 2.6.0a0+git4e89977
libraries.pip.torchmetrics 1.4.3
libraries.pip.torchvision 0.19.0a0+d23a6e1

crcrpar commented 1 month ago

I failed to reproduce the error on a machine with 8 H100 cards when running interactively.

IvanYashchuk commented 1 month ago

This could be a one-off problem with either the software or the hardware. Wojciech, please let us know if the problem persists in the next runs!

It could also be caused by the large batch size (--micro_batch_size 18). Sometimes PyTorch and CUDA do not report clearly that the device has run out of memory, and the failure surfaces as CUBLAS_STATUS_ALLOC_FAILED from cublasCreate instead of a regular out-of-memory error (https://stackoverflow.com/a/64040256).
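
If the error shows up again, one way to check whether it is really an out-of-memory condition is to log the free device memory right before the failing backward call. A minimal sketch, assuming torch.cuda.mem_get_info is available in the installed PyTorch (the log_free_memory helper is illustrative, not part of the benchmark script):

import torch

def log_free_memory(device: int) -> None:
    # torch.cuda.mem_get_info returns (free_bytes, total_bytes) for the given device
    free, total = torch.cuda.mem_get_info(device)
    print(f"GPU {device}: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")

# For example, in the training loop just before loss.backward():
#     log_free_memory(torch.cuda.current_device())
# If free memory is close to zero at this point, cuBLAS has no room left to allocate its
# workspace, so cublasCreate reports CUBLAS_STATUS_ALLOC_FAILED instead of a plain OOM.

If that turns out to be the case, lowering --micro_batch_size or setting --checkpoint_activations True should reduce peak memory enough to avoid the failure.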