facebookresearch / xformers

Hackable and optimized Transformers building blocks, supporting a composable construction.
https://facebookresearch.github.io/xformers/

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)` #1144

Closed antony-frolov closed 2 weeks ago

antony-frolov commented 2 weeks ago

Hey! I'm trying to run the sequence-parallel (seqpar) ops in the latest NVIDIA PyTorch container and am stuck on this error when running sequence_parallel_trailing_matmul with fuse=True:

File "/usr/local/lib/python3.10/dist-packages/xformers/ops/sequence_parallel_fused_ops.py", line 891, in my_matmul
[rank1]:     torch.matmul(gathered_input[dst_rank], w.t(), out=o)
[rank1]: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Here's a small example that reproduces the error:

docker run -it --rm --gpus all nvcr.io/nvidia/pytorch:24.10-py3 /bin/bash -c "
    MAX_JOBS=20 pip install -v --no-deps git+https://github.com/facebookresearch/xformers.git@main#egg=xformers \
    && curl https://gist.githubusercontent.com/antony-frolov/63f61a0c5afc0bd19b58c07aae7ab9c8/raw/cc9daeffc9e005f12e05d7a020931b020977d105/seqpar.py --output ./seqpar.py \
    && cat ./seqpar.py \
    && torchrun --nproc-per-node 2 ./seqpar.py
"

Any ideas on what might be wrong with my setup, or whether this might be a bug in the source code?

antony-frolov commented 2 weeks ago

Issue solved: I forgot to call torch.cuda.set_device(...) in my script.
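
For anyone who lands here with the same error, below is a minimal sketch of what that fix can look like in a script launched with torchrun. It is not the code from the gist: the helper names local_rank_from_env and bind_process_to_gpu are hypothetical, and it assumes one process per GPU with the NCCL backend. torchrun exports LOCAL_RANK for each process. If torch.cuda.set_device is never called, every rank stays on the default device (cuda:0), which is presumably what caused cuBLAS to fail here.

```python
import os


def local_rank_from_env(env=os.environ) -> int:
    """Return the per-node rank that torchrun exports as LOCAL_RANK (0 if unset)."""
    return int(env.get("LOCAL_RANK", "0"))


def bind_process_to_gpu() -> int:
    """Select this process's GPU before any CUDA or NCCL work happens.

    Hypothetical helper, assuming one process per GPU and the NCCL backend.
    """
    # Imported lazily so local_rank_from_env stays usable without PyTorch.
    import torch
    import torch.distributed as dist

    local_rank = local_rank_from_env()
    # Without this call the process keeps the default current device, cuda:0.
    torch.cuda.set_device(local_rank)
    # torchrun also sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
    # which the default env:// initialization reads.
    dist.init_process_group(backend="nccl")
    return local_rank
```

Calling bind_process_to_gpu() at the top of the script, before creating any CUDA tensors or calling the xformers ops, should be enough; tensors can then be allocated with device="cuda" and will land on the GPU that belongs to the rank.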