RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Hey! I'm trying to run the seqpar (sequence-parallel) ops in the latest NVIDIA PyTorch container, and I'm stuck on this error when running sequence_parallel_trailing_matmul with fuse=True:

File "/usr/local/lib/python3.10/dist-packages/xformers/ops/sequence_parallel_fused_ops.py", line 891, in my_matmul
[rank1]:     torch.matmul(gathered_input[dst_rank], w.t(), out=o)
[rank1]: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Here's a small example to recreate my error:

docker run -it --rm --gpus all nvcr.io/nvidia/pytorch:24.10-py3 /bin/bash -c "
    MAX_JOBS=20 pip install -v --no-deps git+https://github.com/facebookresearch/xformers.git@main#egg=xformers \
    && curl https://gist.githubusercontent.com/antony-frolov/63f61a0c5afc0bd19b58c07aae7ab9c8/raw/cc9daeffc9e005f12e05d7a020931b020977d105/seqpar.py --output ./seqpar.py \
    && cat ./seqpar.py \
    && torchrun --nproc-per-node 2 ./seqpar.py
"

Any ideas on what might be wrong with my setup, or whether this is a bug in the source code?
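
One way to narrow this down might be a sanity check like the sketch below, run inside the same container. It is not part of xformers or of the gist above; the function name `check_cublas_per_device` and the returned dict keys are made up for illustration. It only tests whether a plain matmul (which is normally dispatched to cuBLAS) works on each visible GPU and how much memory is free, so it does not exercise the fused seqpar path itself.

```python
# Hypothetical sanity check, not part of xformers or the seqpar.py gist.
def check_cublas_per_device(n=64):
    """Run a tiny matmul on every visible GPU and report the outcome."""
    try:
        import torch
    except ImportError:
        return []  # no PyTorch in this environment, nothing to check

    results = []
    if not torch.cuda.is_available():
        return results

    for idx in range(torch.cuda.device_count()):
        device = torch.device("cuda", idx)
        row = {"device": idx, "free_mib": None, "total_mib": None, "error": None}
        try:
            # Free and total device memory in bytes, converted to MiB.
            free_bytes, total_bytes = torch.cuda.mem_get_info(device)
            row["free_mib"] = free_bytes // 2**20
            row["total_mib"] = total_bytes // 2**20

            a = torch.randn(n, n, device=device)
            b = torch.randn(n, n, device=device)
            torch.matmul(a, b)  # a GPU matmul normally goes through cuBLAS
            torch.cuda.synchronize(device)  # surface asynchronous CUDA errors here
        except RuntimeError as exc:
            row["error"] = str(exc)
        results.append(row)
    return results


if __name__ == "__main__":
    for row in check_cublas_per_device():
        print(row)
```

If every device reports `error: None` here while the torchrun repro still fails, that would at least suggest the problem is specific to the fused path rather than to cuBLAS being unusable in the container.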

facebookresearch/xformers, issue #1144