Hey! I'm trying to run the seqpar ops using the latest NVIDIA PyTorch container and am stuck on this error when running `sequence_parallel_trailing_matmul` with `fuse=True`:
File "/usr/local/lib/python3.10/dist-packages/xformers/ops/sequence_parallel_fused_ops.py", line 891, in my_matmul
[rank1]: torch.matmul(gathered_input[dst_rank], w.t(), out=o)
[rank1]: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Here's a small example to recreate my error:
Any ideas on what might be wrong with my setup or if it might be some bug in the source code?