Open clessig opened 5 hours ago
It's possible. We use 32-bit indexing, so when tensors get larger than 2 GB or 4 GB the indexing might be wrong. Can you help us reproduce the error, e.g. with a short script?
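A short script along the following lines might be a starting point (a sketch only, not from the original report; the shapes are hypothetical and chosen so each packed q/k/v tensor is several GB in fp16, past the 2 GB / 4 GB marks mentioned above, so it needs a GPU with a lot of memory):

```python
# A sketch of a possible repro script (hypothetical shapes): each packed
# q/k/v tensor below is ~5 GB in fp16.
import torch
from flash_attn import flash_attn_varlen_func

device, dtype = "cuda", torch.float16
nheads, headdim = 16, 128          # hypothetical model dims
batch, seqlen = 160, 8192          # many fixed-length chunks packed together
total_tokens = batch * seqlen

q = torch.randn(total_tokens, nheads, headdim, device=device, dtype=dtype, requires_grad=True)
k = torch.randn_like(q, requires_grad=True)
v = torch.randn_like(q, requires_grad=True)

# cu_seqlens holds the cumulative sequence boundaries of the packed layout
cu_seqlens = torch.arange(0, (batch + 1) * seqlen, seqlen, device=device, dtype=torch.int32)

out = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens, seqlen, seqlen, causal=True)
out.sum().backward()               # the reported IMA happens in varlen_bwd
torch.cuda.synchronize()
print("backward completed without an illegal memory access")
```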
I just tried to write a small repro case with just one MHA-Varlen but couldn't reproduce it.
Is it possible that the error depends on the entire graph of my real-world network?
If you can save the tensors (q, k, v, and the gradient) that caused the IMA, you can load them back up in a script.
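For concreteness, a replay script could look roughly like this (a sketch; the file name, key names, and the causal flag are assumptions, not taken from the report):

```python
# Save-and-replay sketch (hypothetical names). First, in the training run,
# dump the inputs of the failing call right before it executes:
#
#     torch.save({"q": q, "k": k, "v": v,
#                 "cu_seqlens": cu_seqlens, "max_seqlen": max_seqlen,
#                 "dout": dout},            # gradient w.r.t. the output, if available
#                "ima_inputs.pt")
#
# Then replay them in a standalone script:
import torch
from flash_attn import flash_attn_varlen_func

blob = torch.load("ima_inputs.pt", map_location="cuda")
q = blob["q"].detach().requires_grad_(True)
k = blob["k"].detach().requires_grad_(True)
v = blob["v"].detach().requires_grad_(True)

out = flash_attn_varlen_func(
    q, k, v,
    blob["cu_seqlens"], blob["cu_seqlens"],
    blob["max_seqlen"], blob["max_seqlen"],
    causal=True,
)
dout = blob.get("dout")
out.backward(dout if dout is not None else torch.ones_like(out))  # rerun the backward that hit the IMA
torch.cuda.synchronize()
```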
I obtain the following error when the number of chunks/batches becomes large:
```
File "/gpfs/home/ecm/ecm327663/obs6/ai-obs-experimental-transformer/pyenv312/lib/python3.12/site-packages/flash_attn-2.6.3-py3.12-linux-x86_64.egg/flash_attn/flash_attn_interface.py", line 198, in _flash_attn_varlen_backward
    ) = flash_attn_cuda.varlen_bwd(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
```
Is it possible that there is an implicit upper limit on the number of chunks/batches that is not covered by the input checks (potentially with some memory space running out)?
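One way to check whether the 32-bit indexing limit mentioned above could be the culprit is a quick size estimate (a sketch; all numbers are placeholders for the real run):

```python
# Back-of-the-envelope check (hypothetical values). If the element count of
# the packed q/k/v tensors exceeds 2**31 - 1, 32-bit indexing inside the
# kernel can overflow without any explicit check firing.
INT32_MAX = 2**31 - 1

num_chunks, chunk_len = 4096, 512   # hypothetical batching parameters
nheads, headdim = 16, 128           # hypothetical model dims

total_tokens = num_chunks * chunk_len
elements_per_tensor = total_tokens * nheads * headdim
bytes_fp16 = 2 * elements_per_tensor

print(f"total tokens:          {total_tokens:,}")
print(f"elements in q/k/v:     {elements_per_tensor:,}")
print(f"size in fp16:          {bytes_fp16 / 2**30:.1f} GiB")
print(f"exceeds int32 range:   {elements_per_tensor > INT32_MAX}")
```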