Open loubbrad opened 1 year ago
Hi,
This is probably due to a newer version of Triton being picked up, which is not compatible with the kernels we had written.
I'd recommend using an older version of Triton instead. The one we currently use is https://github.com/facebookresearch/xformers/blob/8e1673bd100699089fab1791969974495f3bbc6b/requirements-test.txt#L31
triton==2.0.0.dev20221105
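A quick way to see whether the environment actually picked up the pinned build is to compare the installed Triton version against the pin. This is only a minimal sketch using the standard library; the helper names `installed_version` and `triton_matches_pin` are made up for illustration and are not part of xformers or Triton.

```python
from importlib import metadata
from typing import Optional

# Pin quoted in this thread (taken from xformers' requirements-test.txt).
PINNED_TRITON = "2.0.0.dev20221105"


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def triton_matches_pin(version: Optional[str], pin: str = PINNED_TRITON) -> bool:
    """True only when the given version string equals the pin exactly."""
    return version is not None and version == pin


if __name__ == "__main__":
    found = installed_version("triton")
    print(f"triton installed: {found}, matches pin: {triton_matches_pin(found)}")
```

Running it inside the affected environment shows at a glance whether pip resolved a newer Triton than the one the kernels were written for.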
Thanks for your reply. I am still failing to get an environment that allows me to use xFormerEncoderBlock with CUDA. Given the requirement you have listed, pip automatically downgrades PyTorch to version 1.13.1. Pip also seems to install Triton 2.0.0 by default when installing version 2.0.0 of PyTorch.
Is there any resource for building a fresh working environment from scratch, like a makefile? I'm sure this issue with Triton is temporary.
I faced the same error and double-checked that the error only happens on the loss.backward() call.
And it is true that triton==2.0.0.dev20221105 is incompatible with Torch 2.0.0.
I tried removing triton and reinstalling it with 'pip install triton==2.0.0.dev20221105'. That error no longer occurs, but another one appears: 'RuntimeError: CUDA error: device-side assert triggered'. I found that this RuntimeError may result from my installed CUDA version, which is 11.3, while Triton needs 11.4+. I am now trying to install xformers from source in a conda virtual environment.
CUDA version messages
Triton softmax kernel register spillover or invalid image caught. Deactivating this kernel, please file an issue int the xFormers repository
Triton requires CUDA 11.4+
Triton layernorm kernel register spillover or invalid image caught. Deactivating this kernel, please file an issue in the xFormers repository
Triton requires CUDA 11.4+
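The "Triton requires CUDA 11.4+" message can be turned into a small pre-flight check. The sketch below only compares version strings; the function names are hypothetical, and which CUDA version is the relevant one (system toolkit, driver, or the one PyTorch was built with) is an assumption you would need to settle for your own setup.

```python
from typing import Tuple

# Minimum taken from the "Triton requires CUDA 11.4+" message in this thread.
MIN_CUDA = (11, 4)


def parse_cuda_version(text: str) -> Tuple[int, int]:
    """Parse a CUDA version string such as '11.3' or '11.8.0' into (major, minor)."""
    parts = text.strip().split(".")
    return int(parts[0]), int(parts[1])


def cuda_meets_minimum(text: str, minimum: Tuple[int, int] = MIN_CUDA) -> bool:
    """True when the parsed (major, minor) version is at least `minimum`."""
    return parse_cuda_version(text) >= minimum
```

For example, `cuda_meets_minimum("11.3")` is false for the setup described in this comment, while a string such as `"11.8"` passes. With PyTorch installed, `torch.version.cuda` is one possible source for the string (it reports the CUDA version PyTorch was built with and is `None` on CPU-only builds).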
Final RuntimeError messages
Traceback (most recent call last):
  File "/root/source/lab/profun-som-dev/scripts/construct_gendis.py", line 176, in <module>
    main()
  File "/root/source/lab/profun-som-dev/scripts/construct_gendis.py", line 169, in main
    training_prog(opt)
  File "/root/source/lab/profun-som-dev/experiments/go/train.py", line 241, in training_prog
    scale_loss.backward()
  File "/root/miniconda3/envs/pytorch2.0/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/root/miniconda3/envs/pytorch2.0/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
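The traceback itself suggests CUDA_LAUNCH_BLOCKING=1 for debugging. One way to apply it from Python, sketched here with a hypothetical helper name, is to set the variable at the very top of the training script, before torch is imported and CUDA is initialized:

```python
import os


def enable_blocking_cuda_launches() -> None:
    """Ask CUDA to run kernels synchronously so errors surface at the failing call."""
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_blocking_cuda_launches()
# Import torch and run the training code only after this point, so the
# setting is already in place when CUDA is initialized.
```

Setting the variable in the shell (`CUDA_LAUNCH_BLOCKING=1 python construct_gendis.py`) has the same effect. Expect slower training while it is enabled.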
Maybe @fmassa ?
pip install triton==2.0.0.dev20221105 --no-deps
may work, without downgrading PyTorch 2.0.
Triton is debugging the same symptom here: https://github.com/openai/triton/issues/1271
I confirm that pip install triton==2.0.0.dev20221105 --no-deps
is compatible with PyTorch 2 and resolves the error, at least for me on an RTX 3050 with CUDA 11.8.
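For anyone scripting their environment setup, the same workaround can be expressed in Python. This is a sketch, not an official install procedure: the function names are invented for illustration, and `install()` is defined but deliberately not called here.

```python
import subprocess
import sys
from typing import List

# Requirement string quoted in this thread.
TRITON_PIN = "triton==2.0.0.dev20221105"


def pip_install_no_deps_command(requirement: str = TRITON_PIN) -> List[str]:
    """Build a pip command that installs `requirement` without resolving its dependencies."""
    # Using sys.executable targets the pip of the currently active interpreter.
    return [sys.executable, "-m", "pip", "install", requirement, "--no-deps"]


def install(requirement: str = TRITON_PIN) -> None:
    """Run the command; raises subprocess.CalledProcessError if pip fails."""
    subprocess.run(pip_install_no_deps_command(requirement), check=True)
```

The `--no-deps` flag is what keeps pip from downgrading PyTorch while it swaps the Triton build.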
In my case, this happened because I added some other operations before FusedLayerNorm, e.g. adding some parameters together. I fixed this simply by replacing FusedLayerNorm with nn.LayerNorm. Maybe FusedLayerNorm's backward only supports a limited set of operations before it; I'm not sure. Hope this helps.
@vmarkovtsev Upgrading/Downgrading triton worked for me as well! Thanks
@Zyriix
"I fixed this simply by replacing FusedLayerNorm with nn.LayerNorm."
Just curious, how did you do that?
Just change the code in xformers/xformers/components/residual.py, in the PreNorm class: disable FusedLayerNorm. I'm not sure this is a good solution. I got the error when I added an additional position embedding before the fused layernorm, and when I changed the fused layernorm to nn.LayerNorm, it worked. So maybe FusedLayerNorm has its own CUDA kernel or Triton code or something.
🐛 Bug
I just rebuilt my environment, and my model has stopped working when running on GPU (CPU is fine). The error is:
Argument rematerialization not implemented
UNREACHABLE executed at /project/lib/Dialect/TritonGPU/Transforms/TritonGPUConversion.cpp:45!
Aborted
I have reproduced this on a Debian VM as well as my machine.
To Reproduce
Steps to reproduce the behavior:
conda create -n test python=3.10.9
conda activate test
pip3 install torch torchvision torchaudio
pip install -U xformers
Then you may run the following python file
I get the error:
Argument rematerialization not implemented
UNREACHABLE executed at /project/lib/Dialect/TritonGPU/Transforms/TritonGPUConversion.cpp:45!
Aborted
with no Python traceback provided. I think the error happens only during loss.backward().
Environment
PyTorch version: 2.0.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.26.1
Libc version: glibc-2.35
Python version: 3.11.2 (main, Mar 27 2023, 23:42:44) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
Nvidia driver version: 527.56
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
CPU family: 6
Model: 165
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 2
BogoMIPS: 5184.01
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 192 KiB (6 instances)
L1i cache: 192 KiB (6 instances)
L2 cache: 1.5 MiB (6 instances)
L3 cache: 12 MiB (1 instance)
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Unknown: Dependent on hypervisor status
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.24.2
[pip3] pytorch-lightning==2.0.0
[pip3] torch==2.0.0
[pip3] torchmetrics==0.11.4
[conda] numpy 1.24.2 pypi_0 pypi
[conda] pytorch-lightning 2.0.0 pypi_0 pypi
[conda] torch 2.0.0 pypi_0 pypi
[conda] torchmetrics 0.11.4 pypi_0 pypi
How you installed PyTorch (conda, pip, source): pip