loubbrad commented 1 year ago

🐛 Bug

I just rebuilt my environment, and my model has stopped working when running on GPU (CPU is fine). The error is:

Argument rematerialization not implemented UNREACHABLE executed at /project/lib/Dialect/TritonGPU/Transforms/TritonGPUConversion.cpp:45! Aborted

I have reproduced this on a Debian VM as well as my machine.

To Reproduce

Steps to reproduce the behavior:

conda create -n test python=3.10.9
conda activate test
pip3 install torch torchvision torchaudio
pip install -U xformers

Then run the following Python file:

"""Includes (PyTorch) transformer model and config classes. Created using
the xFormers library."""

import torch
from torch import nn as nn
import torch.utils.checkpoint
from xformers.factory import xFormerEncoderBlock, xFormerEncoderConfig
from dataclasses import dataclass

@dataclass
class ModelConfig:
    d_model: int = 128
    n_heads: int = 8
    n_layers: int = 2
    ff_mult: int = 4
    drop_p: float = 0.1
    max_seq_len: int = 1024

    # Set according to tokenizer
    vocab_size: int = 10
    pad_id: int = 1
    mask_id: int = 3

    grad_checkpoint: bool = False
    att_mask: torch.Tensor | None = None

class EncoderBlock(nn.Module):
    """Encoder block with rotary embeddings from xFormers library.

    Note that xFormer blocks expect batch first.

    Args:
        model_config (ModelConfig): Model config settings.
    """

    def __init__(self, model_config: ModelConfig, layer_id: int):
        super().__init__()
        self.layer_id = layer_id
        self.mask = model_config.att_mask

        encoder_config = {
            "dim_model": model_config.d_model,
            "residual_norm_style": "pre",
            "multi_head_config": {
                "num_heads": model_config.n_heads,
                "residual_dropout": model_config.drop_p,
                "use_rotary_embeddings": True,
                "attention": {
                    "name": "scaled_dot_product",
                    "dropout": model_config.drop_p,
                    "seq_len": model_config.max_seq_len,
                    "causal": False,
                    "use_rotary_embeddings": True,
                },
            },
            "feedforward_config": {
                "name": "MLP",
                "dropout": model_config.drop_p,
                "activation": "gelu",
                "hidden_layer_multiplier": 4,
            },
        }

        config = xFormerEncoderConfig(**encoder_config)
        self.encoder = xFormerEncoderBlock(config)
        # self.encoder = nn.TransformerEncoderLayer(
        #    model_config.d_model,
        #    model_config.n_heads,
        #    model_config.d_model,
        #    batch_first=True,
        # )

    def forward(self, src: torch.Tensor):
        """Forward pass for EncoderBlock.

        Args:
            src (torch.tensor): Input to encoder block, of shape (batch_size,
                seq_len, d_model).

        Returns:
            torch.tensor: forward pass of src through the encoder block.
        """
        return self.encoder(src, self.mask)

class MuseEncoder(nn.Module):
    """MuseEncoder with no additional model head.

    Args:
        model_config (ModelConfig): Model config settings.
    """

    def __init__(self, model_config: ModelConfig):
        super().__init__()

        self.model_config = model_config

        self.tok_embeddings = nn.Embedding(
            num_embeddings=model_config.vocab_size,
            embedding_dim=model_config.d_model,
            padding_idx=model_config.pad_id,
        )

        self.encode_layers = nn.ModuleList()
        for layer_id in range(model_config.n_layers):
            self.encode_layers.append(EncoderBlock(model_config, layer_id))

    def forward(self, src: torch.Tensor):
        """Forward pass of MuseEncoder.

        Args:
            src (torch.tensor): Input token IDs, of shape (batch_size,
                seq_len).

        Returns:
            torch.tensor: Model outputs with shape (batch_size, seq_len,
                d_model).
        """

        hidden_states = self.tok_embeddings(src)

        # Implements gradient checkpoints on Encoder Layers.
        # TODO: Test that this doesn't change the gradient calculation
        # TODO: Do profiling for the memory/compute tradeoff
        if self.model_config.grad_checkpoint is True:
            for layer in self.encode_layers:

                def create_custom_forward(module):
                    def custom_forward(hidden_states):
                        return module(hidden_states)

                    return custom_forward

                hidden_states = torch.utils.checkpoint.checkpoint(
                    create_custom_forward(layer),
                    hidden_states,
                    preserve_rng_state=True,
                )

        else:
            for layer in self.encode_layers:
                hidden_states = layer(hidden_states)

        return hidden_states

class MuseMaskedLM(nn.Module):
    """MuseEncoder with head for masked language modelling.

    Args:
        model_config (ModelConfig): Model config settings.
    """

    def __init__(self, model_config: ModelConfig):
        super().__init__()

        self.model = MuseEncoder(model_config)
        self.lm_head = nn.Linear(
            model_config.d_model, model_config.vocab_size, bias=False
        )

    def forward(self, src: torch.Tensor):
        """Forward pass of MuseEncoder with MaskedLM head (logits output).

        Args:
            src (torch.tensor): Input token IDs, of shape (batch_size,
                seq_len).

        Returns:
            torch.tensor: Logits with shape (batch_size, seq_len, vocab_size).
        """
        logits = self.lm_head(self.model(src))

        return logits

def main():
    model_config = ModelConfig()
    model = MuseMaskedLM(model_config).cuda()

    src = torch.ones(1, 1024, dtype=torch.long).cuda()
    res = model(src)
    loss = res.mean()
    loss.backward()

if __name__ == "__main__":
    main()

I get the error:

Argument rematerialization not implemented UNREACHABLE executed at /project/lib/Dialect/TritonGPU/Transforms/TritonGPUConversion.cpp:45! Aborted

with no Python traceback provided. I think the error happens only during loss.backward().
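Because the process aborts natively with no Python traceback, one way to narrow down where it dies is sketched below. This is only a debugging aid under stated assumptions: `run_stages` and `enable_native_crash_traceback` are hypothetical helper names, not part of PyTorch or xFormers. The standard-library `faulthandler` module can dump the Python stack on fatal signals such as SIGABRT, and the flushed stage markers show the last step that started before the abort.

```python
import faulthandler
import sys
from typing import Any, Callable, Dict, List, Tuple


def enable_native_crash_traceback() -> None:
    """Ask Python to dump its own traceback on fatal signals such as SIGABRT."""
    faulthandler.enable(file=sys.stderr, all_threads=True)


def run_stages(stages: List[Tuple[str, Callable[[], Any]]]) -> Dict[str, Any]:
    """Run named stages in order, printing a flushed marker around each one.

    If the process aborts inside a stage, the last "start" marker without a
    matching "done" marker identifies the stage that crashed.
    """
    results: Dict[str, Any] = {}
    for name, fn in stages:
        print(f"[stage] {name}: start", file=sys.stderr, flush=True)
        results[name] = fn()
        print(f"[stage] {name}: done", file=sys.stderr, flush=True)
    return results


# Hypothetical usage inside main() of the script above (not executed here):
#
#   enable_native_crash_traceback()
#   state = {}
#   run_stages([
#       ("forward", lambda: state.update(loss=model(src).mean())),
#       ("sync", torch.cuda.synchronize),
#       ("backward", lambda: state["loss"].backward()),
#   ])
```

If the last marker printed is "backward: start", that would support the guess that the abort happens only during the backward pass.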

Environment

PyTorch version: 2.0.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.26.1
Libc version: glibc-2.35

Python version: 3.11.2 (main, Mar 27 2023, 23:42:44) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
Nvidia driver version: 527.56
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
CPU family: 6
Model: 165
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 2
BogoMIPS: 5184.01
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 192 KiB (6 instances)
L1i cache: 192 KiB (6 instances)
L2 cache: 1.5 MiB (6 instances)
L3 cache: 12 MiB (1 instance)
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Unknown: Dependent on hypervisor status
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.24.2
[pip3] pytorch-lightning==2.0.0
[pip3] torch==2.0.0
[pip3] torchmetrics==0.11.4
[conda] numpy 1.24.2 pypi_0 pypi
[conda] pytorch-lightning 2.0.0 pypi_0 pypi
[conda] torch 2.0.0 pypi_0 pypi
[conda] torchmetrics 0.11.4 pypi_0 pypi

PyTorch Version (e.g., 1.0): 2.0.0
OS (e.g., Linux): Ubuntu WSL
How you installed PyTorch (conda, pip, source): pip
Build command you used (if compiling from source): pip3 install torch torchvision torchaudio
Python version: 3.10.9
CUDA/cuDNN version: 11.7
GPU models and configuration:
Any other relevant information:

fmassa commented 1 year ago

Hi,

This is probably due to a newer version of Triton being picked up, which is not compatible with the kernels we had written.

I'd recommend using an older version of Triton instead. The one we currently pin is listed at https://github.com/facebookresearch/xformers/blob/8e1673bd100699089fab1791969974495f3bbc6b/requirements-test.txt#L31:

triton==2.0.0.dev20221105
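To see which Triton build pip actually picked up, a small check like the one below may help. It is a sketch using only the standard-library `importlib.metadata`; the helper names `installed_versions` and `matches_pin` are hypothetical, and the pin string is copied from the line above.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Pin quoted from the xFormers requirements-test.txt line referenced above.
TRITON_PIN = "2.0.0.dev20221105"


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return the installed version of each package, or None if it is missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def matches_pin(installed: Optional[str], pin: str) -> bool:
    """True only when the package is installed at exactly the pinned version."""
    return installed is not None and installed == pin


versions = installed_versions(["torch", "triton", "xformers"])
for package, version in versions.items():
    print(f"{package}: {version or 'not installed'}")
print("triton matches pin:", matches_pin(versions["triton"], TRITON_PIN))
```

Running this in the affected environment shows at a glance whether a newer Triton was pulled in alongside PyTorch.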

loubbrad commented 1 year ago

Thanks for your reply. I am still failing to get an environment that allows me to use xFormerEncoderBlock with CUDA. With the requirement you listed, pip automatically downgrades PyTorch to version 1.13.1. Pip also seems to install Triton 2.0.0 by default when installing PyTorch 2.0.0.

Is there any resource for building a fresh working environment from scratch? Like a makefile? I'm sure this issue with Triton is temporary.

SpirinEgor commented 1 year ago

I hit the same error and double-checked that it only happens on the loss.backward() call. It is also true that triton==2.0.0.dev20221105 is incompatible with Torch 2.0.0.

altriasjy31 commented 1 year ago

I removed Triton and reinstalled it with 'pip install triton==2.0.0.dev20221105'. This error no longer occurs, but another one does: 'RuntimeError: CUDA error: device-side assert triggered'. This RuntimeError may result from my installed CUDA version, which is 11.3, while Triton needs 11.4+. I am now trying to install xformers from source in a conda virtual environment.

CUDA version messages

Triton softmax kernel register spillover or invalid image caught.Deactivating this kernel, please file an issue int the xFormers repository
Triton requires CUDA 11.4+
Triton layernorm kernel register spillover or invalid image caught. Deactivating this kernel, please file an issue in the xFormers repository
Triton requires CUDA 11.4+
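The "Triton requires CUDA 11.4+" message can be checked against the environment before training starts. The sketch below only compares version strings; `cuda_at_least` is a hypothetical helper, and treating `torch.version.cuda` (the CUDA version PyTorch was built with) as the relevant version is an assumption, since the thread does not say which CUDA version Triton inspects.

```python
from typing import Optional, Tuple


def cuda_at_least(version: Optional[str], minimum: Tuple[int, int] = (11, 4)) -> bool:
    """Return True if a CUDA version string such as "11.7" is >= minimum.

    A missing version (None or empty, e.g. a CPU-only build) counts as False.
    """
    if not version:
        return False
    fields = version.split(".")
    major = int(fields[0])
    minor = int(fields[1]) if len(fields) > 1 else 0
    return (major, minor) >= minimum


# Hypothetical usage (assumes torch is installed; not executed here):
#
#   import torch
#   if not cuda_at_least(torch.version.cuda):
#       print(f"CUDA {torch.version.cuda} is older than 11.4")
```

With the versions mentioned in this thread, `cuda_at_least("11.3")` is False while `cuda_at_least("11.7")` and `cuda_at_least("11.8")` are True.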

Final RuntimeError messages

Traceback (most recent call last):
  File "/root/source/lab/profun-som-dev/scripts/construct_gendis.py", line 176, in <module>
    main()
  File "/root/source/lab/profun-som-dev/scripts/construct_gendis.py", line 169, in main
    training_prog(opt)
  File "/root/source/lab/profun-som-dev/experiments/go/train.py", line 241, in training_prog
    scale_loss.backward()
  File "/root/miniconda3/envs/pytorch2.0/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/root/miniconda3/envs/pytorch2.0/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
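The traceback suggests CUDA_LAUNCH_BLOCKING=1 for debugging. A minimal sketch of setting it from inside the script follows, on the assumption that the variable needs to be in place before the first CUDA call, so setting it before importing torch is the cautious choice. Exporting it in the shell before launching Python has the same effect.

```python
import os

# Make CUDA kernel launches synchronous so that errors are reported at the
# call that caused them rather than at some later API call.
# Set this before importing torch / before any CUDA work starts.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import only after the variable is set (left commented out
#               # so this snippet runs without PyTorch installed)

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Synchronous launches slow training down noticeably, so this is best left on only while chasing the device-side assert.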

danthe3rd commented 1 year ago

Maybe @fmassa ?

Rosenberg37 commented 1 year ago

pip install triton==2.0.0.dev20221105 --no-deps may work without downgrading PyTorch 2.0.

vmarkovtsev commented 1 year ago

Triton is debugging the same symptom here: https://github.com/openai/triton/issues/1271

vmarkovtsev commented 1 year ago

I confirm that pip install triton==2.0.0.dev20221105 --no-deps is compatible with PyTorch 2 and resolves the error, at least for me on an RTX 3050 with CUDA 11.8.

Zyriix commented 1 year ago

In my case, this happened because I added some other operations before FusedLayerNorm, e.g. adding some parameters together. I fixed it simply by replacing FusedLayerNorm with nn.LayerNorm. Maybe FusedLayerNorm's backward only supports a limited set of operations before it; I'm not sure. Hope this helps.

shivammehta25 commented 1 year ago

@vmarkovtsev Upgrading/Downgrading triton worked for me as well! Thanks

shivammehta25 commented 1 year ago

@Zyriix

I fix this simplely be replace FusedLayerNorm with nn.LayerNorm.

Just curious, how did you do that?

Zyriix commented 1 year ago

Just change the code in xformers/xformers/components/residual.py, in class PreNorm: disable FusedLayerNorm. I'm not sure this is a good solution. I got the error when I added an additional position embedding before the fused layernorm, and when I changed the fused layernorm to nn.LayerNorm, it worked. So maybe FusedLayerNorm has its own CUDA kernel or Triton code or something.

facebookresearch / xformers

Backend error when reinstalling environment: Argument rematerialization not implemented UNREACHABLE executed at /project/lib/Dialect/TritonGPU/Transforms/TritonGPUConversion.cpp:45! #705