Hello! Baichuan's modeling code currently appears to use xops.memory_efficient_attention to enable flash-attention. Without flash-attention, memory usage grows quadratically with sequence length, so inference at a length of 20k can indeed OOM. We suggest enabling flash-attention via from xformers import ops as xops.
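For a sense of scale: without flash-attention, the full attention score matrix at 20k tokens is roughly 20000 x 20000 fp16 values, about 0.75 GB per head per layer, which is where the OOM comes from. Below is a minimal sketch of the memory-efficient attention call; the shapes, tensor names, and the causal mask are illustrative, not Baichuan's actual modeling code.

import torch
from xformers import ops as xops

batch, seq_len, n_heads, head_dim = 1, 21000, 40, 128  # illustrative sizes
q = torch.randn(batch, seq_len, n_heads, head_dim,
                dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# memory_efficient_attention takes (batch, seq_len, n_heads, head_dim) tensors
# and never materializes the full seq_len x seq_len score matrix, so peak
# memory grows roughly linearly rather than quadratically with length.
out = xops.memory_efficient_attention(
    q, k, v,
    attn_bias=xops.LowerTriangularMask(),  # causal mask for decoder-only models
)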
Thanks for your answer, but in the environment below, installing xformers always reinstalls torch for me, and the reinstalled setup then fails. What exactly should I do?
bitsandbytes 0.41.1
open-clip-torch 2.20.0
peft 0.5.0
pytorch-lightning 1.7.7
pytorch-metric-learning 2.3.0
pytorch-wavelets 1.3.0
pytorch-wpe 0.0.1
pytorch3d 0.7.4
rotary-embedding-torch 0.3.0
sentencepiece 0.1.99
taming-transformers-rom1504 0.0.6
torch 2.0.1+cu118
torch-complex 0.4.3
torch-scatter 2.1.1
torchaudio 2.0.2+cu118
torchmetrics 0.11.4
torchsummary 1.5.1
torchvision 0.15.2+cu118
transformers 4.34.1
transformers-stream-generator 0.0.4
DEPRECATION: pytorch-lightning 1.7.7 has a non-standard dependency specifier torch>=1.9.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pytorch-lightning or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
Installing collected packages: triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, nvidia-cusparse-cu12, nvidia-cudnn-cu12, nvidia-cusolver-cu12, torch, xformers
Attempting uninstall: triton
Found existing installation: triton 2.0.0
Uninstalling triton-2.0.0:
Successfully uninstalled triton-2.0.0
Attempting uninstall: torch
Found existing installation: torch 2.0.1+cu118
Uninstalling torch-2.0.1+cu118:
Successfully uninstalled torch-2.0.1+cu118
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
fairseq 0.12.2 requires hydra-core<1.1,>=1.0.7, but you have hydra-core 1.3.2 which is incompatible.
fairseq 0.12.2 requires omegaconf<2.1, but you have omegaconf 2.3.0 which is incompatible.
fastai 2.7.12 requires torch<2.1,>=1.7, but you have torch 2.1.1 which is incompatible.
torchaudio 2.0.2+cu118 requires torch==2.0.1, but you have torch 2.1.1 which is incompatible.
torchvision 0.15.2+cu118 requires torch==2.0.1, but you have torch 2.1.1 which is incompatible.
wenetruntime 1.11.0 requires torch==1.11.0, but you have torch 2.1.1 which is incompatible.
Successfully installed nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.18.1 nvidia-nvjitlink-cu12-12.3.101 nvidia-nvtx-cu12-12.1.105 torch-2.1.1 triton-2.1.0 xformers-0.0.23
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
root@dsw-5143-9979c9f66-knjrl:/mnt/workspace/extend_length/test# python
Python 3.8.16 (default, Jun 12 2023, 18:09:05)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
What you are running into looks like a conflict between packages; you could try asking in the xformers GitHub repo.
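Judging from your log, pip pulled in xformers 0.0.23, which pins torch 2.1.1; the default PyPI wheel of torch 2.1.1 bundles CUDA 12.1 libraries, while your driver only supports CUDA 11.8, which is why torch.cuda.is_available() now returns False. One option, assuming a wheel for your platform exists, is to pin an older xformers release built against torch 2.0.1 (e.g. xformers 0.0.21, but please verify which release matches your torch) and then sanity-check the pairing:

# Quick sanity check after reinstalling a torch/xformers pair built for the
# same CUDA version (e.g. torch 2.0.1+cu118 with a matching xformers wheel).
import torch
import xformers

print(torch.__version__)          # expect a +cu118 build on a CUDA 11.8 driver
print(torch.version.cuda)         # bundled CUDA runtime; should print 11.8 here
print(torch.cuda.is_available())  # True once the runtime matches the driver
print(xformers.__version__)       # the pinned xformers release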
I want to compare the effects of direct extrapolation, interpolation, and NTK, so I deleted the truncation logic in pred.py. At seq_len = 21000, my A800 then reported an OOM error. Is this expected, and how should I fix it? I am not using xformers.
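For context, the three RoPE variants I am comparing look roughly like this. This is a minimal sketch; the function name, the scale factor, and the defaults are my own illustration, not code from pred.py.

import torch

def rope_angles(head_dim, seq_len, base=10000.0, mode="extrapolation", scale=4.0):
    if mode == "ntk":
        # NTK-aware scaling: enlarge the base so high-frequency dimensions are
        # barely changed while low-frequency ones are effectively interpolated.
        base = base * scale ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    if mode == "interpolation":
        # Position interpolation: compress positions back into the trained range.
        t = t / scale
    # mode == "extrapolation": feed positions beyond the trained length unchanged.
    return torch.outer(t, inv_freq)  # angles for the cos/sin tables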