OpenBMB / BMInf

Efficient Inference for Big Models
Apache License 2.0
572 stars · 67 forks

[BUG] Error was raised when importing model in v1.0.x #40

Open sdjksdafji opened 2 years ago

sdjksdafji commented 2 years ago

Describe the bug
CUDA error was raised when importing models. This issue only happens with BMInf 1.0.x; I could run BMInf 0.0.5 successfully. Any help would be appreciated. Thanks.

Minimal steps to reproduce
Tried the following on both WSL2 Ubuntu 20.04 with an RTX 3080 16G and native Ubuntu 18.04 with a GTX 1070 8G:

conda create --name bminfnew python=3.8
conda activate bminfnew
conda install cudatoolkit=11.3
pip install bminf==1.0.1

Then run

import bminf
cpm2 = bminf.models.CPM2()

Expected behavior
The model starts downloading.

Screenshots

Python 3.8.12 (default, Oct 12 2021, 13:49:34) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bminf
>>> cpm2 = bminf.models.CPM2()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mira/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/models/cpm2.py", line 55, in __init__
    SizeLimitedAllocator( self._cudaAlloc.allocate( dynamic_memory ))
  File "/home/mira/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/core/allocators/cuda.py", line 20, in allocate
    ptr = cudart.cudaMalloc(nbytes).value
  File "/home/mira/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/library/base.py", line 94, in wrapper
    return f(*args, **kwargs)
  File "/home/mira/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/library/cudart.py", line 375, in cudaMalloc
    checkCUDAStatus(cuda.cudaMalloc(ctypes.byref(ptr), size))
  File "/home/mira/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/library/cudart.py", line 327, in checkCUDAStatus
    raise RuntimeError("CUDA Runtime Error: %s" % cudaGetErrorString(error))
RuntimeError: CUDA Runtime Error: out of memory

Environment: Tried with various CUDA versions, including 10.2, 11.0, and 11.3.

a710128 commented 2 years ago

BMinf will request 512MB of memory before loading the model. From your screenshot, it seems that the error is happening here. I'm going to spend some time trying to reproduce this error.

sdjksdafji commented 2 years ago

@a710128 Thanks for the quick response. Please keep me updated.

a710128 commented 2 years ago

@sdjksdafji

I ran the examples with my GTX 1070 on Windows and everything turned out fine. Could the conda environment be having some effect? Also, have you tried running generate_cpm1.py?

sdjksdafji commented 2 years ago

@a710128 I tried to import all 3 models. Surprisingly, CPM1 is fine. It started downloading after model = bminf.models.CPM1(). However, CPM2 and EVA reported the same CUDA OOM error.

Seems like your env is windows. Could you try it under Linux?

a710128 commented 2 years ago

> @a710128 I tried to import all 3 models. Surprisingly, CPM1 is fine. It started downloading after model = bminf.models.CPM1(). However, CPM2 and EVA reported the same CUDA OOM error.
>
> Seems like your env is windows. Could you try it under Linux?

I've tested it under Ubuntu 20.04 using 1080Ti, 2080Ti, V100 and it works fine.

sdjksdafji commented 2 years ago

@a710128 Could you share your installation script and cuda version?

a710128 commented 2 years ago

I'm puzzled: importing CPM1 and importing CPM2 run almost the same code, yet importing CPM2 fails at line 55. https://github.com/OpenBMB/BMInf/blob/45d0af959f8017ca78bc18e03a660daf77c46852/bminf/models/cpm2.py#L55

CPM1 https://github.com/OpenBMB/BMInf/blob/45d0af959f8017ca78bc18e03a660daf77c46852/bminf/models/cpm1.py#L26-L51

CPM2 https://github.com/OpenBMB/BMInf/blob/45d0af959f8017ca78bc18e03a660daf77c46852/bminf/models/cpm2.py#L31-L56

a710128 commented 2 years ago

> @a710128 Could you share your installation script and cuda version?

pip install bminf, with CUDA 11.1.

sdjksdafji commented 2 years ago

@a710128 Actually, the previous logs do not match my latest runs. The error actually comes from the T5 model file, during the first init of a pinned decoder layer: `self.dec_layers[i].init_data(pinned=True)`. This would explain why CPM1 works fine while EVA and CPM2 do not.

Here is my script

import bminf
from cpm_kernels.library import cudart

print(bminf.__version__)

print(cudart.cudaGetDeviceCount())
print(cudart.cudaRuntimeGetVersion())
print(cudart.cudaDriverGetVersion())

cpm2 = bminf.models.CPM2()

And the output is:

1.0.1
1
10020
11060
Traceback (most recent call last):
  File "/home/sdjksdafji/repo/mira/bminf-backend/debug.py", line 10, in <module>
    cpm2 = bminf.models.CPM2()
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/models/cpm2.py", line 57, in __init__
    self._model = T5Model(config)
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/arch/t5/model.py", line 107, in __init__
    self.dec_layers[i].init_data(pinned=True)
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/core/layer.py", line 158, in init_data
    ptr = cudart.cudaMallocHost(self.nbytes)
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/library/base.py", line 94, in wrapper
    return f(*args, **kwargs)
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/library/cudart.py", line 385, in cudaMallocHost
    checkCUDAStatus(cuda.cudaMallocHost(ctypes.byref(ptr), size))
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/library/cudart.py", line 327, in checkCUDAStatus
    raise RuntimeError("CUDA Runtime Error: %s" % cudaGetErrorString(error))
RuntimeError: CUDA Runtime Error: out of memory

Process finished with exit code 1
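For reference, the version integers printed above follow CUDA's `1000 * major + 10 * minor` encoding, so `10020` is runtime 10.2 while `11060` is driver 11.6. A small decoder (the function name here is mine, for illustration):

```python
def decode_cuda_version(v: int) -> str:
    """Decode the integer returned by cudaRuntimeGetVersion /
    cudaDriverGetVersion: CUDA encodes versions as 1000*major + 10*minor."""
    return f"{v // 1000}.{(v % 1000) // 10}"

print(decode_cuda_version(10020))  # runtime in the output above: 10.2
print(decode_cuda_version(11060))  # driver in the output above: 11.6
```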

a710128 commented 2 years ago

@sdjksdafji Try BMInf 1.0.2

sdjksdafji commented 2 years ago

@a710128 Thanks for the fix. I tried 1.0.2. The import works fine for me but the inference does not. Here is the latest error:

>>> import bminf

>>> cpm2 = bminf.models.CPM2()
Downloading cpm2.1-new/checkpoint.pt: 100%|██████████| 11.3G/11.3G [32:51<00:00, 5.73MiB/s]
Downloading cpm2.1-new/vocab.txt: 160kiB [00:00, 2.72MiB/s]
>>> text = "北京环球度假区相关负责人介绍,北京环球影城指定单日门票将采用<span>制度,即推出淡季日、平季日、旺季日和特定日门票。<span>价格为418元,<span>价格为528元,<span>价格为638元,<span>价格为<span>元。北京环球度假区将提供90天滚动价格日历,以方便游客提前规划行程。"
>>> for result in cpm2.fill_blank(text, 
...     top_p=1.0,
...     top_n=5, 
...     temperature=0.5,
...     frequency_penalty=0,
...     presence_penalty=0
... ):
...     value = result["text"]
...     text = text.replace("<span>", "\033[0;32m" + value + "\033[0m", 1)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/models/cpm2.py", line 245, in fill_blank
    for token in res:
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/models/cpm2.py", line 135, in _gen_iter
    self._model.encode(
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/arch/t5/model.py", line 195, in encode
    layer.forward(
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/layers/transformer_block.py", line 34, in forward
    self.self_attn.forward(ctx, x_mid, x_mid, mask, position_bias, x_mid)
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/layers/attention.py", line 42, in forward
    self.project_q.forward(ctx, hidden_q, h_q)
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/layers/linear.py", line 43, in forward
    ck.gemm_int8(
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/kernels/gemm.py", line 172, in gemm_int8
    cublaslt.cublasLtMatmul(
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/library/base.py", line 94, in wrapper
    return f(*args, **kwargs)
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/library/cublaslt.py", line 137, in cublasLtMatmul
    checkCublasStatus(cublasLt.cublasLtMatmul(
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/library/cublaslt.py", line 98, in checkCublasStatus
    raise RuntimeError("CUBLAS error: {}".format(
RuntimeError: CUBLAS error: CUBLAS_STATUS_EXECUTION_FAILED

BTW, I have some questions regarding the fix. It seems the actual fix is to fall back to a non-pinned numpy array if the CUDA malloc fails. Even if that works, would it affect inference performance? I assume the computation now happens on the CPU instead of the GPU, right? To me this sounds less like a fix and more like a workaround that sacrifices performance. Shall we try to figure out the root cause of the failed CUDA malloc? My 3080 has 16G of GPU memory, so the OOM error definitely does not make sense.

a710128 commented 2 years ago

> Seems like the actual fix here is to use the non-cuda pinned numpy array if the cuda malloc operation fails. Even if it worked, would it affect the inference performance? I assume now the computation happens on CPU instead of GPU, right?

Even if a regular (non-pinned) numpy array is used, the computation still happens on the GPU. The difference is that non-pinned memory spends more time transferring data from the CPU to the GPU.
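A minimal sketch of that fallback strategy (not BMInf's actual code; the allocator callable and names here are hypothetical stand-ins for cudart.cudaMallocHost):

```python
import numpy as np

def alloc_host_buffer(nbytes, cuda_malloc_host=None):
    """Try to allocate page-locked (pinned) host memory; fall back to a
    regular pageable numpy buffer if the pinned allocation fails.
    Returns (buffer, is_pinned)."""
    if cuda_malloc_host is not None:
        try:
            # Pinned memory: faster host-to-device copies.
            return cuda_malloc_host(nbytes), True
        except RuntimeError:
            pass  # e.g. the OS cap on locked memory was hit
    # Pageable fallback: GPU compute still works, transfers are just slower.
    return np.empty(nbytes, dtype=np.uint8), False

def failing_alloc(nbytes):
    # Simulates the cudaMallocHost failure seen in the tracebacks above.
    raise RuntimeError("CUDA Runtime Error: out of memory")

buf, pinned = alloc_host_buffer(1024, failing_alloc)
print(pinned)  # False: fell back to pageable memory
```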

> Shall we try to figure out the root cause of the failed CUDA malloc?

I think the root cause of the failed memory requests is a system limitation: some operating systems limit the total size of pinned memory.
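On Linux, one such limit is RLIMIT_MEMLOCK, which caps how much memory a process may lock; whether a given CUDA driver counts its pinned allocations against it can vary, but it is a quick thing to check. A small helper (the function name is mine):

```python
import resource

def memlock_limit():
    """Report the soft/hard caps on locked (pinned) host memory.
    A low limit can make pinned allocations fail even with RAM free."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    fmt = lambda v: "unlimited" if v == resource.RLIM_INFINITY else f"{v} bytes"
    return fmt(soft), fmt(hard)

print(memlock_limit())
```

WSL2 in particular is worth checking here, since its resource limits can differ from a native install.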