sdjksdafji opened this issue 2 years ago
BMInf will request 512MB of memory before loading the model. From your screenshot, it seems that the error is happening there. I'm going to spend some time trying to reproduce it.
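In the meantime, you can check whether a large pinned allocation succeeds outside of BMInf. A minimal sketch, assuming `cpm_kernels` also exposes a `cudaFreeHost` wrapper to match the `cudaMallocHost` call that shows up in the tracebacks below:

```python
# Probe a single large pinned (page-locked) host allocation.
# cudaMallocHost(size) matches the wrapper in the tracebacks below;
# cudaFreeHost(ptr) is assumed to be the matching release call.
from cpm_kernels.library import cudart

NBYTES = 512 * 1024 * 1024  # the 512MB up-front request

try:
    ptr = cudart.cudaMallocHost(NBYTES)
    print("512MB pinned allocation OK")
    cudart.cudaFreeHost(ptr)
except RuntimeError as e:
    print("512MB pinned allocation failed:", e)
```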
@a710128 Thanks for the quick response. Please keep me updated.
@sdjksdafji I ran the examples with my GTX 1070 on Windows and everything turned out fine. Could the conda environment be interfering? Also, have you tried running `generate_cpm1.py`?
@a710128 I tried to import all 3 models. Surprisingly, CPM1 is fine: it started downloading after `model = bminf.models.CPM1()`. However, CPM2 and EVA reported the same CUDA OOM error.
> @a710128 I tried to import all 3 models. Surprisingly, CPM1 is fine: it started downloading after `model = bminf.models.CPM1()`. However, CPM2 and EVA reported the same CUDA OOM error.

Seems like your env is Windows. Could you try it under Linux? I've tested it under Ubuntu 20.04 using a 1080Ti, a 2080Ti, and a V100, and it works fine.
@a710128 Could you share your installation script and CUDA version? I'm confused: importing CPM1 and importing CPM2 run almost the same code, yet importing CPM2 fails at line 55. https://github.com/OpenBMB/BMInf/blob/45d0af959f8017ca78bc18e03a660daf77c46852/bminf/models/cpm2.py#L55
> @a710128 Could you share your installation script and CUDA version?

`pip install bminf`

CUDA 11.1
@a710128 Actually, the previous logs don't match my latest runs. The error actually comes from the T5 model file, during the first init of a pinned decoder layer:

`self.dec_layers[i].init_data(pinned=True)`

This could explain why CPM1 works fine but EVA and CPM2 do not. Here is my script:
```python
import bminf
from cpm_kernels.library import cudart

print(bminf.__version__)               # BMInf version
print(cudart.cudaGetDeviceCount())     # number of visible GPUs
print(cudart.cudaRuntimeGetVersion())  # CUDA runtime version
print(cudart.cudaDriverGetVersion())   # CUDA driver version
cpm2 = bminf.models.CPM2()             # fails here
```
And the output is:

```
1.0.1
1
10020
11060
Traceback (most recent call last):
  File "/home/sdjksdafji/repo/mira/bminf-backend/debug.py", line 10, in <module>
    cpm2 = bminf.models.CPM2()
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/models/cpm2.py", line 57, in __init__
    self._model = T5Model(config)
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/arch/t5/model.py", line 107, in __init__
    self.dec_layers[i].init_data(pinned=True)
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/core/layer.py", line 158, in init_data
    ptr = cudart.cudaMallocHost(self.nbytes)
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/library/base.py", line 94, in wrapper
    return f(*args, **kwargs)
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/library/cudart.py", line 385, in cudaMallocHost
    checkCUDAStatus(cuda.cudaMallocHost(ctypes.byref(ptr), size))
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/library/cudart.py", line 327, in checkCUDAStatus
    raise RuntimeError("CUDA Runtime Error: %s" % cudaGetErrorString(error))
RuntimeError: CUDA Runtime Error: out of memory

Process finished with exit code 1
```
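To rule out a size cap on pinned memory (as opposed to a real OOM), a rough probe like the following could binary-search the largest pinned allocation that succeeds. It assumes the same `cudaMallocHost` wrapper from the traceback plus a matching `cudaFreeHost`:

```python
# Binary-search the largest pinned (page-locked) allocation that succeeds.
from cpm_kernels.library import cudart

MB = 1024 * 1024

def max_pinned_mb(hi_mb=16 * 1024):
    lo, hi = 0, hi_mb  # invariant: lo MB succeeds; hi MB fails or is untested
    while hi - lo > 1:
        mid = (lo + hi) // 2
        try:
            ptr = cudart.cudaMallocHost(mid * MB)
            cudart.cudaFreeHost(ptr)  # assumed wrapper, mirrors cudaMallocHost
            lo = mid
        except RuntimeError:
            hi = mid
    return lo

print("largest pinned allocation: %d MB" % max_pinned_mb())
```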
@sdjksdafji Try BMInf 1.0.2
@a710128 Thanks for the fix. I tried 1.0.2: the import works fine for me, but inference does not. Here is the latest error:
```
>>> import bminf
>>> cpm2 = bminf.models.CPM2()
Downloading cpm2.1-new/checkpoint.pt: 100%|██████████| 11.3G/11.3G [32:51<00:00, 5.73MiB/s]
Downloading cpm2.1-new/vocab.txt: 160kiB [00:00, 2.72MiB/s]
>>> text = "北京环球度假区相关负责人介绍,北京环球影城指定单日门票将采用<span>制度,即推出淡季日、平季日、旺季日和特定日门票。<span>价格为418元,<span>价格为528元,<span>价格为638元,<span>价格为<span>元。北京环球度假区将提供90天滚动价格日历,以方便游客提前规划行程。"
>>> for result in cpm2.fill_blank(text,
...     top_p=1.0,
...     top_n=5,
...     temperature=0.5,
...     frequency_penalty=0,
...     presence_penalty=0
... ):
...     value = result["text"]
...     text = text.replace("<span>", "\033[0;32m" + value + "\033[0m", 1)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/models/cpm2.py", line 245, in fill_blank
    for token in res:
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/models/cpm2.py", line 135, in _gen_iter
    self._model.encode(
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/arch/t5/model.py", line 195, in encode
    layer.forward(
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/layers/transformer_block.py", line 34, in forward
    self.self_attn.forward(ctx, x_mid, x_mid, mask, position_bias, x_mid)
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/layers/attention.py", line 42, in forward
    self.project_q.forward(ctx, hidden_q, h_q)
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/bminf/layers/linear.py", line 43, in forward
    ck.gemm_int8(
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/kernels/gemm.py", line 172, in gemm_int8
    cublaslt.cublasLtMatmul(
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/library/base.py", line 94, in wrapper
    return f(*args, **kwargs)
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/library/cublaslt.py", line 137, in cublasLtMatmul
    checkCublasStatus(cublasLt.cublasLtMatmul(
  File "/home/sdjksdafji/miniconda3/envs/bminfnew/lib/python3.8/site-packages/cpm_kernels/library/cublaslt.py", line 98, in checkCublasStatus
    raise RuntimeError("CUBLAS error: {}".format(
RuntimeError: CUBLAS error: CUBLAS_STATUS_EXECUTION_FAILED
```
BTW, I have some questions regarding the fix. It seems the actual fix is to fall back to a regular (non-pinned) numpy array when the CUDA malloc fails. Even if that works, would it affect inference performance? I assume the computation now happens on the CPU instead of the GPU, right? Rather than a fix, this sounds to me like a workaround that sacrifices performance. Shall we try to figure out the root cause of the failed CUDA malloc? My 3080 has 16 GB of GPU memory, so the OOM error definitely does not make sense.
> It seems the actual fix is to fall back to a regular (non-pinned) numpy array when the CUDA malloc fails. Even if that works, would it affect inference performance? I assume the computation now happens on the CPU instead of the GPU, right?

Even if a regular (non-pinned) numpy array is used, the computation still happens on the GPU. The difference is that non-pinned memory spends more time transferring data from the CPU to the GPU.

> Shall we try to figure out the root cause of the failed CUDA malloc?

I think the root cause of the failed memory requests is some system limitation: some operating systems cap the total size of pinned memory.
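In code, the fallback is roughly the following pattern. This is a minimal sketch, not BMInf's actual implementation; only the `cudaMallocHost` wrapper is confirmed by the tracebacks in this thread:

```python
# Prefer pinned (page-locked) host memory, which allows faster,
# async-capable CPU->GPU copies; fall back to an ordinary pageable
# numpy buffer if the OS refuses the pinned request.
import numpy as np
from cpm_kernels.library import cudart

def alloc_host_buffer(nbytes):
    try:
        return cudart.cudaMallocHost(nbytes), True  # pinned
    except RuntimeError:
        # Pageable fallback: the GPU still does all the compute;
        # only the CPU->GPU transfer of these bytes gets slower.
        return np.empty(nbytes, dtype=np.uint8), False
```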
**Describe the bug**
A CUDA error is raised when importing models. This issue only happens with the BMInf 1.0.x versions; BMInf 0.0.5 runs successfully. Any help would be appreciated. Thanks.
**Minimal steps to reproduce**
Tried the following on both:
- WSL2 Ubuntu 20.04 with an RTX 3080 16G
- native Ubuntu 18.04 with a GTX 1070 8G

Then run (the same two lines used in the session above):
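```python
import bminf
cpm2 = bminf.models.CPM2()
```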
**Expected behavior**
Start downloading the model.
**Environment**
Tried with various CUDA versions, including 10.2, 11.0, and 11.3.