Tencent / TurboTransformers

A fast and user-friendly runtime for transformer inference (BERT, ALBERT, GPT2, decoders, etc.) on CPU and GPU.

[TT_ERROR] CUDA runtime error: an illegal memory access was encountered TurboTransformers/turbo_transformers/core/cuda_device_context.cpp:33 #191

Open · auspicious3000 opened this issue 4 years ago

auspicious3000 commented 4 years ago

Below is rough code to explain what I did.

import torch.multiprocessing as mp

def inference(config):
    data_loader = get_loader(config)  # user-defined data loader
    while True:
        for step in range(128):
            dec_outs, _ = turbo_decoder(current_pred,
                                        memory_bank,
                                        step,
                                        memory_lengths=memory_lengths)

ctx = mp.get_context("spawn")
p = ctx.Process(target=inference, args=(config,))  # args must be a tuple
p.start()
p.join()

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I got the above error when using turbo_decoder to generate data for training. The error can appear at any iteration: sometimes after hundreds of calls to inference, sometimes after thousands. It looks similar to #174, but I have not found a real solution in two days.

Hopefully you could shed some light on this.

feifeibear commented 4 years ago

Did huggingface's decoder work for you?

auspicious3000 commented 4 years ago

I have only tried the OpenNMT transformer decoder, because I have many customized operations and OpenNMT looks more flexible.

feifeibear commented 4 years ago

So your code works when using OpenNMT, but fails after switching to Turbo?

auspicious3000 commented 4 years ago

Yes, the OpenNMT decoder works fine. I just used a transformer decoder similar to the one in "Attention Is All You Need". Basically, I used the transformer decoder in modeling_decoders.py and initialized it with the from_onmt method.

feifeibear commented 4 years ago

It looks like you are using Turbo correctly, so it may be a bad case in the Turbo decoder. Can you run the OpenNMT decoder and Turbo simultaneously and check the results of the two decoders at each step?

auspicious3000 commented 4 years ago

I have done that, and the results match. However, when inference is called many times, it may still break at some point.

Specifically, it looks like the failure happens at the while-loop level rather than inside the inner for loop shown in the example code above.

feifeibear commented 4 years ago

Did you watch the device memory usage? The crash may result from excessive GPU memory consumption. You can monitor it with:

nvidia-smi --id=0 --query-compute-apps=used_memory --format=csv -lms 100

auspicious3000 commented 4 years ago

I checked that as well; GPU memory usage stayed below 50%.

auspicious3000 commented 4 years ago

By the way, I found that after removing the "with torch.no_grad()" around the inference function, it is less likely to crash and runs longer. Not sure if this information is helpful.

feifeibear commented 4 years ago

I have some hints for you to debug.

  1. Use CPU only to check whether the results always match ONMT (see the sketch below).
  2. Try upgrading your PyTorch version, since I am not sure the error is thrown from Turbo.
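
For hint 1, a cross-check along these lines would do (just a sketch: the decoder and tensor names follow the example code earlier in the thread, everything is assumed to be on CPU, and the tolerances are arbitrary):

import torch

# Compare the OpenNMT decoder and the Turbo decoder step by step on CPU.
# onmt_decoder / turbo_decoder, current_pred, memory_bank and memory_lengths
# are the same objects as in the example code above, moved to CPU.
with torch.no_grad():
    for step in range(128):
        onmt_outs, _ = onmt_decoder(current_pred, memory_bank, step,
                                    memory_lengths=memory_lengths)
        turbo_outs, _ = turbo_decoder(current_pred, memory_bank, step,
                                      memory_lengths=memory_lengths)
        assert torch.allclose(onmt_outs, turbo_outs, atol=1e-3, rtol=1e-3), \
            f"decoder outputs diverge at step {step}"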

auspicious3000 commented 4 years ago

I spent another two days debugging. I switched gcc versions, PyTorch versions and Docker images, and tried everything else I could think of, but once the number of calls gets high the error still appears. What I basically found is that the fewer operations I do on the turbo decoder's inputs and outputs, the longer it runs; the inputs and outputs have to be deep-copied before any operation, otherwise it errors out very quickly. Even so, that does not fundamentally solve the problem: after roughly ten thousand calls at most it still crashes. My guess is that Turbo itself may be unstable and prone to problems when called repeatedly many times?

feifeibear commented 4 years ago

It may be that cub, the third-party library used for memory management, is unstable. Try changing the naive allocator so that it allocates directly from device memory:

return allocate_impl(size, kDLGPU);
allocate_free(mem, kDLGPU);

https://github.com/Tencent/TurboTransformers/blob/master/turbo_transformers/core/allocator/naive_allocator.h#L48
https://github.com/Tencent/TurboTransformers/blob/master/turbo_transformers/core/allocator/naive_allocator.h#L73

auspicious3000 commented 4 years ago

Just to confirm: should I replace lines 49-63 and lines 74-80 with the two lines of code you suggested above, and then recompile?

feifeibear commented 4 years ago

Yes.

auspicious3000 commented 4 years ago

Now the build fails... please help~

[180/270] Building CXX object turbo_transformers...eFiles/tt_core.dir/allocator/allocator_api.cpp.o
FAILED: turbo_transformers/core/CMakeFiles/tt_core.dir/allocator/allocator_api.cpp.o
/usr/bin/c++  -DLOGURU_WITH_STREAMS=1 -DTT_BLAS_USE_MKL -DTT_WITH_CUDA -D__CLANG_SUPPORT_DYN_ANNOTATION__ -I/usr/local/cuda/include -I/mnt/TurboTransformers/3rd/cub -I/mnt/TurboTransformers/3rd/FP16/include -I/mnt/TurboTransformers -I/opt/miniconda3/include -I/mnt/TurboTransformers/3rd/abseil -I/mnt/TurboTransformers/3rd/dlpack/include -I/mnt/TurboTransformers/3rd/loguru -Wall -m64 -fopenmp -O3 -DNDEBUG -fPIC   -std=gnu++14 -MD -MT turbo_transformers/core/CMakeFiles/tt_core.dir/allocator/allocator_api.cpp.o -MF turbo_transformers/core/CMakeFiles/tt_core.dir/allocator/allocator_api.cpp.o.d -o turbo_transformers/core/CMakeFiles/tt_core.dir/allocator/allocator_api.cpp.o -c /mnt/TurboTransformers/turbo_transformers/core/allocator/allocator_api.cpp
In file included from /mnt/TurboTransformers/turbo_transformers/core/allocator/allocator_api.cpp:21:0:
/mnt/TurboTransformers/turbo_transformers/core/allocator/model_aware_allocator.h: In constructor 'turbo_transformers::core::allocator::ModelAwareAllocator::ModelAwareAllocator(const string&)':
/mnt/TurboTransformers/turbo_transformers/core/allocator/model_aware_allocator.h:118:13: warning: 'turbo_transformers::core::allocator::ModelAwareAllocator::gpu_chunk_list_' will be initialized after [-Wreorder]
   ChunkList gpu_chunk_list_;
             ^~~~~~~~~~~~~~~
/mnt/TurboTransformers/turbo_transformers/core/allocator/model_aware_allocator.h:114:13: warning:   'turbo_transformers::core::allocator::ChunkList turbo_transformers::core::allocator::ModelAwareAllocator::cpu_chunk_list_' [-Wreorder]
   ChunkList cpu_chunk_list_;
             ^~~~~~~~~~~~~~~
/mnt/TurboTransformers/turbo_transformers/core/allocator/model_aware_allocator.h:34:12: warning:   when initialized here [-Wreorder]
   explicit ModelAwareAllocator(const std::string& model_name)
            ^~~~~~~~~~~~~~~~~~~
In file included from /mnt/TurboTransformers/turbo_transformers/core/allocator/allocator_api.cpp:22:0:
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h: In member function 'virtual void* turbo_transformers::core::allocator::NaiveAllocator::allocate(size_t, DLDeviceType, const string&)':
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:50:21: error: 'mem' was not declared in this scope
       allocate_free(mem, kDLGPU);
                     ^~~
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:50:7: error: 'allocate_free' was not declared in this scope
       allocate_free(mem, kDLGPU);
       ^~~~~~~~~~~~~
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:50:7: note: suggested alternative: 'allocate_impl'
       allocate_free(mem, kDLGPU);
       ^~~~~~~~~~~~~
       allocate_impl
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h: In member function 'virtual void turbo_transformers::core::allocator::NaiveAllocator::free(void*, DLDeviceType, const string&)':
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:61:28: error: 'size' was not declared in this scope
       return allocate_impl(size, kDLGPU);
                            ^~~~
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:61:28: note: suggested alternative: 'dysize'
       return allocate_impl(size, kDLGPU);
                            ^~~~
                            dysize
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:61:40: error: return-statement with a value, in function returning 'void' [-fpermissive]
       return allocate_impl(size, kDLGPU);
                                        ^
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:62:7: error: 'allocate_free' was not declared in this scope
       allocate_free(mem, kDLGPU);
       ^~~~~~~~~~~~~
/mnt/TurboTransformers/turbo_transformers/core/allocator/naive_allocator.h:62:7: note: suggested alternative: 'allocate_impl'
       allocate_free(mem, kDLGPU);
       ^~~~~~~~~~~~~
       allocate_impl
[187/270] Building CXX object turbo_transformers...iles/catch2_test_main.dir/catch2_test_main.cpp.o
ninja: build stopped: subcommand failed.

feifeibear commented 4 years ago

Use free_impl(mem, kDLGPU); take a look at the CPU-branch API in the same file and just change the device to kDLGPU.

auspicious3000 commented 4 years ago

It does look like the stability is related to GPU memory allocation: after modifying the naive allocator as you suggested, it ran for more than forty thousand calls before erroring out. I also observed two things. First, the data loader outputs 8 tensors, each with a tensor.to(gpu) call, but only two of them are actually used at the moment; removing the tensor.to(gpu) calls for the unused tensors makes it more stable and it runs longer. Second, wrapping the encoder call in with torch.no_grad() makes it unstable, but if I also apply .data.clone() to the encoder outputs it gets much better. Overall the error rate has dropped a lot, but it still crashes after enough calls. Is there any other way to improve the stability across many calls?

feifeibear commented 4 years ago

Watch whether your GPU memory consumption is stable. The crash may be caused by unreleased memory gradually accumulating.
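
From the Python side, something like this can log the PyTorch-visible usage every few calls (just a sketch; it only covers PyTorch's caching allocator, so keep nvidia-smi running alongside it for the total device usage, since Turbo allocates through its own allocator):

import torch

# Log PyTorch-side GPU memory every `every` calls to spot steady growth.
# Memory allocated inside TurboTransformers' own allocator is not included here.
def log_gpu_memory(step, every=100):
    if step % every == 0:
        allocated = torch.cuda.memory_allocated() / 2**20  # MiB
        reserved = torch.cuda.memory_reserved() / 2**20    # MiB
        print(f"step {step}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")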

auspicious3000 commented 4 years ago

I observed that before. Even at the moment of the crash, GPU memory never exceeded 50%. I have also tried enqueuing large batches to make the GPU run out of memory; in that case it reports an out-of-memory error directly, rather than this "an illegal memory access was encountered". That said, although memory never overflowed, it did grow slowly by a small amount, probably because dequeuing could not keep up with enqueuing.

feifeibear commented 4 years ago

Check the memory usage with cuda-memcheck.

auspicious3000 commented 4 years ago

OK, thanks for the pointers~

auspicious3000 commented 4 years ago

These lines are printed every time before the program starts. May I ask what they mean? Could they be related to the earlier errors?

date       time         ( uptime  ) [ thread name/id ]                   file:line     v| 
2020-11-13 17:15:14.559 (   0.000s) [main thread     ]             loguru.cpp:610   INFO| arguments: turbo_transformers_cxx
2020-11-13 17:15:14.559 (   0.000s) [main thread     ]             loguru.cpp:613   INFO| Current dir: /mnt/parallel
2020-11-13 17:15:14.559 (   0.000s) [main thread     ]             loguru.cpp:615   INFO| stderr verbosity: 0
2020-11-13 17:15:14.559 (   0.000s) [main thread     ]             loguru.cpp:616   INFO| -----------------------------------
date       time         ( uptime  ) [ thread name/id ]                   file:line     v| 
2020-11-13 17:15:16.154 (   0.000s) [main thread     ]             loguru.cpp:610   INFO| arguments: turbo_transformers_cxx
2020-11-13 17:15:16.154 (   0.000s) [main thread     ]             loguru.cpp:613   INFO| Current dir: /mnt/parallel
2020-11-13 17:15:16.154 (   0.000s) [main thread     ]             loguru.cpp:615   INFO| stderr verbosity: 0
2020-11-13 17:15:16.154 (   0.000s) [main thread     ]             loguru.cpp:616   INFO| -----------------------------------
date       time         ( uptime  ) [ thread name/id ]                   file:line     v| 
2020-11-13 17:15:17.468 (   0.000s) [main thread     ]             loguru.cpp:610   INFO| arguments: turbo_transformers_cxx
2020-11-13 17:15:17.468 (   0.000s) [main thread     ]             loguru.cpp:613   INFO| Current dir: /mnt/parallel
2020-11-13 17:15:17.468 (   0.000s) [main thread     ]             loguru.cpp:615   INFO| stderr verbosity: 0
2020-11-13 17:15:17.468 (   0.000s) [main thread     ]             loguru.cpp:616   INFO| -----------------------------------
date       time         ( uptime  ) [ thread name/id ]                   file:line     v| 
2020-11-13 17:15:18.781 (   0.000s) [main thread     ]             loguru.cpp:610   INFO| arguments: turbo_transformers_cxx
2020-11-13 17:15:18.781 (   0.000s) [main thread     ]             loguru.cpp:613   INFO| Current dir: /mnt/parallel
2020-11-13 17:15:18.781 (   0.000s) [main thread     ]             loguru.cpp:615   INFO| stderr verbosity: 0
2020-11-13 17:15:18.781 (   0.000s) [main thread     ]             loguru.cpp:616   INFO| -----------------------------------
date       time         ( uptime  ) [ thread name/id ]                   file:line     v| 
2020-11-13 17:15:18.793 (   0.000s) [main thread     ]             loguru.cpp:610   INFO| arguments: turbo_transformers_cxx
2020-11-13 17:15:18.793 (   0.000s) [main thread     ]             loguru.cpp:613   INFO| Current dir: /mnt/parallel
2020-11-13 17:15:18.793 (   0.000s) [main thread     ]             loguru.cpp:615   INFO| stderr verbosity: 0
2020-11-13 17:15:18.793 (   0.000s) [main thread     ]             loguru.cpp:616   INFO| -----------------------------------
date       time         ( uptime  ) [ thread name/id ]                   file:line     v| 
2020-11-13 17:15:18.802 (   0.000s) [main thread     ]             loguru.cpp:610   INFO| arguments: turbo_transformers_cxx
2020-11-13 17:15:18.803 (   0.000s) [main thread     ]             loguru.cpp:613   INFO| Current dir: /mnt/parallel
2020-11-13 17:15:18.803 (   0.000s) [main thread     ]             loguru.cpp:615   INFO| stderr verbosity: 0
2020-11-13 17:15:18.803 (   0.000s) [main thread     ]             loguru.cpp:616   INFO| -----------------------------------
2020-11-13 17:15:20.499 (   1.705s) [main thread     ]             loguru.cpp:489   INFO| atexit
2020-11-13 17:15:20.956 (   2.175s) [main thread     ]             loguru.cpp:489   INFO| atexit
2020-11-13 17:15:20.995 (   2.192s) [main thread     ]             loguru.cpp:489   INFO| atexit
2020-11-13 17:15:22.328 (   4.859s) [main thread     ]             loguru.cpp:489   INFO| atexit

feifeibear commented 4 years ago

It is unrelated. You can turn it off with: turbo_transformers.set_stderr_verbose_level(0)
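
For example (a minimal snippet; call it once right after importing the package, before building any model):

import turbo_transformers

# Silence the loguru banner printed by the C++ core, as suggested above.
turbo_transformers.set_stderr_verbose_level(0)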

aihebaikaishui commented 3 years ago

Hello! Following your suggestion, I changed lines 49-63 and lines 74-80 of naive_allocator.h to return allocate_impl(size, kDLGPU); and free_impl(mem, kDLGPU); respectively. After recompiling, I still get "an illegal memory access was encountered" at runtime in the same place, so it had no effect. Is there anything else that needs to be modified? Thanks!

feifeibear commented 3 years ago

> Hello! Following your suggestion, I changed lines 49-63 and lines 74-80 of naive_allocator.h to return allocate_impl(size, kDLGPU); and free_impl(mem, kDLGPU); respectively. After recompiling, I still get "an illegal memory access was encountered" at runtime in the same place, so it had no effect. Is there anything else that needs to be modified? Thanks!

You also hit this problem only after running quite a few steps, right?

aihebaikaishui commented 3 years ago

On my side it only takes a few hundred steps to hit the problem. Following your suggestion I made the changes above, but it still has no effect. Besides those two changes, return allocate_impl(size, kDLGPU); and free_impl(mem, kDLGPU);, is there anything else that needs to be modified? Thanks!

feifeibear commented 3 years ago

> On my side it only takes a few hundred steps to hit the problem. Following your suggestion I made the changes above, but it still has no effect. Besides those two changes, return allocate_impl(size, kDLGPU); and free_impl(mem, kDLGPU);, is there anything else that needs to be modified? Thanks!

My guess is that the CUDA implementation of multi-headed attention has a memory leak. Could you extract a simple unit test for me to debug?

aihebaikaishui commented 3 years ago

I cannot copy material off my company's machines, so I may need to see whether I can reproduce the error on my own computer. Thanks!

feifeibear commented 3 years ago

You could randomly initialize an encoder-decoder model and then force the decoder to run 10,000 steps. The decoders I tested ran at most a hundred-odd steps, so the memory leak may simply not have shown up.
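
Something along these lines should be enough as a stress test (just a sketch: the decoder call follows the signature used earlier in this thread, and the shapes and values below are arbitrary placeholders, so use whatever shapes your real pipeline feeds the decoder):

import torch

# turbo_decoder is a randomly initialized decoder wrapped via from_onmt, as before.
# batch / src_len / model_dim and the token ids are arbitrary placeholders.
batch, src_len, model_dim = 8, 256, 256
memory_bank = torch.rand(batch, src_len, model_dim, device="cuda")
memory_lengths = torch.full((batch,), src_len, dtype=torch.long, device="cuda")
current_pred = torch.randint(0, 100, (batch, 1, 1), device="cuda")

with torch.no_grad():
    for step in range(10000):
        dec_outs, _ = turbo_decoder(current_pred, memory_bank, step,
                                    memory_lengths=memory_lengths)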

aihebaikaishui commented 3 years ago

Hello! On my side the illegal-memory-access error (an illegal memory access was encountered) seems hard to trigger when decoding a single random sentence; I tried increasing the number of steps to several thousand and saw no problem. The situation where it mainly shows up for me is looping over a set of sentences and translating them: after some number of iterations, the illegal-memory error appears. It may happen after a few dozen sentences, after a few hundred, or sometimes not at all before the run finishes.

feifeibear commented 3 years ago

@aihebaikaishui @auspicious3000 I just attempted a fix for the memory-leak bug in multi-headed attention. If it is convenient, please pull the develop branch and test it. For this test you do not need to change the allocator code.

auspicious3000 commented 3 years ago

Thanks for the effort. I will find time to test it in the next few days.

auspicious3000 commented 3 years ago

@feifeibear After another week or so of trying all kinds of things, the problem is still there, except that the error now shows up in stdout instead of stderr. The errors are:

[TT_ERROR] CUDA runtime error: an illegal memory access was encountered /mnt/TurboTransformers/turbo_transformers/core/allocator/allocator_impl.cpp:29 

[TT_ERROR] CUDA runtime error: an illegal memory access was encountered /mnt/TurboTransformers/turbo_transformers/core/cuda_device_context.cpp:33

From experience, the error is very easy to trigger when memory_bank is long; with a short memory_bank it can run for a relatively long time. Whenever it crashes, GPU memory usage is around three to four thousand MiB out of a total of 16160 MiB.

TurboTransformers is very fast and pleasant to use and has helped our project a lot. I sincerely hope this bug can be fixed in the future.

feifeibear commented 3 years ago

Hi @auspicious3000, I would like to help you fix this bug. From your description it seems Turbo has a bug when handling a large memory_bank. Could you tell me the maximum memory_bank size in your program?

auspicious3000 commented 3 years ago

memory_bank can be up to 806 in length, with a model dimension of 256. I have tried batch sizes of 1, 4, 8 and 16, but that does not seem to matter. The reason I suspect the memory_bank size is that I tried two versions of the data loader. With the first, the input lengths are uniformly distributed between 64 and 192; in that case Turbo's memory usage is small, fluctuates little, and it essentially never crashes. With the second, the input lengths range from 64 to 806 and are non-uniformly distributed, mostly near 64; in that case the memory usage fluctuates a lot and it crashes easily. These are just surface observations, and I do not know whether they are related. If you need code that reproduces the bug I can provide it, but it will take some time, and the code itself will not be very simple.

One more thing: if I do not perform any follow-up operations on Turbo's outputs, it does not seem to crash; but as soon as I do anything with the outputs, such as addition, subtraction, multiplication or division, it crashes.

Thanks a lot~

feifeibear commented 3 years ago

memory_bank (FloatTensor): (batch_size, src_len, model_dim) is a tensor, so the maximum is 806 for src_len and 256 for model_dim? input_tensor (FloatTensor): (batch_size, T, model_dim). Could you give me the maximum sizes of both memory_bank and input_tensor?

auspicious3000 commented 3 years ago

memory_bank: (80, 806, 256), input: (80, 806, 256)

The encoder is an ordinary, simple 4-layer transformer encoder, essentially the same as in the original paper; I use the OpenNMT encoder directly. input is the encoder input and memory_bank is the encoder output. Besides these there are auxiliary inputs and outputs such as memory_length and input length. Inputs are padded to the length of the longest token sequence in the current batch.

Although the length can reach 806, it may already have crashed before reaching 806.

MagiaSN commented 3 years ago

I have encountered a similar issue: the process was terminated by SIGABRT randomly during inference. It was hard to reproduce, and I found the following tips helpful for reproducing it (a minimal setup snippet follows the list):

  1. set PYTORCH_NO_CUDA_MEMORY_CACHING=1 if you are using PyTorch together with Turbo;
  2. set CUDA_LAUNCH_BLOCKING=1 to disable asynchronous kernel launches in CUDA;
  3. run the process under cuda-memcheck.
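
For example, the first two can be set from Python as long as it happens before torch is imported (a minimal sketch; the script name at the end is a placeholder):

import os

# Set both variables before importing torch so they take effect when the CUDA
# context is created.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"  # bypass PyTorch's caching allocator
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"            # surface CUDA errors at the failing call

import torch  # noqa: E402

# ...build the model and run inference as usual; additionally, launch the whole
# script under cuda-memcheck, e.g.  cuda-memcheck python infer.py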

#237 fixed this problem for me.