0x7o opened this issue 2 years ago
Sorry, we cannot reproduce this problem from our side. Can you try using gdb or add some debug message to find the reason?
(gdb) run
Starting program: /workspace/FasterTransformer/build/bin/gptj_example
warning: Error disabling address space randomization: Operation not permitted
warning: Probes-based dynamic linker interface failed.
Reverting to original interface.
process 9390 is executing new program: /workspace/FasterTransformer/build/bin/gptj_example
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 9394]
[New Thread 0x7f1035d91000 (LWP 9398)]
[New Thread 0x7f10353a7000 (LWP 9399)]
Total ranks: 1.
[New Thread 0x7f102fdbb000 (LWP 9400)]
Device NVIDIA Tesla K80
P0 is runing with 0 GPU.
[INFO] Setting tensor_para_size to 1
[New Thread 0x7f102f592000 (LWP 9401)]
[New Thread 0x7f102ebd2000 (LWP 9402)]
[New Thread 0x7f102e3d1000 (LWP 9403)]
[Thread 0x7f102ebd2000 (LWP 9402) exited]
[New Thread 0x7f1023fff000 (LWP 9404)]
[New Thread 0x7f10237fe000 (LWP 9405)]
[Thread 0x7f102e3d1000 (LWP 9403) exited]
[New Thread 0x7f1022ffd000 (LWP 9406)]
[New Thread 0x7f10227fc000 (LWP 9407)]
Thread 1 "gptj_example" received signal SIGSEGV, Segmentation fault.
0x00007fac15af66f2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
(gdb) backtrace
#0 0x00007f1035f606f2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#1 0x00007f1035fe8a76 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2 0x00007f10424fa3a5 in ?? () from /usr/local/cuda/lib64/libcudart.so.11.0
#3 0x00007f10425390c6 in cudaMemPoolSetAccess () from /usr/local/cuda/lib64/libcudart.so.11.0
#4 0x000055e32289dde6 in fastertransformer::Allocator<(fastertransformer::AllocatorType)0>::Allocator(int) ()
#5 0x000055e3228a15d0 in void gptj_example<float>(INIReader) ()
#6 0x000055e32288ee77 in main ()
Does /usr/local/cuda/lib64/libcudart.so.11.0 link to another file?
root@abcbe3e329ca:/workspace/FasterTransformer/build# ls -l /usr/local/cuda/lib64/libcudart.so.11.0
lrwxrwxrwx 1 root root 20 May 28 2021 /usr/local/cuda/lib64/libcudart.so.11.0 -> libcudart.so.11.4.43
root@abcbe3e329ca:/workspace/FasterTransformer/build#
How about defining the macro CUDA_MEMORY_POOL_DISABLED in allocator.h directly?
I don't understand what you mean.
Change https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/utils/allocator.h#L126 to `#if 1` directly.
NCCL_LAUNCH_MODE=GROUP mpirun -n 2 --allow-run-as-root ./bin/gptj_example
Total ranks: 2.
Device NVIDIA Tesla K80
P1 is runing with 1 GPU.
[INFO] Setting tensor_para_size to 2
Device NVIDIA Tesla K80
P0 is runing with 0 GPU.
[INFO] Setting tensor_para_size to 2
[FT][WARNING] Async cudaMalloc/Free is not supported before CUDA 11.2. Using Sync cudaMalloc/Free.Note this may lead to hang with NCCL kernels launched in parallel; if so, try NCCL_LAUNCH_MODE=GROUP
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: operation not supported /workspace/FasterTransformer/src/fastertransformer/utils/allocator.h:181
[abcbe3e329ca:13275] *** Process received signal ***
[abcbe3e329ca:13275] Signal: Aborted (6)
[abcbe3e329ca:13275] Signal code: (-6)
[abcbe3e329ca:13275] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fc850260420]
[abcbe3e329ca:13275] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fc84fd4f00b]
[abcbe3e329ca:13275] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fc84fd2e859]
[abcbe3e329ca:13275] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7fc850106911]
[abcbe3e329ca:13275] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7fc85011238c]
[abcbe3e329ca:13275] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7fc8501123f7]
[abcbe3e329ca:13275] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7fc8501126a9]
[abcbe3e329ca:13275] [ 7] ./bin/gptj_example(+0x19389)[0x56484a036389]
[abcbe3e329ca:13275] [ 8] ./bin/gptj_example(+0x1e1f6)[0x56484a03b1f6]
[abcbe3e329ca:13275] [ 9] ./bin/gptj_example(+0x8de63)[0x56484a0aae63]
[abcbe3e329ca:13275] [10] ./bin/gptj_example(+0x201e1)[0x56484a03d1e1]
[abcbe3e329ca:13275] [11] ./bin/gptj_example(+0xde17)[0x56484a02ae17]
[abcbe3e329ca:13275] [12] terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: operation not supported /workspace/FasterTransformer/src/fastertransformer/utils/allocator.h:181
[abcbe3e329ca:13274] *** Process received signal ***
mpirun -n 2 --allow-run-as-root ./bin/gptj_example
Total ranks: 2.
Device NVIDIA Tesla K80
P0 is runing with 0 GPU.
[INFO] Setting tensor_para_size to 2
Device NVIDIA Tesla K80
P1 is runing with 1 GPU.
[INFO] Setting tensor_para_size to 2
[FT][WARNING] Async cudaMalloc/Free is not supported before CUDA 11.2. Using Sync cudaMalloc/Free.Note this may lead to hang with NCCL kernels launched in parallel; if so, try NCCL_LAUNCH_MODE=GROUP
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: operation not supported /workspace/FasterTransformer/src/fastertransformer/utils/allocator.h:181
[abcbe3e329ca:13250] *** Process received signal ***
It works, but I got a CUDA OOM error. Is there any way to load the neural network in parts across multiple GPUs?
I remember the K80 has 24 GB of memory, and you are using 2-way tensor parallelism. It should be able to load the model. Can you post the error?
Yes, but as two 12 GB chips. It feels like the script loads a full copy of the neural network onto each chip.
Can you post the log?
Description
Tesla K80. CUDA 11.3. cuDNN 8.2.
Reproduced Steps