Open DesperadoDQY opened 1 year ago
./bin/gptneox_example
model: GPT-Neox-20B, batchsize=8, seqlenin=256, seqlenout=512, fp16
Total ranks: 1.
Device NVIDIA A100-SXM4-80GB
P0 is running with GPU #0.
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
start_id : /home/zjf/FasterTransformer/examples/cpp/gptneox/start_ids_8.csv
[WARNING] gemm_config.in is not found; using default GEMM algo
after allocation : free: 40.15 GB, total: 79.35 GB, used: 39.20 GB
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_EXECUTION_FAILED /home/zjf/workspace/FasterTransformer/src/fastertransformer/utils/cublasMMWrapper.cc:115
[ml-a100-ser160:3763638] *** Process received signal ***
[ml-a100-ser160:3763638] Signal: Aborted (6)
[ml-a100-ser160:3763638] Signal code: (-6)
[ml-a100-ser160:3763638] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f6b32834420]
[ml-a100-ser160:3763638] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f6b3231d00b]
[ml-a100-ser160:3763638] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f6b322fc859]
[ml-a100-ser160:3763638] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f6b326d4911]
[ml-a100-ser160:3763638] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f6b326e038c]
[ml-a100-ser160:3763638] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7f6b326e03f7]
[ml-a100-ser160:3763638] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7f6b326e06a9]
[ml-a100-ser160:3763638] [ 7] ./bin/gptneox_example(+0x1f738)[0x55bcac460738]
[ml-a100-ser160:3763638] [ 8] ./bin/gptneox_example(+0x23d9a7)[0x55bcac67e9a7]
[ml-a100-ser160:3763638] [ 9] ./bin/gptneox_example(+0x88a95)[0x55bcac4c9a95]
[ml-a100-ser160:3763638] [10] ./bin/gptneox_example(+0x6431c)[0x55bcac4a531c]
[ml-a100-ser160:3763638] [11] ./bin/gptneox_example(+0x2b15f)[0x55bcac46c15f]
[ml-a100-ser160:3763638] [12] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f6b322fe083]
[ml-a100-ser160:3763638] [13] ./bin/gptneox_example(+0x4cf8e)[0x55bcac48df8e]
[ml-a100-ser160:3763638] *** End of error message ***
Aborted
fusedQKV_masked_attention_dispatch generates NaN when run in fp16.
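To narrow down where the NaN first shows up, a check like the one below could be run on a host-side copy of each intermediate tensor. This is only a rough sketch: `first_non_finite` and the sample values are hypothetical and not part of FasterTransformer, and it assumes the tensor has already been copied off the GPU and converted to plain Python floats.

```python
import math
from typing import Iterable, Optional, Tuple


def first_non_finite(values: Iterable[float]) -> Optional[Tuple[int, float]]:
    """Return (index, value) of the first NaN or infinity, or None if all values are finite."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i, v
    return None


# Hypothetical dump of an attention output, already copied to the host.
sample = [0.12, -3.5, float("nan"), 1.0]
hit = first_non_finite(sample)
if hit is None:
    print("all values finite")
else:
    print(f"first non-finite value at index {hit[0]}: {hit[1]}")
```

Running the same check on the inputs and the output of the attention call would show whether the NaN is produced inside the kernel or is already present in what it receives.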
Branch/Tag/Commit
main
Docker Image Version
/docker/Dockerfile.torch
GPU name
A100
CUDA Driver
470.129.06
Reproduced Steps