CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)
MIT License

!deepspeed examples/summarize_rlhf/sft/train_gptj_summarize.py is failing #410

Open MyBruso opened 1 year ago

MyBruso commented 1 year ago

Bug

Hello,

I am trying to run the summarize_rlhf example following this blog on wandb. The script is failing with the attached logs; however, I am not able to locate the actual issue. Error snapshot:

Loading extension module utils...
Time to load utils op: 18.591187238693237 seconds
Rank: 0 partition count [1] and sizes[(6050882784, False)]
[2023-04-01 14:17:09,378] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3605
[2023-04-01 14:17:09,383] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3', '-u', 'examples/summarize_rlhf/sft/train_gptj_summarize.py', '--local_rank=0'] exits with return code = -9

Detailed error: error.txt. My environment looks like this: pip_freeze.txt

Can someone guide me in investigating this issue? Is there any other way to get more diagnostic information?

Note: Before this, I encountered a base64 decoding issue (padding error) with ds_config_gptj.json, so I converted it to a dict and added it directly in train_gptj_summarize.py.
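
For reference, a minimal sketch of what such a workaround can look like, assuming the script builds a transformers.TrainingArguments and passes the DeepSpeed config through its deepspeed argument (the paths and values below are placeholders, not the actual example code):

```python
import json

from transformers import TrainingArguments

# Load the DeepSpeed config as a plain Python dict instead of handing the
# Trainer a JSON file path, sidestepping any decoding/padding handling of
# the file itself.
with open("examples/summarize_rlhf/sft/ds_config_gptj.json") as f:
    ds_config = json.load(f)

training_args = TrainingArguments(
    output_dir="gptj-sft-checkpoint",   # placeholder output path
    per_device_train_batch_size=1,      # placeholder value
    deepspeed=ds_config,                # accepts a dict or a path to a JSON file
)
```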

Which trlX version are you using? Editable install with no version control (trlx==0.6.0)

Installed from source using pip install -e .

PhungVanDuy commented 1 year ago

Can you try rerunning with a single GPU to see the details of the error? I cannot identify the error from the log file you sent above. Thank you!

MyBruso commented 1 year ago

Hello @PhungVanDuy, I am running this experiment on Colab, so I am not sure if I can select a single-GPU instance. Is there any other way to specify which GPUs to use, or some other configuration that would give more detailed diagnostics?

The rerun failed with the same error.

superhg commented 1 year ago

export CUDA_VISIBLE_DEVICES=0
deepspeed --num_gpus=1 xxxx.py

MyBruso commented 1 year ago

Thank you @superhg, I tried your suggestion for the next run.

Some other changes before this run:

  1. changed the numpy version to 1.23.2
  2. reduced train_batch_size to 32 (from the earlier 128); see the sketch below for how DeepSpeed decomposes this value
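
For reference, a minimal sketch of the relationship DeepSpeed enforces between these batch-size fields (the micro-batch and accumulation values are illustrative assumptions, not values taken from the actual ds_config_gptj.json):

```python
# DeepSpeed requires:
#   train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
world_size = 1                     # single GPU after CUDA_VISIBLE_DEVICES=0
micro_batch_per_gpu = 4            # illustrative train_micro_batch_size_per_gpu
gradient_accumulation_steps = 8    # illustrative
train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * world_size
assert train_batch_size == 32
```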

It seems the script failed with a similar error:

[2023-04-03 07:00:36,886] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-03 07:00:36,898] [INFO] [runner.py:550:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None examples/summarize_rlhf/sft/train_gptj_summarize.py
[2023-04-03 07:00:39,489] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.16.2-1+cuda11.8
[2023-04-03 07:00:39,489] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.16.2-1
[2023-04-03 07:00:39,489] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.16.2-1
[2023-04-03 07:00:39,489] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-04-03 07:00:39,489] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.16.2-1+cuda11.8
[2023-04-03 07:00:39,489] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-04-03 07:00:39,489] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.16.2-1
[2023-04-03 07:00:39,489] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-03 07:00:39,489] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-03 07:00:39,489] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-03 07:00:39,489] [INFO] [launch.py:162:main] dist_world_size=1
[2023-04-03 07:00:39,489] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
2023-04-03 07:00:43.439126: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Found cached dataset parquet (/root/.cache/huggingface/datasets/CarperAI___parquet/CarperAI--openai_summarize_tldr-536d9955f5e6f921/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Found cached dataset parquet (/root/.cache/huggingface/datasets/CarperAI___parquet/CarperAI--openai_summarize_tldr-536d9955f5e6f921/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
[2023-04-03 07:02:41,433] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Installed CUDA version 11.8 does not match the version torch was compiled with 11.6 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu116/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu116/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.9/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.9/dist-packages/torch/include -isystem /usr/local/lib/python3.9/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.9/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.9/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /usr/local/lib/python3.9/dist-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.9/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.9/dist-packages/torch/include -isystem /usr/local/lib/python3.9/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.9/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.9/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -c /usr/local/lib/python3.9/dist-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/usr/local/lib/python3.9/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 36.96661448478699 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu116/utils...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.9/dist-packages/torch/include -isystem /usr/local/lib/python3.9/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.9/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.9/dist-packages/torch/include/THC -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /usr/local/lib/python3.9/dist-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
[2/2] c++ flatten_unflatten.o -shared -L/usr/local/lib/python3.9/dist-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 19.042503595352173 seconds
Rank: 0 partition count [1] and sizes[(6050882784, False)]
[2023-04-03 07:04:47,750] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 13658
[2023-04-03 07:04:47,751] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3', '-u', 'examples/summarize_rlhf/sft/train_gptj_summarize.py', '--local_rank=0'] exits with return code = -9

MyBruso commented 1 year ago

Hello @PhungVanDuy, do you have any suggestions on how I can get this working? Could it be due to a Python package version mismatch?

PhungVanDuy commented 1 year ago

I have no idea given this log; I guess it may be out of memory. Please try reducing the batch size to 8 or so. This log does not show any clear error.
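
If it helps with diagnostics: an exit with return code -9 means the launched process was killed with SIGKILL, which on Colab is most often the host running out of RAM rather than a Python exception. Below is a minimal sketch (not part of the trlx example; it assumes psutil is installed) that prints host memory usage around the expensive steps so the growth can be watched before the kill:

```python
import psutil  # assumed available; install with `pip install psutil` otherwise


def log_host_memory(tag: str) -> None:
    """Print this process's RSS and the remaining system RAM."""
    vm = psutil.virtual_memory()
    rss = psutil.Process().memory_info().rss
    print(f"[{tag}] rss={rss / 1e9:.1f} GB, "
          f"available={vm.available / 1e9:.1f} / {vm.total / 1e9:.1f} GB")


# Hypothetical call sites inside train_gptj_summarize.py:
# log_host_memory("before from_pretrained")
# model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
# log_host_memory("before trainer.train")
```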

MyBruso commented 1 year ago

Okay, I have already tried reducing it to 32. Let me try again with 8.

xin-li-67 commented 1 year ago

Well, I came across this issue too. I was using a single 4090, changed the batch size to 8, and it still failed. It seems it is not related to a GPU OOM problem.
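
A hedged back-of-the-envelope check that is consistent with this observation: the cpu_adam extension build in the logs suggests the config offloads optimizer state to host RAM, and that footprint depends on the parameter count rather than the batch size, so shrinking the batch would not help if the host runs out of memory. Under that assumption:

```python
# Rough estimate only; assumes ZeRO optimizer offload to CPU (suggested by the
# cpu_adam build in the logs), fp32 master weights, and standard Adam states.
params = 6_050_882_784                 # partition size reported in the log
fp32_bytes = 4
master_weights = params * fp32_bytes   # fp32 copy of the weights
adam_states = 2 * params * fp32_bytes  # Adam momentum + variance
total_host_gb = (master_weights + adam_states) / 1e9
print(f"~{total_host_gb:.0f} GB of host RAM for optimizer state alone")  # ~73 GB
```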