Error after enabling CPU offload with ZeRO-3

Closed: Gary-code closed this issue 1 month ago
是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
当前行为 | Current Behavior
After enabling CPU offload with ZeRO-3, training errors out:
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] /home/gary/miniconda3/bin/nvcc --generate-dependencies-with-compile --dependency-output custom_cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/gary/miniconda3/lib/python3.12/site-packages/deepspeed/ops/csrc/includes -I/home/gary/miniconda3/include -isystem /home/gary/miniconda3/lib/python3.12/site-packages/torch/include -isystem /home/gary/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /home/gary/miniconda3/lib/python3.12/site-packages/torch/include/TH -isystem /home/gary/miniconda3/lib/python3.12/site-packages/torch/include/THC -isystem /home/gary/miniconda3/include -isystem /home/gary/miniconda3/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -c /home/gary/miniconda3/lib/python3.12/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
[2/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/gary/miniconda3/lib/python3.12/site-packages/deepspeed/ops/csrc/includes -I/home/gary/miniconda3/include -isystem /home/gary/miniconda3/lib/python3.12/site-packages/torch/include -isystem /home/gary/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /home/gary/miniconda3/lib/python3.12/site-packages/torch/include/TH -isystem /home/gary/miniconda3/lib/python3.12/site-packages/torch/include/THC -isystem /home/gary/miniconda3/include -isystem /home/gary/miniconda3/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/home/gary/miniconda3/lib -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/gary/miniconda3/lib/python3.12/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[3/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/gary/miniconda3/lib/python3.12/site-packages/deepspeed/ops/csrc/includes -I/home/gary/miniconda3/include -isystem /home/gary/miniconda3/lib/python3.12/site-packages/torch/include -isystem /home/gary/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /home/gary/miniconda3/lib/python3.12/site-packages/torch/include/TH -isystem /home/gary/miniconda3/lib/python3.12/site-packages/torch/include/THC -isystem /home/gary/miniconda3/include -isystem /home/gary/miniconda3/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/home/gary/miniconda3/lib -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/gary/miniconda3/lib/python3.12/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o
[4/4] c++ cpu_adam.o cpu_adam_impl.o custom_cuda_kernel.cuda.o -shared -lcurand -L/home/gary/miniconda3/lib/python3.12/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/home/gary/miniconda3/lib -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 18.459045886993408 seconds
Parameter Offload: Total persistent parameters: 2639600 in 486 params
E0828 09:26:42.875000 140062897721728 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 0 (pid: 62999) of binary:
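For reference, ZeRO-3 CPU offload is switched on through the DeepSpeed config, and it is the `offload_optimizer` / `offload_param` settings that pull in the `cpu_adam` JIT build shown in the log above. The reporter's actual config and training script were not posted, so the sketch below only illustrates the general shape of such a setup; every concrete value in it is a placeholder, not taken from this report:

```python
# Sketch only: a typical ZeRO-3 + CPU-offload DeepSpeed config of the kind described
# in this report. All values (batch size, dtype, file name) are placeholder assumptions.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 1,      # placeholder
    "gradient_accumulation_steps": 1,         # placeholder
    "bf16": {"enabled": True},                # consistent with BF16_AVAILABLE in the build log
    "zero_optimization": {
        "stage": 3,                                                   # ZeRO-3
        "offload_optimizer": {"device": "cpu", "pin_memory": True},   # optimizer states -> host RAM (builds cpu_adam)
        "offload_param": {"device": "cpu", "pin_memory": True},       # parameters -> host RAM
    },
}

# Written out like this, the file can be passed to the Hugging Face Trainer via
# --deepspeed ds_zero3_offload.json, or the dict can be handed directly to
# deepspeed.initialize(config=...).
with open("ds_zero3_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```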
期望行为 | Expected Behavior
No response
复现方法 | Steps To Reproduce
No response
运行环境 | Environment

- OS:
- Python: 3.10
- Transformers:
- PyTorch: 2.3
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
- GPU: dual NVIDIA 4090
备注 | Anything else?
No response

There doesn't seem to be an error message here. Was the training interrupted?
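For what it's worth, `exitcode: -9` means the rank was killed with SIGKILL rather than raising a Python exception, and with ZeRO-3 plus CPU offload this is most often the Linux OOM killer reclaiming host RAM, since parameters and optimizer states are pushed into CPU memory. DeepSpeed ships a memory estimator that reports the expected CPU and GPU requirements for each offload combination before training starts. A minimal sketch, using a toy stand-in model rather than the reporter's actual model, is below:

```python
# Minimal sketch, assuming the failure is host-RAM exhaustion: estimate the CPU/GPU
# memory ZeRO-3 needs for a given model on 2 GPUs (the dual-4090 setup above).
# The tiny Sequential model is a placeholder; pass the real model in practice.
import torch.nn as nn
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)])  # stand-in model

# Prints per-GPU and per-CPU memory estimates for each offload_param/offload_optimizer
# combination, so they can be checked against the machine's available RAM.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=2, num_nodes=1)
```

Checking `dmesg` for oom-killer messages right after the crash, and `free -h` before launching, would confirm or rule out this explanation.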