OpenGVLab / VideoMAEv2

[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
https://arxiv.org/abs/2303.16727
MIT License
527 stars, 63 forks

Error when running run_class_finetuning.py #39

Closed: Ravindu-Yasas-Nagasinghe closed this issue 8 months ago

Ravindu-Yasas-Nagasinghe commented 1 year ago

Detected CUDA files, patching ldflags
Emitting ninja build file /home/ravindu.nagasinghe/.cache/torch_extensions/py38_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/adam -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/TH -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/THC -isystem /home/ravindu.nagasinghe/.conda/envs/videomae/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -std=c++17 -c /home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/adam -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/TH -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/THC -isystem /home/ravindu.nagasinghe/.conda/envs/videomae/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -std=c++17 -c /home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
ERROR: No supported gcc/g++ host compiler found. Use 'nvcc -ccbin ' to specify a host compiler.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_class_finetuning.py", line 927, in <module>
    main(opts, ds_init)
  File "run_class_finetuning.py", line 727, in main
    model, optimizer, _, _ = ds_init(
  File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 303, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1202, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1264, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 446, in load
    return self.jit_load(verbose)
  File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 489, in jit_load
    op_module = load(name=self.name,
  File "/home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'

Ravindu-Yasas-Nagasinghe commented 1 year ago

I get the above error when running run_class_finetuning.py. I am running it on the 710 dataset.
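The step that actually fails in the log is nvcc reporting "No supported gcc/g++ host compiler found" while JIT-compiling fused_adam. Each CUDA release supports only a certain range of gcc versions, so a first check is to see which nvcc, gcc and g++ the build would pick up from PATH. The following is a minimal sketch of such a check, not part of VideoMAEv2 or DeepSpeed; the helper names find_tool and report_toolchain are made up for illustration.

```python
import shutil
import subprocess


def find_tool(name):
    """Return (path, first line of `--version` output) for a tool on PATH.

    Returns (None, None) when the tool cannot be found.
    """
    path = shutil.which(name)
    if path is None:
        return None, None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    lines = (result.stdout or result.stderr).strip().splitlines()
    return path, (lines[0] if lines else "")


def report_toolchain(tools=("nvcc", "gcc", "g++")):
    """Map each tool name to the (path, version line) that PATH resolves to."""
    return {tool: find_tool(tool) for tool in tools}


if __name__ == "__main__":
    for tool, (path, version) in report_toolchain().items():
        print(f"{tool}: {path or 'not found on PATH'} {version or ''}")
```

If gcc or g++ is missing, or its version is outside what the installed CUDA toolkit supports, that would be consistent with the error above. Note that the log shows the build calling /usr/bin/nvcc, so it is the system toolkit whose compiler requirements matter here.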

congee524 commented 1 year ago

This looks like a DeepSpeed-related error. Try installing DeepSpeed in the correct environment.
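One detail in the traceback bears on the environment question: torch is loaded from /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages, while deepspeed is loaded from the videomae conda environment. Whether that mix contributes to the failure is not established in this thread, but it is easy to confirm where each package resolves from. A minimal sketch, with the hypothetical helper names package_location and locations:

```python
import importlib.util
import sys


def package_location(name):
    """Return the file the import system would load for `name`, or None if absent."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None


def locations(names=("torch", "deepspeed")):
    """Map each package name to the location it resolves to in this interpreter."""
    return {name: package_location(name) for name in names}


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    for name, origin in locations().items():
        print(f"{name}: {origin or 'not installed'}")
```

Run it with the same interpreter that launches run_class_finetuning.py. Paths under ~/.local for one package and under the conda environment for another indicate that user-site packages are shadowing or supplementing the environment.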

Ravindu-Yasas-Nagasinghe commented 1 year ago

DeepSpeed is already installed correctly in the right environment.

import deepspeed
[2023-10-11 15:57:25,427] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)

The issue occurs