Closed: Deaddawn closed this issue 11 months ago
With the single-GPU script:
```bash
deepspeed --include localhost:0 llamavid/train/train_mem.py \
    --deepspeed ./scripts/zero2_offload.json \
    --model_name_or_path ./work_dirs/llama-vid-7b-full-224-video-fps-1 \
    --version imgsp_v1 \
    --data_path /remote-home/LLaMA-VID-main/data/LLaMA-VID-Finetune/long_videoqa_part.json \
    --image_folder ./data/LLaMA-VID-Finetune \
    --video_folder ./data/LLaMA-VID-Finetune \
    --vision_tower ./model_zoo/LAVIS/eva_vit_g.pth \
    --image_processor ./llamavid/processor/clip-patch14-224 \
    --pretrain_mm_mlp_adapter ./work_dirs/llama-vid-7b-pretrain-224-video-fps-1/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length False \
    --video_fps 1 \
    --video_token 2 \
    --bert_type "qformer_pretrain_freeze_all" \
    --num_query 32 \
    --compress_type "mean" \
    --fp16 True \
    --bf16 False \
    --output_dir ./work_dirs/llama-vid-7b-full-224-long-video \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 65536 \
    --gradient_checkpointing True \
    --dataloader_num_workers 1 \
    --lazy_preprocess True \
    --report_to wandb
```
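For context, `--deepspeed ./scripts/zero2_offload.json` enables ZeRO stage 2 with the optimizer offloaded to CPU, which is what triggers the `cpu_adam` extension build in the logs below and what puts the optimizer states in host RAM. I have not pasted the repo's actual file; the sketch below is only what a minimal ZeRO-2 CPU-offload config typically looks like, and the example path and field values are my own assumptions:

```bash
# Hypothetical sketch of a ZeRO-2 CPU-offload config; the repo's real
# scripts/zero2_offload.json may use different fields and values.
cat > ./scripts/zero2_offload.example.json <<'EOF'
{
  "fp16": { "enabled": "auto" },
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
EOF
```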
I get this error:
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Freezing all qformer weights...
Loading pretrained weights...
Loading vlm_att_query weights...
Loading vlm_att_ln weights...
Formatting inputs...Skip in lazy mode
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.651790142059326 seconds
Rank: 0 partition count [1, 1] and sizes[(6760960000, False), (8192, False)]
wandb: Tracking run with wandb version 0.16.1
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
0%| | 0/8920 [00:00<?, ?it/s]Traceback (most recent call last):
File "/remote-home/songzhende/LLaMA-VID-main/llamavid/train/train_mem.py", line 13, in
It works just fine without using DeepSpeed.
Hi, I guess the memory (125G) may not be enough for DeepSpeed zero2_offload. Feel free to reopen this issue if you need further support.
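As a rough sanity check on that hypothesis (a back-of-envelope estimate of my own, not a measurement): with ZeRO-2 optimizer offload, the FP32 master weights plus both Adam moments are kept in host RAM, roughly 12 bytes per parameter, and the single-GPU log above reports a partition of 6,760,960,000 parameters.

```bash
# Back-of-envelope CPU-RAM estimate for ZeRO-2 optimizer offload.
# Assumption: ~12 bytes/param = fp32 master weights + Adam exp_avg + exp_avg_sq.
PARAMS=6760960000
echo "optimizer states: ~$(( PARAMS * 12 / 1024**3 )) GiB of host RAM"   # ~75 GiB
# Pinned communication buffers, offloaded gradient copies, the dataloader and
# checkpoint loading come on top of this, so 125 GB of host memory can
# plausibly be exhausted, at which point the kernel OOM killer SIGKILLs the workers.
```

If that is indeed the bottleneck, the usual options are more host RAM or swap, a shorter `--model_max_length`, or a config without `offload_optimizer` (at the cost of GPU memory).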
> Hi, I guess the memory (125G) may not be enough for DeepSpeed zero2_offload. Feel free to reopen this issue if you need further support.
Yes, indeed. Could you share your hardware setup for these experiments, for reference?
This is what I got when trying to fine-tune on long video with stage_3_full_v7b_224_longvid.sh, using two V100s (32G) and 125G of CPU memory.
ERROR INFO:
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Freezing all qformer weights...
Freezing all qformer weights...
Loading pretrained weights...
Loading pretrained weights...
Loading vlm_att_query weights...
Loading vlm_att_ln weights...
Loading vlm_att_query weights...
Loading vlm_att_ln weights...
Formatting inputs...Skip in lazy mode
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu117/cpu_adam...
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda-11.7/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/miniconda/envs/llamavid/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-11.7/include -isystem /root/miniconda/envs/llamavid/lib/python3.10/site-packages/torch/include -isystem /root/miniconda/envs/llamavid/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda/envs/llamavid/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda/envs/llamavid/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-11.7/include -isystem /root/miniconda/envs/llamavid/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /root/miniconda/envs/llamavid/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/miniconda/envs/llamavid/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-11.7/include -isystem /root/miniconda/envs/llamavid/lib/python3.10/site-packages/torch/include -isystem /root/miniconda/envs/llamavid/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda/envs/llamavid/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda/envs/llamavid/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-11.7/include -isystem /root/miniconda/envs/llamavid/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda-11.7/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -c /root/miniconda/envs/llamavid/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/root/miniconda/envs/llamavid/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda-11.7/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 45.298378467559814 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 45.36060833930969 seconds
Rank: 0 partition count [2, 2] and sizes[(3380480000, False), (4096, False)]
Rank: 1 partition count [2, 2] and sizes[(3380480000, False), (4096, False)]
[2023-12-29 02:26:03,598] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 202861
[2023-12-29 02:26:09,389] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 202862
[2023-12-29 02:26:09,389] [ERROR] [launch.py:321:sigkill_handler] ['/root/miniconda/envs/llamavid/bin/python3.1', '-u', 'llamavid/train/train_mem.py', '--local_rank=1', '--deepspeed', './scripts/zero2_offload.json', '--model_name_or_path', './work_dirs/llama-vid-7b-full-224-video-fps-1', '--version', 'imgsp_v1', '--data_path', '/remote-home/songzhende/LLaMA-VID-main/data/LLaMA-VID-Finetune/long_videoqa_part.json', '--image_folder', './data/LLaMA-VID-Finetune', '--video_folder', './data/LLaMA-VID-Finetune', '--vision_tower', './model_zoo/LAVIS/eva_vit_g.pth', '--image_processor', './llamavid/processor/clip-patch14-224', '--pretrain_mm_mlp_adapter', './work_dirs/llama-vid-7b-pretrain-224-video-fps-1/mm_projector.bin', '--mm_projector_type', 'mlp2x_gelu', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--image_aspect_ratio', 'pad', '--group_by_modality_length', 'False', '--video_fps', '1', '--video_token', '2', '--bert_type', 'qformer_pretrain_freeze_all', '--num_query', '32', '--compress_type', 'mean', '--fp16', 'True', '--bf16', 'False', '--output_dir', './work_dirs/llama-vid-7b-full-224-long-video', '--num_train_epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '5000', '--save_total_limit', '1', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'False', '--model_max_length', '65536', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '1', '--lazy_preprocess', 'True', '--report_to', 'wandb'] exits with return code = -9
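For what it's worth, `exits with return code = -9` means the workers were killed with SIGKILL rather than raising a Python exception, which is consistent with the host running out of memory as suggested above. A generic way to confirm on the machine (this is not output from this run):

```bash
# Check the kernel log for OOM-killer activity around the crash time
# (usually needs root; journalctl is an alternative where raw dmesg is restricted).
dmesg -T | grep -iE "out of memory|oom-kill|killed process" | tail -n 20
journalctl -k --since "2023-12-29 02:00" | grep -i oom   # timestamp taken from the log above
```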