System Info / 系統信息
cuda: 12.1
pytorch: 2.3.1
python: 3.10
gpu: 4x A800 (4*80G)
ubuntu: 22.04
apex is OK
Who can help? / 谁可以帮助到您?
No response
Information / 问题信息
[X] The official example scripts / 官方的示例脚本
[X] My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
I used my own data to finetune it and updated dataset.py.
my dataset.py:

import os
import logging
import random
import jsonlines
import json
from io import BytesIO
from PIL import Image
from torch.utils.data import Dataset
from sat.helpers import print_rank0

captions_file = '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/captions.json'

# Load the captions.json file
with open(captions_file, 'r', encoding='utf-8') as file:
    captions = json.load(file)

# Define a function that looks up and returns the caption for a given image filename
def find_caption_by_filename(filename, captions_dict):
    # Check whether the filename is in captions_dict
    if filename in captions_dict:
        # Return the corresponding caption
        return captions_dict[filename]
    else:
        # If the filename is not present, return None or an error message
        return None  # or "Description not found for this filename."

def find_all_files(path, suffix=".jpg"):
    target_files = []
    for cur_dir, _, files in os.walk(path, followlinks=True):
        for f in files:
            if f.endswith(suffix):
                target_files.append(os.path.join(cur_dir, f))
    print_rank0(f'find {len(target_files)} files...')
    return target_files

class ItemDataset(Dataset):
    def __init__(self, image_processor, text_processor, args, data_dirs, cross_image_processor=None, **kwargs):
        super().__init__()
        self.data = self.load_data(data_dirs)
        self.image_processor, self.text_processor, self.cross_image_processor = image_processor, text_processor, cross_image_processor
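As a quick illustration of how the suffix filter behaves, here is a minimal, self-contained sketch (the temporary filenames are made up for the example); it copies the matching logic from find_all_files above, without the print_rank0 call, to show that files not ending exactly in ".jpg" (for example ".png" or upper-case ".JPG") are silently skipped:

import os, tempfile

def list_by_suffix(path, suffix=".jpg"):        # same filtering logic as find_all_files above
    target_files = []
    for cur_dir, _, files in os.walk(path, followlinks=True):
        for f in files:
            if f.endswith(suffix):
                target_files.append(os.path.join(cur_dir, f))
    return target_files

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "a.jpg"), "w").close()   # hypothetical files
    open(os.path.join(d, "b.png"), "w").close()
    open(os.path.join(d, "c.JPG"), "w").close()
    print(list_by_suffix(d))                      # only .../a.jpg is matched
    print(len(list_by_suffix(d, suffix=".png")))  # 1

If the training images do not end exactly in ".jpg", find_all_files returns an empty list and the dataset ends up with zero samples.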
my script:
#! /bin/bash
export PATH=/GLOBALFS/dhu_mbzhao_1/cuda/bin:$PATH
export LD_LIBRARY_PATH=/GLOBALFS/dhu_mbzhao_1/cuda/lib64:$LD_LIBRARY_PATH

NUM_GPUS_PER_WORKER=4
MP_SIZE=1

script_path=$(realpath $0)
script_dir=$(dirname $script_path)
main_dir=$(dirname $script_dir)
MODEL_TYPE="cogvlm-chat-v1.1"
VERSION="base"
MODEL_ARGS="--from_pretrained $MODEL_TYPE \
    --max_length 1288 \
    --lora_rank 10 \
    --use_lora \
    --local_tokenizer /GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5 \
    --version $VERSION"
# Tips: If training models of resolution 244, you can set --max_length smaller

OPTIONS_SAT="SAT_HOME=/GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models"
OPTIONS_NCCL="NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 LOCAL_WORLD_SIZE=$NUM_GPUS_PER_WORKER"
HOST_FILE_PATH="hostfile"

train_data="./archive_split/train"
valid_data="./archive_split/valid"

gpt_options=" \
    --experiment-name finetune-$MODEL_TYPE \
    --model-parallel-size ${MP_SIZE} \
    --mode finetune \
    --train-iters 800 \
    --resume-dataloader \
    $MODEL_ARGS \
    --train-data ${train_data} \
    --valid-data ${valid_data} \
    --distributed-backend nccl \
    --lr-decay-style cosine \
    --warmup .02 \
    --checkpoint-activations \
    --vit_checkpoint_activations \
    --save-interval 200 \
    --eval-interval 200 \
    --save "./checkpoints" \
    --eval-iters 10 \
    --eval-batch-size 1 \
    --split 1. \
    --deepspeed_config test_config_bf16.json \
    --skip-init \
    --seed 2023 "

run_cmd="${OPTIONS_NCCL} ${OPTIONS_SAT} deepspeed --master_port 16666 --hostfile ${HOST_FILE_PATH} finetune_cogvlm_demo.py ${gpt_options}"
echo ${run_cmd}
eval ${run_cmd}

set +x
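Because --train-data and --valid-data are relative paths, they resolve against the directory the script is launched from (finetune_demo/ in the log below). Here is a minimal pre-flight check I would run, assuming it is started from that same directory; the paths and the ".jpg" suffix are taken from the script and dataset.py above, everything else is illustrative:

import os

# Pre-flight check (assumption: run from the same directory the training
# script is launched from, so the relative paths resolve identically).
for data_dir in ("./archive_split/train", "./archive_split/valid"):
    jpgs = []
    for cur_dir, _, files in os.walk(data_dir, followlinks=True):
        jpgs += [f for f in files if f.endswith(".jpg")]
    print(data_dir, "exists:", os.path.isdir(data_dir), "| jpg files:", len(jpgs))
# "jpg files: 0" for the train directory matches the "find 0 files..." line in the log below.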
Below is the log:
(cogvlm) dhu_mbzhao_1@deeplearning-v191204-deeplearn:~/CogVLM-main/finetune_demo$ sh finetune_cogvlm_lora.sh NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 LOCAL_WORLD_SIZE=4 SAT_HOME=/GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models deepspeed --master_port 16666 --hostfile hostfile finetune_cogvlm_demo.py --experiment-name finetune-cogvlm-chat-v1.1 --model-parallel-size 1 --mode finetune --train-iters 800 --resume-dataloader --from_pretrained cogvlm-chat-v1.1 --max_length 1288 --lora_rank 10 --use_lora --local_tokenizer /GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5 --version base --train-data ./archive_split/train --valid-data ./archive_split/valid --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --vit_checkpoint_activations --save-interval 200 --eval-interval 200 --save ./checkpoints --eval-iters 10 --eval-batch-size 1 --split 1. --deepspeed_config test_config_bf16.json --skip-init --seed 2023 [2024-07-18 15:03:39,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3 [WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible [2024-07-18 15:03:40,797] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2024-07-18 15:03:40,797] [INFO] [runner.py:568:main] cmd = /GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_cogvlm_demo.py --experiment-name finetune-cogvlm-chat-v1.1 --model-parallel-size 1 --mode finetune --train-iters 800 --resume-dataloader --from_pretrained cogvlm-chat-v1.1 --max_length 1288 --lora_rank 10 --use_lora --local_tokenizer /GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5 --version base --train-data ./archive_split/train --valid-data ./archive_split/valid --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --vit_checkpoint_activations --save-interval 200 --eval-interval 200 --save ./checkpoints --eval-iters 10 --eval-batch-size 1 --split 1. --deepspeed_config test_config_bf16.json --skip-init --seed 2023 [2024-07-18 15:03:42,018] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. 
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3 [WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible [2024-07-18 15:03:43,636] [INFO] [launch.py:139:main] 0 NCCL_DEBUG=info [2024-07-18 15:03:43,636] [INFO] [launch.py:139:main] 0 NCCL_IB_DISABLE=0 [2024-07-18 15:03:43,636] [INFO] [launch.py:139:main] 0 NCCL_NET_GDR_LEVEL=2 [2024-07-18 15:03:43,636] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]} [2024-07-18 15:03:43,636] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=4, node_rank=0 [2024-07-18 15:03:43,636] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]}) [2024-07-18 15:03:43,636] [INFO] [launch.py:164:main] dist_world_size=4 [2024-07-18 15:03:43,636] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3 [2024-07-18 15:03:43,637] [INFO] [launch.py:256:main] process 56061 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=0', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023'] [2024-07-18 15:03:43,637] [INFO] [launch.py:256:main] process 56062 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=1', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023'] [2024-07-18 15:03:43,637] [INFO] [launch.py:256:main] process 56063 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=2', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', 
'--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023'] [2024-07-18 15:03:43,638] [INFO] [launch.py:256:main] process 56064 spawned with command: ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=3', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023'] [2024-07-18 15:03:44,906] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-07-18 15:03:44,968] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-07-18 15:03:44,971] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-07-18 15:03:44,972] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. 
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3 [WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3 [WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3 [WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3 [WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible [2024-07-18 15:03:49,192] [INFO] using world size: 4 and model-parallel size: 1 [2024-07-18 15:03:49,192] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128) [2024-07-18 15:03:49,192] [INFO] Will override arguments with manually specified deepspeed_config! [2024-07-18 15:03:49,326] [INFO] [comm.py:637:init_distributed] cdb=None [2024-07-18 15:03:49,331] [INFO] [comm.py:637:init_distributed] cdb=None [2024-07-18 15:03:49,353] [INFO] [comm.py:637:init_distributed] cdb=None [2024-07-18 15:03:49,361] [INFO] [RANK 0] > initializing model parallel with size 1 [2024-07-18 15:03:49,363] [INFO] [comm.py:637:init_distributed] cdb=None [2024-07-18 15:03:49,366] [INFO] [checkpointing.py:1048:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False} [2024-07-18 15:03:49,369] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 4741 and data parallel seed: 2023 [2024-07-18 15:03:49,372] [INFO] [RANK 0] building FineTuneTrainCogVLMModel model ... 
[2024-07-18 15:03:59,465] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 17639685376 [2024-07-18 15:04:54,090] [INFO] [RANK 0] global rank 0 is loading checkpoint /GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models/cogvlm-chat-v1.1/1/mp_rank_00_model_states.pt [2024-07-18 15:05:43,077] [INFO] [RANK 0] > successfully loaded /GLOBALFS/dhu_mbzhao_1/CogVLM-main/.sat_models/cogvlm-chat-v1.1/1/mp_rank_00_model_states.pt [2024-07-18 15:05:44,114] [INFO] [RANK 0] replacing layer 0 attention with lora [2024-07-18 15:05:44,864] [INFO] [RANK 0] replacing layer 1 attention with lora [2024-07-18 15:05:45,654] [INFO] [RANK 0] replacing layer 2 attention with lora [2024-07-18 15:05:46,351] [INFO] [RANK 0] replacing layer 3 attention with lora [2024-07-18 15:05:47,077] [INFO] [RANK 0] replacing layer 4 attention with lora [2024-07-18 15:05:47,871] [INFO] [RANK 0] replacing layer 5 attention with lora [2024-07-18 15:05:48,692] [INFO] [RANK 0] replacing layer 6 attention with lora [2024-07-18 15:05:49,551] [INFO] [RANK 0] replacing layer 7 attention with lora [2024-07-18 15:05:50,375] [INFO] [RANK 0] replacing layer 8 attention with lora [2024-07-18 15:05:51,153] [INFO] [RANK 0] replacing layer 9 attention with lora [2024-07-18 15:05:51,949] [INFO] [RANK 0] replacing layer 10 attention with lora [2024-07-18 15:05:52,892] [INFO] [RANK 0] replacing layer 11 attention with lora [2024-07-18 15:05:53,677] [INFO] [RANK 0] replacing layer 12 attention with lora [2024-07-18 15:05:54,587] [INFO] [RANK 0] replacing layer 13 attention with lora [2024-07-18 15:05:55,295] [INFO] [RANK 0] replacing layer 14 attention with lora [2024-07-18 15:05:56,079] [INFO] [RANK 0] replacing layer 15 attention with lora [2024-07-18 15:05:56,938] [INFO] [RANK 0] replacing layer 16 attention with lora [2024-07-18 15:05:57,762] [INFO] [RANK 0] replacing layer 17 attention with lora [2024-07-18 15:05:58,654] [INFO] [RANK 0] replacing layer 18 attention with lora [2024-07-18 15:05:59,468] [INFO] [RANK 0] replacing layer 19 attention with lora [2024-07-18 15:06:00,300] [INFO] [RANK 0] replacing layer 20 attention with lora [2024-07-18 15:06:01,055] [INFO] [RANK 0] replacing layer 21 attention with lora [2024-07-18 15:06:02,043] [INFO] [RANK 0] replacing layer 22 attention with lora [2024-07-18 15:06:02,786] [INFO] [RANK 0] replacing layer 23 attention with lora [2024-07-18 15:06:03,570] [INFO] [RANK 0] replacing layer 24 attention with lora [2024-07-18 15:06:04,406] [INFO] [RANK 0] replacing layer 25 attention with lora [2024-07-18 15:06:05,249] [INFO] [RANK 0] replacing layer 26 attention with lora [2024-07-18 15:06:06,080] [INFO] [RANK 0] replacing layer 27 attention with lora [2024-07-18 15:06:06,862] [INFO] [RANK 0] replacing layer 28 attention with lora [2024-07-18 15:06:08,048] [INFO] [RANK 0] replacing layer 29 attention with lora [2024-07-18 15:06:08,829] [INFO] [RANK 0] replacing layer 30 attention with lora [2024-07-18 15:06:09,577] [INFO] [RANK 0] replacing layer 31 attention with lora [2024-07-18 15:06:10,367] [INFO] [RANK 0] replacing layer 0 attention with lora [2024-07-18 15:06:10,480] [INFO] [RANK 0] replacing layer 1 attention with lora [2024-07-18 15:06:10,589] [INFO] [RANK 0] replacing layer 2 attention with lora [2024-07-18 15:06:10,832] [INFO] [RANK 0] replacing layer 3 attention with lora [2024-07-18 15:06:11,036] [INFO] [RANK 0] replacing layer 4 attention with lora [2024-07-18 15:06:11,243] [INFO] [RANK 0] replacing layer 5 attention with lora [2024-07-18 15:06:11,437] [INFO] [RANK 0] replacing layer 6 
attention with lora [2024-07-18 15:06:11,644] [INFO] [RANK 0] replacing layer 7 attention with lora [2024-07-18 15:06:11,851] [INFO] [RANK 0] replacing layer 8 attention with lora [2024-07-18 15:06:12,125] [INFO] [RANK 0] replacing layer 9 attention with lora [2024-07-18 15:06:12,333] [INFO] [RANK 0] replacing layer 10 attention with lora [2024-07-18 15:06:12,469] [INFO] [RANK 0] replacing layer 11 attention with lora [2024-07-18 15:06:12,655] [INFO] [RANK 0] replacing layer 12 attention with lora [2024-07-18 15:06:12,857] [INFO] [RANK 0] replacing layer 13 attention with lora [2024-07-18 15:06:13,064] [INFO] [RANK 0] replacing layer 14 attention with lora [2024-07-18 15:06:13,325] [INFO] [RANK 0] replacing layer 15 attention with lora [2024-07-18 15:06:13,541] [INFO] [RANK 0] replacing layer 16 attention with lora [2024-07-18 15:06:13,763] [INFO] [RANK 0] replacing layer 17 attention with lora [2024-07-18 15:06:14,028] [INFO] [RANK 0] replacing layer 18 attention with lora [2024-07-18 15:06:14,241] [INFO] [RANK 0] replacing layer 19 attention with lora [2024-07-18 15:06:14,443] [INFO] [RANK 0] replacing layer 20 attention with lora [2024-07-18 15:06:14,642] [INFO] [RANK 0] replacing layer 21 attention with lora [2024-07-18 15:06:14,843] [INFO] [RANK 0] replacing layer 22 attention with lora [2024-07-18 15:06:15,035] [INFO] [RANK 0] replacing layer 23 attention with lora [2024-07-18 15:06:15,226] [INFO] [RANK 0] replacing layer 24 attention with lora [2024-07-18 15:06:15,443] [INFO] [RANK 0] replacing layer 25 attention with lora [2024-07-18 15:06:15,626] [INFO] [RANK 0] replacing layer 26 attention with lora [2024-07-18 15:06:15,832] [INFO] [RANK 0] replacing layer 27 attention with lora [2024-07-18 15:06:15,997] [INFO] [RANK 0] replacing layer 28 attention with lora [2024-07-18 15:06:16,190] [INFO] [RANK 0] replacing layer 29 attention with lora [2024-07-18 15:06:16,437] [INFO] [RANK 0] replacing layer 30 attention with lora [2024-07-18 15:06:16,639] [INFO] [RANK 0] replacing layer 31 attention with lora [2024-07-18 15:06:16,846] [INFO] [RANK 0] replacing layer 32 attention with lora [2024-07-18 15:06:17,052] [INFO] [RANK 0] replacing layer 33 attention with lora [2024-07-18 15:06:17,250] [INFO] [RANK 0] replacing layer 34 attention with lora [2024-07-18 15:06:17,453] [INFO] [RANK 0] replacing layer 35 attention with lora [2024-07-18 15:06:17,652] [INFO] [RANK 0] replacing layer 36 attention with lora [2024-07-18 15:06:17,926] [INFO] [RANK 0] replacing layer 37 attention with lora [2024-07-18 15:06:18,139] [INFO] [RANK 0] replacing layer 38 attention with lora [2024-07-18 15:06:18,348] [INFO] [RANK 0] replacing layer 39 attention with lora [2024-07-18 15:06:18,540] [INFO] [RANK 0] replacing layer 40 attention with lora [2024-07-18 15:06:18,741] [INFO] [RANK 0] replacing layer 41 attention with lora [2024-07-18 15:06:18,934] [INFO] [RANK 0] replacing layer 42 attention with lora [2024-07-18 15:06:19,126] [INFO] [RANK 0] replacing layer 43 attention with lora [2024-07-18 15:06:19,346] [INFO] [RANK 0] replacing layer 44 attention with lora [2024-07-18 15:06:19,545] [INFO] [RANK 0] replacing layer 45 attention with lora [2024-07-18 15:06:19,745] [INFO] [RANK 0] replacing layer 46 attention with lora [2024-07-18 15:06:19,930] [INFO] [RANK 0] replacing layer 47 attention with lora [2024-07-18 15:06:20,122] [INFO] [RANK 0] replacing layer 48 attention with lora [2024-07-18 15:06:20,327] [INFO] [RANK 0] replacing layer 49 attention with lora [2024-07-18 15:06:20,534] [INFO] [RANK 0] replacing 
layer 50 attention with lora [2024-07-18 15:06:20,733] [INFO] [RANK 0] replacing layer 51 attention with lora [2024-07-18 15:06:20,970] [INFO] [RANK 0] replacing layer 52 attention with lora [2024-07-18 15:06:21,163] [INFO] [RANK 0] replacing layer 53 attention with lora [2024-07-18 15:06:21,424] [INFO] [RANK 0] replacing layer 54 attention with lora [2024-07-18 15:06:21,643] [INFO] [RANK 0] replacing layer 55 attention with lora [2024-07-18 15:06:21,842] [INFO] [RANK 0] replacing layer 56 attention with lora [2024-07-18 15:06:22,030] [INFO] [RANK 0] replacing layer 57 attention with lora [2024-07-18 15:06:22,230] [INFO] [RANK 0] replacing layer 58 attention with lora [2024-07-18 15:06:22,433] [INFO] [RANK 0] replacing layer 59 attention with lora [2024-07-18 15:06:22,580] [INFO] [RANK 0] replacing layer 60 attention with lora [2024-07-18 15:06:22,780] [INFO] [RANK 0] replacing layer 61 attention with lora [2024-07-18 15:06:23,041] [INFO] [RANK 0] replacing layer 62 attention with lora [2024-07-18 15:06:23,776] [INFO] [RANK 0] find 0 files... [2024-07-18 15:06:23,776] [INFO] [RANK 0] find 0 samples in all... [rank3]: Traceback (most recent call last): [rank3]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in
[rank3]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank3]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank3]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank3]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank3]: train = make_dataset(*data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank3]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank3]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank3]: ZeroDivisionError: integer division or modulo by zero
[rank0]: Traceback (most recent call last):
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in <module>
[rank0]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank0]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank0]: train = make_dataset(*data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank0]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank0]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank0]: ZeroDivisionError: integer division or modulo by zero
[rank2]: Traceback (most recent call last):
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in <module>
[rank2]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank2]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank2]: train = make_dataset(*data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank2]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank2]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank2]: ZeroDivisionError: integer division or modulo by zero
[rank1]: Traceback (most recent call last):
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/CogVLM-main/finetune_demo/finetune_cogvlm_demo.py", line 256, in <module>
[rank1]: model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main
[rank1]: train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 201, in make_loaders
[rank1]: train = make_dataset(*data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
[rank1]: File "/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 139, in make_dataset_full
[rank1]: scale = max(200, 1 + (args.train_iters * args.batch_size * args.gradient_accumulation_steps * world_size) // len(ds))
[rank1]: ZeroDivisionError: integer division or modulo by zero
[2024-07-18 15:06:25,946] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56061
[2024-07-18 15:06:25,949] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56062
[2024-07-18 15:06:25,952] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56063
[2024-07-18 15:06:25,952] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56064
[2024-07-18 15:06:25,954] [ERROR] [launch.py:325:sigkill_handler] ['/GLOBALFS/dhu_mbzhao_1/anaconda3/envs/cogvlm/bin/python', '-u', 'finetune_cogvlm_demo.py', '--local_rank=3', '--experiment-name', 'finetune-cogvlm-chat-v1.1', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '800', '--resume-dataloader', '--from_pretrained', 'cogvlm-chat-v1.1', '--max_length', '1288', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/GLOBALFS/dhu_mbzhao_1/CogVLM-main/vicuna-7b-v1.5', '--version', 'base', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023'] exits with return code = 1
The divisor here is 0 (the log shows "find 0 files..." and "find 0 samples in all...", so len(ds) is 0), and I don't know how to solve it. Could someone help me?
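For context, the line that fails in sat/data_utils/configure_data.py divides by len(ds), so an empty dataset produces exactly this error. Below is a minimal sketch reconstructed from the traceback above; batch_size and gradient_accumulation_steps are illustrative assumptions (the real values come from test_config_bf16.json), not values taken from my run:

# Reconstructed from the traceback (make_dataset_full in configure_data.py).
train_iters = 800                 # --train-iters from the script above
world_size = 4                    # dist_world_size=4 in the log
batch_size = 4                    # illustrative assumption
gradient_accumulation_steps = 1   # illustrative assumption

def compute_scale(num_samples):
    # This expression raises ZeroDivisionError when num_samples == 0
    return max(200, 1 + (train_iters * batch_size * gradient_accumulation_steps * world_size) // num_samples)

print(compute_scale(1000))   # fine when the dataset is non-empty
try:
    compute_scale(0)         # "find 0 samples in all..." leads here
except ZeroDivisionError as exc:
    print("ZeroDivisionError:", exc)

So the division by zero is only a symptom; the underlying problem is that the dataset loader found no samples.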
Expected behavior / 期待表现
Finetuning runs successfully.