BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0

train.py: error: the following arguments are required: --output_dir #46

Closed: swhoosh closed this issue 4 months ago

swhoosh commented 4 months ago

I am trying to fine-tune the model for a specific task using my own dataset, which I have already formatted according to the docs. I get a strange error, train.py: error: the following arguments are required: --output_dir, in the subprocesses, even though --output_dir is already in my arguments. Do you have any idea what might be the cause? Thanks!

This is my finetune.sh


MODEL_PATH=/image_text/models/Bunny-v1_0-3B
MODEL_TYPE=phi-2

PRETRAIN_DIR=bunny-$MODEL_TYPE-pretrain
OUTPUT_DIR=bunny-$MODEL_TYPE-test

# JSON LIST
DATA_PATH=image_text/train_list/train_single_image.json
IMAGE_FOLDER=image_text/datasets

mkdir -p ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR

deepspeed bunny/train/train.py \
    --deepspeed ./script/deepspeed/zero3.json \
    --model_name_or_path $MODEL_PATH \
    --model_type $MODEL_TYPE \
    --version bunny \
    --data_path $DATA_PATH \
    --image_folder $IMAGE_FOLDER \
    --vision_tower google/siglip-so400m-patch14-384 \
    # --pretrain_mm_mlp_adapter ./checkpoints-pretrain/$PRETRAIN_DIR/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --image_aspect_ratio pad \
    --group_by_modality_length False \
    --bf16 True \
    --output_dir ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to none | tee 2>&1 ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR/log.txt

This is the error I got.

root@sv:/image_text/Bunny# [2024-04-19 16:06:53,436] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[2024-04-19 16:06:54,635] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-04-19 16:06:54,636] [INFO] [runner.py:568:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None bunny/train/train.py --deepspeed ./script/deepspeed/zero3.json --model_name_or_path BAAI/Bunny-v1_0-3B --model_type phi-2 --version bunny --data_path /image_text/train_list/train_impression_single_image.json --image_folder /image_text/datasets --vision_tower google/siglip-so400m-patch14-384
[2024-04-19 16:06:57,324] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[2024-04-19 16:06:59,716] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.16.2-1
[2024-04-19 16:06:59,716] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.16.2-1
[2024-04-19 16:06:59,717] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2024-04-19 16:06:59,717] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2024-04-19 16:06:59,717] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.16.2-1+cuda11.8
[2024-04-19 16:06:59,717] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.16.2-1+cuda11.8
[2024-04-19 16:06:59,717] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.16.2-1
[2024-04-19 16:06:59,717] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-04-19 16:06:59,717] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-04-19 16:06:59,717] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-04-19 16:06:59,717] [INFO] [launch.py:163:main] dist_world_size=8
[2024-04-19 16:06:59,717] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-04-19 16:06:59,737] [INFO] [launch.py:253:main] process 4574 spawned with command: ['/usr/bin/python3', '-u', 'bunny/train/train.py', '--local_rank=0', '--deepspeed', './script/deepspeed/zero3.json', '--model_name_or_path', 'BAAI/Bunny-v1_0-3B', '--model_type', 'phi-2', '--version', 'bunny', '--data_path', '/image_text/train_list/train_impression_single_image.json', '--image_folder', '/image_text/datasets', '--vision_tower', 'google/siglip-so400m-patch14-384']
[2024-04-19 16:06:59,748] [INFO] [launch.py:253:main] process 4575 spawned with command: ['/usr/bin/python3', '-u', 'bunny/train/train.py', '--local_rank=1', '--deepspeed', './script/deepspeed/zero3.json', '--model_name_or_path', 'BAAI/Bunny-v1_0-3B', '--model_type', 'phi-2', '--version', 'bunny', '--data_path', '/image_text/train_list/train_impression_single_image.json', '--image_folder', '/image_text/datasets', '--vision_tower', 'google/siglip-so400m-patch14-384']
[2024-04-19 16:06:59,761] [INFO] [launch.py:253:main] process 4576 spawned with command: ['/usr/bin/python3', '-u', 'bunny/train/train.py', '--local_rank=2', '--deepspeed', './script/deepspeed/zero3.json', '--model_name_or_path', 'BAAI/Bunny-v1_0-3B', '--model_type', 'phi-2', '--version', 'bunny', '--data_path', '/image_text/train_list/train_impression_single_image.json', '--image_folder', '/image_text/datasets', '--vision_tower', 'google/siglip-so400m-patch14-384']
[2024-04-19 16:06:59,773] [INFO] [launch.py:253:main] process 4577 spawned with command: ['/usr/bin/python3', '-u', 'bunny/train/train.py', '--local_rank=3', '--deepspeed', './script/deepspeed/zero3.json', '--model_name_or_path', 'BAAI/Bunny-v1_0-3B', '--model_type', 'phi-2', '--version', 'bunny', '--data_path', '/image_text/train_list/train_impression_single_image.json', '--image_folder', '/image_text/datasets', '--vision_tower', 'google/siglip-so400m-patch14-384']
[2024-04-19 16:06:59,791] [INFO] [launch.py:253:main] process 4579 spawned with command: ['/usr/bin/python3', '-u', 'bunny/train/train.py', '--local_rank=4', '--deepspeed', './script/deepspeed/zero3.json', '--model_name_or_path', 'BAAI/Bunny-v1_0-3B', '--model_type', 'phi-2', '--version', 'bunny', '--data_path', '/image_text/train_list/train_impression_single_image.json', '--image_folder', '/image_text/datasets', '--vision_tower', 'google/siglip-so400m-patch14-384']
[2024-04-19 16:06:59,810] [INFO] [launch.py:253:main] process 4581 spawned with command: ['/usr/bin/python3', '-u', 'bunny/train/train.py', '--local_rank=5', '--deepspeed', './script/deepspeed/zero3.json', '--model_name_or_path', 'BAAI/Bunny-v1_0-3B', '--model_type', 'phi-2', '--version', 'bunny', '--data_path', '/image_text/train_list/train_impression_single_image.json', '--image_folder', '/image_text/datasets', '--vision_tower', 'google/siglip-so400m-patch14-384']
[2024-04-19 16:06:59,829] [INFO] [launch.py:253:main] process 4584 spawned with command: ['/usr/bin/python3', '-u', 'bunny/train/train.py', '--local_rank=6', '--deepspeed', './script/deepspeed/zero3.json', '--model_name_or_path', 'BAAI/Bunny-v1_0-3B', '--model_type', 'phi-2', '--version', 'bunny', '--data_path', '/image_text/train_list/train_impression_single_image.json', '--image_folder', '/image_text/datasets', '--vision_tower', 'google/siglip-so400m-patch14-384']
[2024-04-19 16:06:59,848] [INFO] [launch.py:253:main] process 4586 spawned with command: ['/usr/bin/python3', '-u', 'bunny/train/train.py', '--local_rank=7', '--deepspeed', './script/deepspeed/zero3.json', '--model_name_or_path', 'BAAI/Bunny-v1_0-3B', '--model_type', 'phi-2', '--version', 'bunny', '--data_path', '/image_text/train_list/train_impression_single_image.json', '--image_folder', '/image_text/datasets', '--vision_tower', 'google/siglip-so400m-patch14-384']
usage: train.py [-h] [--model_name_or_path MODEL_NAME_OR_PATH] [--model_type MODEL_TYPE]
                [--version VERSION] [--freeze_backbone [FREEZE_BACKBONE]]
                [--tune_mm_mlp_adapter [TUNE_MM_MLP_ADAPTER]] [--vision_tower VISION_TOWER]
                [--pretrain_mm_mlp_adapter PRETRAIN_MM_MLP_ADAPTER]
                [--mm_projector_type MM_PROJECTOR_TYPE] [--data_path DATA_PATH]
                [--lazy_preprocess [LAZY_PREPROCESS]] [--is_multimodal [IS_MULTIMODAL]]
                [--no_is_multimodal] [--image_folder IMAGE_FOLDER]
                [--image_aspect_ratio IMAGE_ASPECT_RATIO] --output_dir OUTPUT_DIR
                [--overwrite_output_dir [OVERWRITE_OUTPUT_DIR]] [--do_train [DO_TRAIN]]
                [--do_eval [DO_EVAL]] [--do_predict [DO_PREDICT]]
                [--evaluation_strategy {no,steps,epoch}]
                [--prediction_loss_only [PREDICTION_LOSS_ONLY]]
                [--per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE]
                [--per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE]
                [--per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE]
                [--per_gpu_eval_batch_size PER_GPU_EVAL_BATCH_SIZE]
                [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
                [--eval_accumulation_steps EVAL_ACCUMULATION_STEPS] [--eval_delay EVAL_DELAY]
                [--learning_rate LEARNING_RATE] [--weight_decay WEIGHT_DECAY]
                [--adam_beta1 ADAM_BETA1] [--adam_beta2 ADAM_BETA2]
                [--adam_epsilon ADAM_EPSILON] [--max_grad_norm MAX_GRAD_NORM]
                [--num_train_epochs NUM_TRAIN_EPOCHS] [--max_steps MAX_STEPS]
                [--lr_scheduler_type {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup,inverse_sqrt,reduce_lr_on_plateau}]
                [--lr_scheduler_kwargs LR_SCHEDULER_KWARGS] [--warmup_ratio WARMUP_RATIO]
                [--warmup_steps WARMUP_STEPS]
                [--log_level {detail,debug,info,warning,error,critical,passive}]
                [--log_level_replica {detail,debug,info,warning,error,critical,passive}]
                [--log_on_each_node [LOG_ON_EACH_NODE]] [--no_log_on_each_node]
                [--logging_dir LOGGING_DIR] [--logging_strategy {no,steps,epoch}]
                [--logging_first_step [LOGGING_FIRST_STEP]] [--logging_steps LOGGING_STEPS]
                [--logging_nan_inf_filter [LOGGING_NAN_INF_FILTER]]
                [--no_logging_nan_inf_filter] [--save_strategy {no,steps,epoch}]
                [--save_steps SAVE_STEPS] [--save_total_limit SAVE_TOTAL_LIMIT]
                [--save_safetensors [SAVE_SAFETENSORS]] [--no_save_safetensors]
                [--save_on_each_node [SAVE_ON_EACH_NODE]]
                [--save_only_model [SAVE_ONLY_MODEL]] [--no_cuda [NO_CUDA]]
                [--use_cpu [USE_CPU]] [--use_mps_device [USE_MPS_DEVICE]] [--seed SEED]
                [--data_seed DATA_SEED] [--jit_mode_eval [JIT_MODE_EVAL]]
                [--use_ipex [USE_IPEX]] [--bf16 [BF16]] [--fp16 [FP16]]
                [--fp16_opt_level FP16_OPT_LEVEL]
                [--half_precision_backend {auto,apex,cpu_amp}]
                [--bf16_full_eval [BF16_FULL_EVAL]] [--fp16_full_eval [FP16_FULL_EVAL]]
                [--tf32 TF32] [--local_rank LOCAL_RANK]
                [--ddp_backend {nccl,gloo,mpi,ccl,hccl}] [--tpu_num_cores TPU_NUM_CORES]
                [--tpu_metrics_debug [TPU_METRICS_DEBUG]] [--debug DEBUG [DEBUG ...]]
                [--dataloader_drop_last [DATALOADER_DROP_LAST]] [--eval_steps EVAL_STEPS]
                [--dataloader_num_workers DATALOADER_NUM_WORKERS]
                [--dataloader_prefetch_factor DATALOADER_PREFETCH_FACTOR]
                [--past_index PAST_INDEX] [--run_name RUN_NAME] [--disable_tqdm DISABLE_TQDM]
                [--remove_unused_columns [REMOVE_UNUSED_COLUMNS]]
                [--label_names LABEL_NAMES [LABEL_NAMES ...]]
                [--load_best_model_at_end [LOAD_BEST_MODEL_AT_END]]
                [--metric_for_best_model METRIC_FOR_BEST_MODEL]
                [--greater_is_better GREATER_IS_BETTER]
                [--ignore_data_skip [IGNORE_DATA_SKIP]] [--fsdp FSDP]
                [--fsdp_min_num_params FSDP_MIN_NUM_PARAMS] [--fsdp_config FSDP_CONFIG]
                [--fsdp_transformer_layer_cls_to_wrap FSDP_TRANSFORMER_LAYER_CLS_TO_WRAP]
                [--accelerator_config ACCELERATOR_CONFIG] [--deepspeed DEEPSPEED]
                [--label_smoothing_factor LABEL_SMOOTHING_FACTOR] [--optim OPTIM]
                [--optim_args OPTIM_ARGS] [--adafactor [ADAFACTOR]]
                [--group_by_length [GROUP_BY_LENGTH]]
                [--length_column_name LENGTH_COLUMN_NAME]
                [--report_to REPORT_TO [REPORT_TO ...]]
                [--ddp_find_unused_parameters DDP_FIND_UNUSED_PARAMETERS]
                [--ddp_bucket_cap_mb DDP_BUCKET_CAP_MB]
                [--ddp_broadcast_buffers DDP_BROADCAST_BUFFERS]
                [--dataloader_pin_memory [DATALOADER_PIN_MEMORY]]
                [--no_dataloader_pin_memory]
                [--dataloader_persistent_workers [DATALOADER_PERSISTENT_WORKERS]]
                [--skip_memory_metrics [SKIP_MEMORY_METRICS]] [--no_skip_memory_metrics]
                [--use_legacy_prediction_loop [USE_LEGACY_PREDICTION_LOOP]]
                [--push_to_hub [PUSH_TO_HUB]]
                [--resume_from_checkpoint RESUME_FROM_CHECKPOINT]
                [--hub_model_id HUB_MODEL_ID]
                [--hub_strategy {end,every_save,checkpoint,all_checkpoints}]
                [--hub_token HUB_TOKEN] [--hub_private_repo [HUB_PRIVATE_REPO]]
                [--hub_always_push [HUB_ALWAYS_PUSH]]
                [--gradient_checkpointing [GRADIENT_CHECKPOINTING]]
                [--gradient_checkpointing_kwargs GRADIENT_CHECKPOINTING_KWARGS]
                [--include_inputs_for_metrics [INCLUDE_INPUTS_FOR_METRICS]]
                [--fp16_backend {auto,apex,cpu_amp}]
                [--push_to_hub_model_id PUSH_TO_HUB_MODEL_ID]
                [--push_to_hub_organization PUSH_TO_HUB_ORGANIZATION]
                [--push_to_hub_token PUSH_TO_HUB_TOKEN] [--mp_parameters MP_PARAMETERS]
                [--auto_find_batch_size [AUTO_FIND_BATCH_SIZE]]
                [--full_determinism [FULL_DETERMINISM]] [--torchdynamo TORCHDYNAMO]
                [--ray_scope RAY_SCOPE] [--ddp_timeout DDP_TIMEOUT]
                [--torch_compile [TORCH_COMPILE]]
                [--torch_compile_backend TORCH_COMPILE_BACKEND]
                [--torch_compile_mode TORCH_COMPILE_MODE]
                [--dispatch_batches DISPATCH_BATCHES] [--split_batches SPLIT_BATCHES]
                [--include_tokens_per_second [INCLUDE_TOKENS_PER_SECOND]]
                [--include_num_input_tokens_seen [INCLUDE_NUM_INPUT_TOKENS_SEEN]]
                [--neftune_noise_alpha NEFTUNE_NOISE_ALPHA]
                [--optim_target_modules OPTIM_TARGET_MODULES] [--cache_dir CACHE_DIR]
                [--freeze_mm_mlp_adapter [FREEZE_MM_MLP_ADAPTER]]
                [--mpt_attn_impl MPT_ATTN_IMPL] [--model_max_length MODEL_MAX_LENGTH]
                [--double_quant [DOUBLE_QUANT]] [--no_double_quant] [--quant_type QUANT_TYPE]
                [--bits BITS] [--lora_enable [LORA_ENABLE]] [--lora_r LORA_R]
                [--lora_alpha LORA_ALPHA] [--lora_dropout LORA_DROPOUT]
                [--lora_weight_path LORA_WEIGHT_PATH] [--lora_bias LORA_BIAS]
                [--mm_projector_lr MM_PROJECTOR_LR]
                [--group_by_modality_length [GROUP_BY_MODALITY_LENGTH]]
train.py: error: the following arguments are required: --output_dir
...
(the usage message and error above are repeated for all 8 subprocesses)
...
[2024-04-19 16:07:06,856] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 4574
[2024-04-19 16:07:06,858] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 4575
[2024-04-19 16:07:06,859] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 4576
[2024-04-19 16:07:06,859] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 4577
[2024-04-19 16:07:06,860] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 4579
[2024-04-19 16:07:06,860] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 4581
[2024-04-19 16:07:06,861] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 4584
[2024-04-19 16:07:06,861] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 4586
[2024-04-19 16:07:06,861] [ERROR] [launch.py:322:sigkill_handler] ['/usr/bin/python3', '-u', 'bunny/train/train.py', '--local_rank=7', '--deepspeed', './script/deepspeed/zero3.json', '--model_name_or_path', 'BAAI/Bunny-v1_0-3B', '--model_type', 'phi-2', '--version', 'bunny', '--data_path', '/image_text/train_list/train_impression_single_image.json', '--image_folder', '/image_text/datasets', '--vision_tower', 'google/siglip-so400m-patch14-384'] exits with return code = 2
script/train/finetune_full_baseline.sh: 25: --mm_projector_
swhoosh commented 4 months ago

Finally figured it out. I had commented out one of the Python arguments in the middle of the multi-line command in my bash script; deleting the commented-out line fixed it.
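
For reference, a minimal sketch of why the commented-out line broke the script (printf here is only a hypothetical stand-in for the deepspeed command, and foo.json / ./out are placeholder values): in bash, a # comment runs to the end of its line, so the trailing backslash on the commented line is ignored and the command ends there. Everything on the following lines, including --output_dir, is then parsed as separate shell commands instead of being passed to train.py. This matches the log above: the spawned train.py processes only receive arguments up to --vision_tower, and the shell then tries to execute the orphaned --mm_projector_type line as a command.

#!/bin/bash
# Sketch only: printf stands in for the real deepspeed/train.py invocation.

# Broken: the '#' line terminates the command, so only the arguments up to
# foo.json reach printf; the next line runs as its own command and fails
# with "command not found".
printf 'arg: %s\n' \
    --data_path foo.json \
    # --pretrain_mm_mlp_adapter mm_projector.bin \
    --output_dir ./out

# Fixed: the commented-out flag is deleted (any such note belongs above the
# command, not between continued lines), so --output_dir is passed through.
printf 'arg: %s\n' \
    --data_path foo.json \
    --output_dir ./out

Applied to the finetune.sh above, deleting the commented-out --pretrain_mm_mlp_adapter line (or moving the note above the deepspeed command) keeps --output_dir in the argument list.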