Meituan-AutoML / MobileVLM

Strong and Open Vision Language Assistant for Mobile Devices
Apache License 2.0

how to finetune on user's own data #11

Closed 77h2l closed 4 months ago

77h2l commented 6 months ago

Hi all, thanks for your work. I would like to know how to finetune on a self-made dataset, similar to what Qwen-VL supports here: https://github.com/QwenLM/Qwen-VL/blob/master/README_CN.md#%E5%BE%AE%E8%B0%83. We have tested the pretrained MobileVLM model on a domain-specific dataset and the results were not promising enough, so we want to try finetuning on our own data. Thanks.

weifei7 commented 4 months ago

Sorry for the late reply. To finetune on your own dataset, run this script:

bash run.sh mobilevlm_v2_1.7b finetune ${LANGUAGE_MODEL} ${VISION_MODEL} ${OUTPUT_DIR}

Then check --data_path and --image_folder in run.sh; they should point to your own dataset. When you build your dataset, put all data samples in a list and store them in a JSON file (the format is the same as llava/sharegpt4v). Each sample is a dict containing "id", "image" and "conversations" (text-only samples omit "image"). Here are some examples; a short Python sketch for generating such a file follows them:

[
# Example 1: multi_modal_data
{
'id': 37,
'image': 'coco/train2017/000000203564.jpg',
'conversations': [
    {'from': 'human', 'value': '<image>\nWhat are the key elements in this picture?'}, 
    {'from': 'gpt', 'value': 'A bicycle replica with a clock as the front wheel.'}
    ]
},
# Example 2: text_only_data
{
'id': 0,
'conversations': [
    {'from': 'human', 'value': 'Give three tips for staying healthy.'}, 
    {'from': 'gpt', 'value': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
    ]
},
# Example 3: multi_modal_dialog_data
{
'id': 287140, 
'image': 'coco/train2014/COCO_train2014_000000287140.jpg', 
'conversations': [
    {'from': 'human', 'value': '<image>\nhow many bikes there'}, 
    {'from': 'gpt', 'value': '3'}, 
    {'from': 'human', 'value': 'what color are bikes'}, 
    {'from': 'gpt', 'value': 'i see green red and white'}, 
    {'from': 'human', 'value': 'are they parked on stock parking'}, 
    {'from': 'gpt', 'value': 'no'}, 
    {'from': 'human', 'value': 'are there any people'}, 
    {'from': 'gpt', 'value': '2'}, 
    {'from': 'human', 'value': 'what are their genders'}, 
    {'from': 'gpt', 'value': "can't tell have on helmets"}, 
    {'from': 'human', 'value': 'what color of their helmets'}, 
    {'from': 'gpt', 'value': 'green and white and other is black'}, 
    {'from': 'human', 'value': 'do they pose to picture'}, 
    {'from': 'gpt', 'value': 'no'}, 
    {'from': 'human', 'value': 'are they standing by their bikes'}, 
    {'from': 'gpt', 'value': 'no'}, 
    {'from': 'human', 'value': 'what are they doing'}, 
    {'from': 'gpt', 'value': 'sitting on bikes'}, 
    {'from': 'human', 'value': 'white time of day is it'}, 
    {'from': 'gpt', 'value': 'looks like afternoon'}
    ]
},
... ...
]
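
For reference, a minimal Python sketch of how such a file could be generated (the raw records, file name, and variable names below are only illustrative, not part of the MobileVLM code); note that json.dump writes double-quoted keys, which the training code expects:

import json

# Illustrative raw records: (image path relative to --image_folder, question, answer).
# A None image path marks a text-only sample.
raw_records = [
    ("coco/train2017/000000203564.jpg",
     "What are the key elements in this picture?",
     "A bicycle replica with a clock as the front wheel."),
    (None,
     "Give three tips for staying healthy.",
     "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."),
]

samples = []
for idx, (image, question, answer) in enumerate(raw_records):
    sample = {
        "id": idx,
        "conversations": [
            # Prepend the <image> token only when the sample actually has an image.
            {"from": "human", "value": f"<image>\n{question}" if image else question},
            {"from": "gpt", "value": answer},
        ],
    }
    if image:
        sample["image"] = image
    samples.append(sample)

# Write the list to a single JSON file, then point --data_path at it.
with open("my_finetune_data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)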
QvQKing commented 4 months ago

(Quoted weifei7's reply above.)

When I used my own data to finetune, I ran into the problem below. How can I solve it? Thanks!

Start Visual-Instruction Tuning ...
[2024-02-27 15:10:04,264] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-27 15:10:05,804] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-02-27 15:10:05,804] [INFO] [runner.py:555:main] cmd = /usr/local/miniconda3/envs/mobilevlm/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None mobilevlm/train/train_mem.py --deepspeed scripts/deepspeed/zero2.json --model_name_or_path mtgv/MobileVLM_V2-1.7B --version v1 --data_path data/finetune_data/new-coco.json --image_folder data/finetune_data/coco/train2017 --vision_tower mtgv/clip-vit-large-patch14-336 --vision_tower_type clip --pretrain_mm_mlp_adapter finetune-results/mobilevlm-1.pretrain/mm_projector.bin --mm_projector_type ldpnet --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --group_by_modality_length True --bf16 True --output_dir finetune-results/mobilevlm-2.finetune --num_train_epochs 1 --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 50000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to none
[2024-02-27 15:10:08,702] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2024-02-27 15:10:08,702] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-02-27 15:10:08,702] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-02-27 15:10:08,702] [INFO] [launch.py:163:main] dist_world_size=1
[2024-02-27 15:10:08,702] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-02-27 15:10:12,069] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-02-27 15:10:12,069] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-02-27 15:10:12,070] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Formatting inputs...Skip in lazy mode
Rank: 0 partition count [1, 1] and sizes[(1383208960, False), (25600, False)]
  0%| | 0/61921 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/hy-tmp/MobileVLM-main/mobilevlm/train/train_mem.py", line 13, in <module>
    train()
  File "/hy-tmp/MobileVLM-main/mobilevlm/train/train.py", line 893, in train
    trainer.train()
  File "/usr/local/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/usr/local/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/transformers/trainer.py", line 1813, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/accelerate/data_loader.py", line 381, in __iter__
    dataloader_iter = super().__iter__()
  File "/usr/local/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
    return self._get_iterator()
  File "/usr/local/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/usr/local/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1084, in __init__
    self._reset(loader, first_iter=True)
  File "/usr/local/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1117, in _reset
    self._try_put_index()
  File "/usr/local/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1351, in _try_put_index
    index = self._next_index()
  File "/usr/local/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 623, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/usr/local/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/torch/utils/data/sampler.py", line 254, in __iter__
    for idx in self.sampler:
  File "/hy-tmp/MobileVLM-main/mobilevlm/train/trainer.py", line 126, in __iter__
    indices = get_modality_length_grouped_indices(self.lengths, self.batch_size, self.world_size, generator=self.generator)
  File "/hy-tmp/MobileVLM-main/mobilevlm/train/trainer.py", line 62, in get_modality_length_grouped_indices
    assert lang_indices, "Should have at least one language sample."
AssertionError: Should have at least one language sample.
  0%| | 0/61921 [00:04<?, ?it/s]
[2024-02-27 15:11:32,794] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1369
[2024-02-27 15:11:32,794] [ERROR] [launch.py:321:sigkill_handler] ['/usr/local/miniconda3/envs/mobilevlm/bin/python', '-u', 'mobilevlm/train/train_mem.py', '--local_rank=0', ... (same arguments as the cmd above) ...] exits with return code = 1
Done.

(mobilevlm) root@I182fc8e40600401a67:/hy-tmp/MobileVLM-main# bash run.sh mobilevlm_v2_1.7b finetune
Start Visual-Instruction Tuning ...
[launch output identical to the first run, except --output_dir finetune-results]
Traceback (most recent call last):
  [same frames as above through sampler.py, then]
  File "/hy-tmp/MobileVLM-main/mobilevlm/train/trainer.py", line 106, in __iter__
    indices = get_modality_length_grouped_indices(self.lengths, self.batch_size, self.world_size, generator=self.generator)
  File "/hy-tmp/MobileVLM-main/mobilevlm/train/trainer.py", line 39, in get_modality_length_grouped_indices
    lang_indices, lang_lengths = zip(*[(i, -l) for i, l in enumerate(lengths) if l < 0])
ValueError: not enough values to unpack (expected 2, got 0)
  0%| | 0/61921 [00:04<?, ?it/s]
[2024-02-27 15:25:08,038] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2245
[2024-02-27 15:25:08,038] [ERROR] [launch.py:321:sigkill_handler] [... same command line as above ...] exits with return code = 1
Done.

weifei7 commented 4 months ago

(Quoted QvQKing's comment above, including the full error log.)

In your training dataset, you should have at least one language sample (text_only_data). Reference function: https://github.com/Meituan-AutoML/MobileVLM/blob/main/mobilevlm/train/trainer.py#L37. Also note that attribute names in your JSON file must be surrounded by double quotation marks ('name' -> "name").
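
For anyone hitting the same errors, a small sanity-check sketch along these lines can catch both problems before launching training (illustrative only; set data_path to whatever you pass as --data_path):

import json

data_path = "data/finetune_data/new-coco.json"  # placeholder; use your own --data_path

# json.load fails here if the file uses single-quoted keys instead of valid JSON.
with open(data_path, "r", encoding="utf-8") as f:
    samples = json.load(f)

text_only = [s for s in samples if "image" not in s]
multimodal = [s for s in samples if "image" in s]
print(f"{len(multimodal)} multimodal samples, {len(text_only)} text-only samples")

# The modality-length-grouped sampler used with --group_by_modality_length True
# requires at least one language-only sample (see trainer.py above), so fail early here.
assert text_only, "Add at least one text-only sample (no 'image' key) to the dataset."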

Muhammad4hmed commented 3 months ago

@weifei7

In your training dataset, you should have at least one language sample (text_only_data).

Can you please provide a JSON example?

Bobby-youngking commented 3 months ago

I've solved this problem. You can check the provided JSON file: the example is at the end of it.