abhijeetsourav closed this issue 1 month ago
Your GPU doesn't support bf16. Use fp16 instead.
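For reference, these should be the only lines that need to change in scripts/finetune_lora_vision.sh; a minimal sketch, with flag names copied from the launch log below (the P100 is a Pascal-generation GPU, and both bf16 and tf32 require Ampere or newer):

```bash
# Excerpt of scripts/finetune_lora_vision.sh: switch mixed precision
# from bf16 to fp16 and turn off tf32, since the Pascal-generation
# P100 supports neither bf16 nor tf32. All other flags stay as-is.
    --bf16 False \
    --fp16 True \
    --tf32 False \
```

FlashAttention-2 is likewise Ampere-only, so if the run still fails on a P100 it may also be necessary to set --disable_flash_attn2 True; that flag appears in the launch log, but this thread does not confirm whether it has to change.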
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:
[2024-09-14 11:10:18,321] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 705
[2024-09-14 11:10:48,344] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 706
[2024-09-14 11:10:48,344] [ERROR] [launch.py:322:sigkill_handler] ['/opt/conda/bin/python3.10', '-u', 'src/training/train.py', '--local_rank=1', '--lora_enable', 'True', '--vision_lora', 'True', '--lora_namespan_exclude', "['lm_head', 'embed_tokens']", '--lora_rank', '32', '--lora_alpha', '16', '--lora_dropout', '0.05', '--num_lora_modules', '-1', '--deepspeed', 'scripts/zero3.json', '--model_id', 'microsoft/Phi-3.5-vision-instruct', '--data_path', '/kaggle/working/dataset.json', '--image_folder', '/kaggle/working/dataset/train_images', '--tune_img_projector', 'True', '--freeze_vision_tower', 'False', '--bf16', 'False', '--fp16', 'True', '--disable_flash_attn2', 'False', '--output_dir', 'output/lora_vision_test', '--num_crops', '16', '--num_train_epochs', '2', '--per_device_train_batch_size', '4', '--gradient_accumulation_steps', '4', '--learning_rate', '2e-4', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'False', '--gradient_checkpointing', 'True', '--report_to', 'wandb', '--lazy_preprocess', 'True', '--dataloader_num_workers', '4'] exits with return code = 1
How do I resolve this error, and how do I set up wandb in the code?
You could just press 1 or 2 at the prompt, or just use tensorboard.
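If you want to keep wandb but avoid the interactive prompt (the launcher above was killed while the run sat at "Enter your choice:"), you can authenticate before launching. A minimal sketch for a Kaggle cell; YOUR_WANDB_API_KEY is a placeholder for the key from https://wandb.ai/authorize:

```bash
# Pre-authenticate wandb so the "(1)/(2)/(3)" login prompt never
# appears; wandb picks up the WANDB_API_KEY environment variable.
export WANDB_API_KEY=YOUR_WANDB_API_KEY   # placeholder, not a real key
bash /kaggle/working/Phi3-Vision-Finetune/scripts/finetune_lora_vision.sh
```

Alternatively, edit the script to pass --report_to tensorboard (a standard TrainingArguments value), and wandb is never invoked at all.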
I'm trying to run the following command in Kaggle with a P100 GPU:
!bash /kaggle/working/Phi3-Vision-Finetune/scripts/finetune_lora_vision.sh
Complete error:
[2024-09-14 09:33:24,960] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-14 09:33:28,221] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-09-14 09:33:28,221] [INFO] [runner.py:568:main] cmd = /opt/conda/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None src/training/train.py --lora_enable True --vision_lora True --lora_namespan_exclude ['lm_head', 'embed_tokens'] --lora_rank 32 --lora_alpha 16 --lora_dropout 0.05 --num_lora_modules -1 --deepspeed scripts/zero3.json --model_id microsoft/Phi-3.5-vision-instruct --data_path /kaggle/working/dataset.json --image_folder /kaggle/working/dataset/train_images --tune_img_projector True --freeze_vision_tower False --bf16 True --fp16 False --disable_flash_attn2 False --output_dir output/lora_vision_test --num_crops 16 --num_train_epochs 2 --per_device_train_batch_size 4 --gradient_accumulation_steps 4 --learning_rate 2e-4 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --gradient_checkpointing True --report_to wandb --lazy_preprocess True --dataloader_num_workers 4
[2024-09-14 09:33:29,263] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-14 09:33:32,430] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.20.3-1+cuda12.3
[2024-09-14 09:33:32,430] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.20.3-1
[2024-09-14 09:33:32,430] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.20.3-1
[2024-09-14 09:33:32,430] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2024-09-14 09:33:32,430] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.20.3-1+cuda12.3
[2024-09-14 09:33:32,430] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2024-09-14 09:33:32,430] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.20.3-1
[2024-09-14 09:33:32,430] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2024-09-14 09:33:32,430] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-09-14 09:33:32,430] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-09-14 09:33:32,430] [INFO] [launch.py:163:main] dist_world_size=1
[2024-09-14 09:33:32,430] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-09-14 09:33:32,431] [INFO] [launch.py:253:main] process 389 spawned with command: ['/opt/conda/bin/python3.10', '-u', 'src/training/train.py', '--local_rank=0', '--lora_enable', 'True', '--vision_lora', 'True', '--lora_namespan_exclude', "['lm_head', 'embed_tokens']", '--lora_rank', '32', '--lora_alpha', '16', '--lora_dropout', '0.05', '--num_lora_modules', '-1', '--deepspeed', 'scripts/zero3.json', '--model_id', 'microsoft/Phi-3.5-vision-instruct', '--data_path', '/kaggle/working/dataset.json', '--image_folder', '/kaggle/working/dataset/train_images', '--tune_img_projector', 'True', '--freeze_vision_tower', 'False', '--bf16', 'True', '--fp16', 'False', '--disable_flash_attn2', 'False', '--output_dir', 'output/lora_vision_test', '--num_crops', '16', '--num_train_epochs', '2', '--per_device_train_batch_size', '4', '--gradient_accumulation_steps', '4', '--learning_rate', '2e-4', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--gradient_checkpointing', 'True', '--report_to', 'wandb', '--lazy_preprocess', 'True', '--dataloader_num_workers', '4']
Traceback (most recent call last):
  File "/kaggle/working/Phi3-Vision-Finetune/src/training/train.py", line 225, in <module>
    train()
  File "/kaggle/working/Phi3-Vision-Finetune/src/training/train.py", line 68, in train
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/opt/conda/lib/python3.10/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 148, in __init__
  File "/opt/conda/lib/python3.10/site-packages/transformers/training_args.py", line 1595, in __post_init__
    raise ValueError(
ValueError: Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0
[2024-09-14 09:33:40,439] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 389
[2024-09-14 09:33:40,440] [ERROR] [launch.py:322:sigkill_handler] ['/opt/conda/bin/python3.10', '-u', 'src/training/train.py', '--local_rank=0', '--lora_enable', 'True', '--vision_lora', 'True', '--lora_namespan_exclude', "['lm_head', 'embed_tokens']", '--lora_rank', '32', '--lora_alpha', '16', '--lora_dropout', '0.05', '--num_lora_modules', '-1', '--deepspeed', 'scripts/zero3.json', '--model_id', 'microsoft/Phi-3.5-vision-instruct', '--data_path', '/kaggle/working/dataset.json', '--image_folder', '/kaggle/working/dataset/train_images', '--tune_img_projector', 'True', '--freeze_vision_tower', 'False', '--bf16', 'True', '--fp16', 'False', '--disable_flash_attn2', 'False', '--output_dir', 'output/lora_vision_test', '--num_crops', '16', '--num_train_epochs', '2', '--per_device_train_batch_size', '4', '--gradient_accumulation_steps', '4', '--learning_rate', '2e-4', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--gradient_checkpointing', 'True', '--report_to', 'wandb', '--lazy_preprocess', 'True', '--dataloader_num_workers', '4'] exits with return code = 1