BAAI-DCAI / M3D

M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models

The strange loss from the second step during training #23

Status: Open · opened by zhi-xuan-chen 2 weeks ago

zhi-xuan-chen commented 2 weeks ago

Hello, I found a strange loss during training, as shown below. (screenshot of the training loss)

The loss at the first step is normal, but it becomes 0 from the second step onward. I only enabled gradient checkpointing to save memory.

Here are my settings:

```
model_args: ModelArguments(version='v0', model_name_or_path='/jhcnas5/chenzhixuan/checkpoints/Llama-2-7b-chat-hf', model_type='llama2', freeze_backbone=True, pretrain_mllm=None, tune_mm_mlp_adapter=False, pretrain_mm_mlp_adapter=None, image_channel=1, image_size=(32, 256, 256), patch_size=(4, 16, 16), vision_tower='vit3d', vision_select_layer=-1, vision_select_feature='patch', pretrain_vision_model=None, freeze_vision_tower=False, mm_projector_type='spp', proj_layer_type='mlp', proj_layer_num=2, proj_pooling_type='spatial', proj_pooling_size=2, segmentation_module=None, pretrain_seg_module=None)
```

```
data_args: DataArguments(data_folder='/data/chenzhixuan/data/RadGenome-ChestCT/dataset/valid_preprocessed', mask_folder='/data/chenzhixuan/data/RadGenome-ChestCT/dataset/valid_region_mask', report_file='/data/chenzhixuan/data/RadGenome-ChestCT/dataset/radgenome_files/validation_region_report.csv', wrong_path='/jhcnas5/chenzhixuan/data/RadGenome-ChestCT/processed_code/wrong_files/valid_wrong_cases.json', monai_cache_dir='/jhcnas5/chenzhixuan/data/RadGenome-ChestCT/cache', data_root='./Data/data/', cap_data_path='./Data/data/M3D_Cap_npy/M3D_Cap.json', vqa_data_train_path='./Data/data/M3D-VQA/M3D_VQA_train.csv', vqa_data_val_path='./Data/data/M3D-VQA/M3D_VQA_val.csv', vqa_data_test_path='./Data/data/M3D-VQA/M3D_VQA_test.csv', vqa_yn_data_train_path='./Data/data/M3D-VQA/M3D_VQA_yn_train.csv', seg_data_path='./Data/data/M3D_Seg_npy/', refseg_data_train_path='./Data/data/M3D_RefSeg_npy/M3D_RefSeg.csv', refseg_data_test_path='./Data/data/M3D_RefSeg_npy/M3D_RefSeg_test.csv')
```

```
training_args: TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, cache_dir=None, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_backend=nccl, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=False, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, do_eval=True, do_predict=False, do_train=False, eval_accumulation_steps=1, eval_delay=0, eval_steps=0.04, evaluation_strategy=steps, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=False, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=0.0001, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=./LaMed/output/LaMed-pretrain-test/runs/Sep23_16-07-19_jhcpu7, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1, logging_strategy=steps, lora_alpha=32, lora_bias=none, lora_dropout=0.05, lora_enable=True, lora_r=8, lora_weight_path=, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, model_max_length=2048, mp_parameters=, no_cuda=False, num_train_epochs=1, optim=adamw_torch, optim_args=None, output_dir=./LaMed/output/LaMed-pretrain-test, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=False, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=./LaMed/output/LaMed-pretrain-test, save_on_each_node=False, save_safetensors=False, save_steps=2000, save_strategy=steps, save_total_limit=2, seed=42, sharded_ddp=[], skip_memory_metrics=True, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.0, xpu_backend=None, )
```

Can you help me solve this problem?
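(Editor's note: a loss that collapses to 0 after the first step often means no parameters are actually receiving gradient updates. Below is a minimal diagnostic sketch, assuming a standard PyTorch/Hugging Face model object named `model`; it is not part of the M3D code.)

```python
# Minimal diagnostic sketch (assumes a PyTorch / Hugging Face model named `model`).
# Count how many parameters are trainable before starting training.
trainable, frozen = 0, 0
for name, param in model.named_parameters():
    if param.requires_grad:
        trainable += param.numel()
    else:
        frozen += param.numel()
print(f"trainable params: {trainable:,} | frozen params: {frozen:,}")

# Optionally, after one forward/backward pass, confirm gradients are non-zero:
# for name, param in model.named_parameters():
#     if param.requires_grad and param.grad is not None:
#         print(name, param.grad.abs().sum().item())
```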

baifanxxx commented 1 week ago

Hi,

I see that you set freeze_backbone=True. In that case, requires_grad is set to False for the backbone parameters. Look here: https://github.com/BAAI-DCAI/M3D/blob/44371113bd64eb4cbc88ac9f1d925735ea589f18/LaMed/src/train/train.py#L319C9-L319C42.
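(For reference, LLaVA-style training scripts usually implement this flag by freezing the whole language-model backbone in one call. The snippet below is a hedged sketch of that common pattern, not the exact code at the linked line.)

```python
# Sketch of the typical freeze_backbone pattern in LLaVA-style train.py scripts
# (assumed behaviour of the linked line, not copied from M3D):
if model_args.freeze_backbone:
    # Sets requires_grad=False for every parameter of the LLM backbone,
    # so none of its weights receive gradient updates.
    model.model.requires_grad_(False)
```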

zhi-xuan-chen commented 1 week ago

Yes, but I set requires_grad to True for the mm_projector and the LLM, so they can be trained.
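(For illustration, selectively re-enabling gradients for the projector and the LoRA adapters after freezing the backbone would usually look like the sketch below; the attribute name `mm_projector` and the `lora_` parameter prefix are assumptions based on typical LLaVA/PEFT setups, not verified against the M3D code.)

```python
# Hedged sketch: unfreeze only the multimodal projector and LoRA weights
# (attribute/parameter names are assumptions, not taken from M3D).
for param in model.get_model().mm_projector.parameters():
    param.requires_grad = True

for name, param in model.named_parameters():
    if "lora_" in name:  # LoRA adapter weights injected by PEFT
        param.requires_grad = True
```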
