Closed: zhi-xuan-chen closed this issue 3 weeks ago
Hi,
I see that you set freeze_backbone=True. In that case requires_grad is turned off for the backbone. See here: https://github.com/BAAI-DCAI/M3D/blob/44371113bd64eb4cbc88ac9f1d925735ea589f18/LaMed/src/train/train.py#L319C9-L319C42
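That flag is assumed to do roughly the following, i.e. turn off gradients for the whole backbone (a minimal PyTorch sketch, not the exact repo code):

```python
import torch.nn as nn

def freeze_backbone(model: nn.Module) -> None:
    """Sketch of the assumed effect of freeze_backbone=True:
    disable gradients for every parameter of the base LLM."""
    for p in model.parameters():
        p.requires_grad = False  # same effect as model.requires_grad_(False)
```

If nothing is unfrozen afterwards, no parameter receives gradients.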
Yes, but I set requires_grad to True for the mm_projector and the LLM, so they can still be trained.
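Roughly like this (a sketch of the idea only; "mm_projector" and "lora_" are assumed name patterns, and the real attribute names in the training script may differ):

```python
import torch.nn as nn

def unfreeze_by_name(model: nn.Module, keywords=("mm_projector", "lora_")) -> int:
    """Re-enable gradients for parameters whose names contain any of the given
    keywords (here: the multimodal projector and the LoRA adapters) and return
    the number of trainable parameters afterwards."""
    for name, p in model.named_parameters():
        if any(k in name for k in keywords):
            p.requires_grad = True
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Printing that count once before training starts is a quick sanity check that the projector and the LoRA weights are really trainable.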
Hello, I found a strange loss during training, as follows.
The loss at the first step is normal, but it becomes 0 from the second step onward. The only change I made was enabling gradient checkpointing to save memory (a related sketch follows my settings below).
Here are my settings:
model_args: ModelArguments(version='v0', model_name_or_path='/jhcnas5/chenzhixuan/checkpoints/Llama-2-7b-chat-hf', model_type='llama2', freeze_backbone=True, pretrain_mllm=None, tune_mm_mlp_adapter=False, pretrain_mm_mlp_adapter=None, image_channel=1, image_size=(32, 256, 256), patch_size=(4, 16, 16), vision_tower='vit3d', vision_select_layer=-1, vision_select_feature='patch', pretrain_vision_model=None, freeze_vision_tower=False, mm_projector_type='spp', proj_layer_type='mlp', proj_layer_num=2, proj_pooling_type='spatial', proj_pooling_size=2, segmentation_module=None, pretrain_seg_module=None)
data_args: DataArguments(data_folder='/data/chenzhixuan/data/RadGenome-ChestCT/dataset/valid_preprocessed', mask_folder='/data/chenzhixuan/data/RadGenome-ChestCT/dataset/valid_region_mask', report_file='/data/chenzhixuan/data/RadGenome-ChestCT/dataset/radgenome_files/validation_region_report.csv', wrong_path='/jhcnas5/chenzhixuan/data/RadGenome-ChestCT/processed_code/wrong_files/valid_wrong_cases.json', monai_cache_dir='/jhcnas5/chenzhixuan/data/RadGenome-ChestCT/cache', data_root='./Data/data/', cap_data_path='./Data/data/M3D_Cap_npy/M3D_Cap.json', vqa_data_train_path='./Data/data/M3D-VQA/M3D_VQA_train.csv', vqa_data_val_path='./Data/data/M3D-VQA/M3D_VQA_val.csv', vqa_data_test_path='./Data/data/M3D-VQA/M3D_VQA_test.csv', vqa_yn_data_train_path='./Data/data/M3D-VQA/M3D_VQA_yn_train.csv', seg_data_path='./Data/data/M3D_Seg_npy/', refseg_data_train_path='./Data/data/M3D_RefSeg_npy/M3D_RefSeg.csv', refseg_data_test_path='./Data/data/M3D_RefSeg_npy/M3D_RefSeg_test.csv')
training_args: TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
cache_dir=None,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=nccl,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=1,
eval_delay=0,
eval_steps=0.04,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0001,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./LaMed/output/LaMed-pretrain-test/runs/Sep23_16-07-19_jhcpu7,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1,
logging_strategy=steps,
lora_alpha=32,
lora_bias=none,
lora_dropout=0.05,
lora_enable=True,
lora_r=8,
lora_weight_path=,
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
model_max_length=2048,
mp_parameters=,
no_cuda=False,
num_train_epochs=1,
optim=adamw_torch,
optim_args=None,
output_dir=./LaMed/output/LaMed-pretrain-test,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=./LaMed/output/LaMed-pretrain-test,
save_on_each_node=False,
save_safetensors=False,
save_steps=2000,
save_strategy=steps,
save_total_limit=2,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
Can you help me solve this problem?
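One combination worth double-checking is gradient checkpointing together with a mostly frozen backbone. A generic Transformers sketch of the relevant calls (not code from this repo; the checkpoint path is a placeholder):

```python
from transformers import AutoModelForCausalLM

# Placeholder path: substitute the local Llama-2-7b-chat-hf checkpoint.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

model.gradient_checkpointing_enable()
# When most of the model is frozen, the checkpointed blocks may receive inputs
# that do not require grad, which can silently break the backward pass; this
# hook makes the input embeddings emit grad-requiring outputs again.
model.enable_input_require_grads()
model.config.use_cache = False  # use_cache is incompatible with checkpointing
```

Note that the dump above shows gradient_checkpointing=False, so whether this applies depends on where checkpointing is actually turned on.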