BAAI-DCAI / M3D

M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models
MIT License

Problems about finetuning #7

Open shiym2000 opened 3 weeks ago

shiym2000 commented 3 weeks ago

When I finetuned the model, I ran into the following problem. I have downloaded the segmentation datasets from Hugging Face and unzipped them. It seems the segmentation data needs further preprocessing. What should I do?

```
rank3: Traceback (most recent call last):
rank3:   File "/mnt/nvme_share/shiym/projects_3rd/M3D/LaMed/src/train/train.py", line 407, in <module>
rank3:   File "/mnt/nvme_share/shiym/projects_3rd/M3D/LaMed/src/train/train.py", line 394, in main
rank3:   File "/home/shiym/anaconda3/envs/m3d/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
rank3:     return inner_training_loop(
rank3:   File "/home/shiym/anaconda3/envs/m3d/lib/python3.10/site-packages/transformers/trainer.py", line 2230, in _inner_training_loop
rank3:     for step, inputs in enumerate(epoch_iterator):
rank3:   File "/home/shiym/anaconda3/envs/m3d/lib/python3.10/site-packages/accelerate/data_loader.py", line 464, in __iter__
rank3:     next_batch = next(dataloader_iter)
rank3:   File "/home/shiym/anaconda3/envs/m3d/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
rank3:     data = self._next_data()
rank3:   File "/home/shiym/anaconda3/envs/m3d/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1326, in _next_data
rank3:     return self._process_data(data)
rank3:   File "/home/shiym/anaconda3/envs/m3d/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
rank3:   File "/home/shiym/anaconda3/envs/m3d/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
rank3:     raise exception
rank3: ValueError: Caught ValueError in DataLoader worker process 1.
rank3: Original Traceback (most recent call last):
rank3:   File "/home/shiym/anaconda3/envs/m3d/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
rank3:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
rank3:   File "/home/shiym/anaconda3/envs/m3d/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
rank3:     data = [self.dataset[idx] for idx in possibly_batched_index]
rank3:   File "/home/shiym/anaconda3/envs/m3d/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
rank3:     data = [self.dataset[idx] for idx in possibly_batched_index]
rank3:   File "/mnt/nvme_share/shiym/projects_3rd/M3D/LaMed/src/dataset/multi_dataset.py", line 1197, in __getitem__
rank3:     return self.dataset[idx]
rank3:   File "/home/shiym/anaconda3/envs/m3d/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 348, in __getitem__
rank3:     return self.datasets[dataset_idx][sample_idx]
rank3:   File "/mnt/nvme_share/shiym/projects_3rd/M3D/LaMed/src/dataset/multi_dataset.py", line 1121, in __getitem__
rank3:     return self.dataset[idx]
rank3:   File "/home/shiym/anaconda3/envs/m3d/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 348, in __getitem__
rank3:     return self.datasets[dataset_idx][sample_idx]
rank3:   File "/mnt/nvme_share/shiym/projects_3rd/M3D/LaMed/src/dataset/multi_dataset.py", line 907, in __getitem__
rank3:     cls_id = int(os.path.basename(seg_path).split('_')[1].split('.')[0])
rank3: ValueError: invalid literal for int() with base 10: '(4, 512, 512, 103)'
```
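
For context, the last frame boils down to the filename parsing below; the example filenames are made up, only the parsing expression comes from the traceback.

```python
import os

def parse_cls_id(seg_path: str) -> int:
    # Same expression as multi_dataset.py line 907: the class id is expected to be
    # the token between the first '_' and the following '.' in the basename.
    return int(os.path.basename(seg_path).split('_')[1].split('.')[0])

print(parse_cls_id("mask_4.npy"))                   # -> 4 (assumed expected naming)
print(parse_cls_id("mask_(4, 512, 512, 103).npy"))  # ValueError: invalid literal for int() with base 10: '(4, 512, 512, 103)'
```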

baifanxxx commented 3 weeks ago

Hi,

I appreciate your interest in our work. For the segmentation task, you need to preprocess the segmentation data in two steps. First, convert the nii.gz (or other format) volumes to .npy files and generate a JSON split file, like here. Please note that this step may differ between segmentation datasets.

Second, preprocess these .npy images and masks into a predefined format, like here. In this step, we resize the volumes, separate the mask channels, and generate a JSON file.

Then we can use the preprocessed datasets in our training. When we read a data pair from the JSON file, we get a 3D .npy image and a one-channel mask.
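
For reference, here is a rough sketch of what those two steps can look like. The paths, filename patterns, target shape, and per-case folder layout below are assumptions for illustration only; the scripts linked above are the actual reference.

```python
# Sketch of the two preprocessing steps (hypothetical layout and paths).
import json
import os

import nibabel as nib      # read nii.gz volumes
import numpy as np
from scipy import ndimage  # resize 3D arrays

RAW_DIR = "./data/raw_seg"        # assumption: *_img.nii.gz / *_mask.nii.gz pairs
NPY_DIR = "./data/npy_seg"        # step 1 output
OUT_DIR = "./data/processed_seg"  # step 2 output
TARGET_SHAPE = (32, 256, 256)     # assumption: target (D, H, W)

def resize_3d(volume, target_shape, order):
    """Resize a 3D array; order=0 keeps mask labels intact."""
    factors = [t / s for t, s in zip(target_shape, volume.shape)]
    return ndimage.zoom(volume, zoom=factors, order=order)

# ---- Step 1: nii.gz -> .npy, plus a JSON split file ----
os.makedirs(NPY_DIR, exist_ok=True)
split = {"train": [], "test": []}
for name in sorted(os.listdir(RAW_DIR)):
    if not name.endswith("_img.nii.gz"):
        continue
    case = name[: -len("_img.nii.gz")]
    img = nib.load(os.path.join(RAW_DIR, f"{case}_img.nii.gz")).get_fdata()
    mask = nib.load(os.path.join(RAW_DIR, f"{case}_mask.nii.gz")).get_fdata()
    np.save(os.path.join(NPY_DIR, f"{case}_img.npy"), img.astype(np.float32))
    np.save(os.path.join(NPY_DIR, f"{case}_mask.npy"), mask.astype(np.uint8))
    split["train"].append(case)   # the real scripts make a proper train/test split
with open(os.path.join(NPY_DIR, "split.json"), "w") as f:
    json.dump(split, f, indent=2)

# ---- Step 2: resize, separate mask channels, write the final JSON ----
os.makedirs(OUT_DIR, exist_ok=True)
records = []
for case in split["train"]:
    case_dir = os.path.join(OUT_DIR, case)
    os.makedirs(case_dir, exist_ok=True)
    img = np.load(os.path.join(NPY_DIR, f"{case}_img.npy"))
    mask = np.load(os.path.join(NPY_DIR, f"{case}_mask.npy"))
    img = resize_3d(img, TARGET_SHAPE, order=1)[None]  # add channel dim -> (1, D, H, W)
    mask = resize_3d(mask, TARGET_SHAPE, order=0)
    img_path = os.path.join(case_dir, "image.npy")
    np.save(img_path, img.astype(np.float32))
    # One file (and one JSON record) per foreground class -> one-channel masks.
    for cls_id in np.unique(mask):
        if cls_id == 0:
            continue
        seg_path = os.path.join(case_dir, f"mask_{int(cls_id)}.npy")
        np.save(seg_path, (mask == cls_id).astype(np.uint8))
        records.append({"image": img_path, "label": seg_path})
with open(os.path.join(OUT_DIR, "dataset.json"), "w") as f:
    json.dump(records, f, indent=2)
```

Naming the per-class files mask_<cls_id>.npy keeps the class id recoverable from the basename, which is exactly what the failing line in the traceback above tries to parse.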

If you have any questions, we are glad to reply.

shiym2000 commented 3 weeks ago

Thank you for your reply. Furthermore, if I want to take the finetuned model and finetune it further on my own dataset, how should I modify finetune_lora.sh and train.py?

baifanxxx commented 3 weeks ago

Hi,

You should set --pretrain_mllm in finetune_lora.sh. It would be better to convert our safetensors model to a .bin model. Additionally, you should not set --pretrain_mm_mlp_adapter.

Then you can continue finetuning from our model.
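
In case it is useful, a minimal sketch of the safetensors-to-.bin conversion (the paths are placeholders, and it assumes the safetensors package is installed):

```python
# Merge the *.safetensors shards of a checkpoint into a single PyTorch .bin state dict.
import glob

import torch
from safetensors.torch import load_file

ckpt_dir = "./LaMed/output/LaMed-Phi3-4B"            # placeholder: dir with *.safetensors shards
out_path = "./LaMed/output/LaMed-Phi3-4B/model.bin"  # placeholder: target .bin file

state_dict = {}
for shard in sorted(glob.glob(f"{ckpt_dir}/*.safetensors")):
    state_dict.update(load_file(shard))              # each shard maps param name -> tensor

torch.save(state_dict, out_path)
```

The resulting .bin file is what --pretrain_mllm should point to in finetune_lora.sh.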

shiym2000 commented 3 weeks ago

What about --model_name_or_path, --pretrain_vision_model, and --pretrain_seg_module? Could you give an example of the new finetune_lora.sh? Thank you very much.

baifanxxx commented 3 weeks ago

Hi,

Please try this:

```bash
accelerate launch LaMed/src/train/train.py \
    --version v0 \
    --model_name_or_path microsoft/Phi-3-mini-4k-instruct \
    --model_type phi3 \
    --lora_enable True \
    --vision_tower vit3d \
    --pretrain_vision_model ./LaMed/pretrained_model/M3D-CLIP/pretrained_ViT.bin \
    --segmentation_module segvol \
    --pretrain_seg_module ./LaMed/pretrained_model/SegVol/pytorch_model.bin \
    --pretrain_mllm ./LaMed/output/LaMed-Phi3-4B/model.bin \
    --bf16 True \
    --output_dir ./LaMed/output/LaMed-Phi3-4B-finetune \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "steps" \
    --eval_accumulation_steps 1 \
    --eval_steps 0.04 \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 0.001 \
    --gradient_checkpointing False \
    --dataloader_pin_memory True \
    --dataloader_num_workers 8 \
    --report_to tensorboard
```

baifanxxx commented 3 weeks ago

I have converted the safetensors model to a .bin model here, and you can use this .bin model directly.

shiym2000 commented 3 weeks ago

Thank you very much! I will give it a try.