LLaVA-VL / LLaVA-NeXT


Where can I get the pretrained model for finetuning of LLaVA-NeXT? #218

Closed Bleking closed 2 months ago

Bleking commented 2 months ago

Hi, I am trying to finetune LLaVA-NeXT on my custom dataset, using the "finetune_clip.sh" shell script.

I made some edits to the script for convenience and to fit my task, like this:

export OMP_NUM_THREADS=8
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO

LLM_VERSION="Qwen/Qwen2-7B-Instruct"
LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
VISION_MODEL_VERSION="openai/clip-vit-large-patch14-336"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"

############### Pretrain ################

PROMPT_VERSION="qwen_1_5"

BASE_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain"
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"

NUM_GPUS=${NUM_GPUS:-2}
NNODES=${NNODES:-1}
RANK=${RANK:-0}
ADDR=${ADDR:-'localhost'}
PORT=${PORT:-'29500'}
MID_RUN_NAME=${MID_RUN_NAME:-'floorplan_vqa_1000_results'}

echo "NUM_GPUS: ${NUM_GPUS}"
echo "NNODES: ${NNODES}"
echo "RANK: ${RANK}"
echo "ADDR: ${ADDR}"
echo "PORT: ${PORT}"
echo "MID_RUN_NAME: ${MID_RUN_NAME}"

ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${NUM_GPUS}" --nnodes="${NNODES}" --node_rank="${RANK}" --master_addr="${ADDR}" --master_port="${PORT}" \
    llava/train/train_xformers.py \
    --deepspeed scripts/zero3_offload.json \
    --model_name_or_path ${LLM_VERSION} \
    --version ${PROMPT_VERSION} \
    --data_path='testdataset1/masters/floorplan_vqa/floorplan_vqa_1000.json' \
    --image_folder /home/work/testdataset1/LLaVA/playground/data/floorplan_data/SPA \
    --pretrain_mm_mlp_adapter="./checkpoints/open-llava-next-llama3-8b/pretrain/mm_projector.bin" \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --mm_vision_tower_lr=2e-6 \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --group_by_modality_length True \
    --image_aspect_ratio anyres \
    --image_grid_pinpoints "[(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]" \
    --mm_patch_merge_type spatial_unpad \
    --fp16 True \
    --run_name $MID_RUN_NAME \
    --output_dir "./checkpoints/${MID_RUN_NAME}" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 3000 \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 32768 \
    --gradient_checkpointing True \
    --dataloader_num_workers 16 \
    --lazy_preprocess True \
    --report_to wandb \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True \
    --attn_implementation sdpa

# You can delete the sdpa attn_implementation if you want to use flash attn

For your information, the edits include:

I had to switch to xformers (train_xformers.py) since I have been having trouble using flash attention in my environment.

My current problem is with the pretrained model, especially the 'pretrain_mm_mlp_adapter' argument.

Since I could not find a proper .bin model file, I retrieved the adapter from https://huggingface.co/Lin-Chen/open-llava-next-llama3-8b/tree/main/pretrain, which I found referenced in this repository.

However, I constantly face size mismatch errors and I am stuck.

Could this be due to a size difference between my custom dataset and the pretrained model?
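For reference, a quick way to see what the adapter checkpoint expects is to load it and print the tensor shapes (a minimal sketch, assuming the .bin file is a plain PyTorch state dict, which is how these projectors are saved):

```python
# Minimal sketch: inspect the shapes stored in a pretrained mm_projector.bin.
import torch

ckpt = torch.load(
    "./checkpoints/open-llava-next-llama3-8b/pretrain/mm_projector.bin",
    map_location="cpu",
)
for name, tensor in ckpt.items():
    print(name, tuple(tensor.shape))

# A projector pretrained against Llama-3-8B (hidden size 4096) with a
# CLIP-L/14-336 tower (hidden size 1024) stores a first linear of [4096, 1024],
# while a model built on Qwen2-7B-Instruct (hidden size 3584) expects
# [3584, 1024], which would explain exactly this kind of size mismatch.
```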

I will also share the error message below for a better understanding.

(llava) work@main1[s010-jiwon-thesis]:~/testdataset1/LLaVA-NeXT$ bash scripts/train/finetune_clip.sh 
BASE_RUN_NAME: llavanext-openai_clip-vit-large-patch14-336-Qwen_Qwen2-7B-Instruct-mlp2x_gelu-pretrain_blip558k_plain
NUM_GPUS: 2
NNODES: 1
RANK: 0
ADDR: localhost
PORT: 29500
MID_RUN_NAME: floorplan_vqa_1000_results
Please install pyav to use video processing functions.Please install pyav to use video processing functions.

OpenCLIP not installed
OpenCLIP not installed
[2024-09-08 09:19:06,010] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-08 09:19:06,010] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-09-08 09:19:09,817] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-08 09:19:09,817] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
The speedups for torchdynamo mostly come wih GPU Ampere or higher and which is not detected here.
/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Rank 0:  Overwriting config with {'use_pos_skipping': False, 'pos_skipping_range': 4096, 'mm_spatial_pool_mode': 'bilinear'}
[2024-09-08 09:19:10,116] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-09-08 09:19:10,212] [INFO] [comm.py:652:init_distributed] cdb=None
The speedups for torchdynamo mostly come wih GPU Ampere or higher and which is not detected here.
/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
main1:608321:608321 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608321:608321 [0] NCCL INFO Bootstrap : Using eth0:10.63.0.2<0>
main1:608321:608321 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
main1:608321:608321 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
main1:608321:608321 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
main1:608321:608321 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)

main1:608321:608321 [0] misc/cudawrap.cc:188 NCCL WARN Failed to find CUDA library /opt/kernel/libcuda.so (NCCL_CUDA_PATH='/opt/kernel') : /opt/kernel/libcuda.so: cannot open shared object file: No such file or directory
NCCL version 2.20.5+cuda12.4
[2024-09-08 09:19:10,599] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
main1:608321:608442 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
main1:608321:608442 [0] NCCL INFO P2P plugin IBext
main1:608321:608442 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608321:608442 [0] NCCL INFO NET/IB : No device found.
main1:608321:608442 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
main1:608321:608442 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608321:608442 [0] NCCL INFO NET/IB : No device found.
main1:608321:608442 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608321:608442 [0] NCCL INFO NET/Socket : Using [0]eth0:10.63.0.2<0>
main1:608321:608442 [0] NCCL INFO Using non-device net plugin version 0
main1:608321:608442 [0] NCCL INFO Using network Socket

main1:608322:608322 [1] misc/cudawrap.cc:188 NCCL WARN Failed to find CUDA library /opt/kernel/libcuda.so (NCCL_CUDA_PATH='/opt/kernel') : H}�
main1:608322:608322 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608322:608322 [1] NCCL INFO Bootstrap : Using eth0:10.63.0.2<0>
main1:608322:608322 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
main1:608322:608322 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
main1:608322:608322 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
main1:608322:608322 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
main1:608322:608447 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
main1:608322:608447 [1] NCCL INFO P2P plugin IBext
main1:608322:608447 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608322:608447 [1] NCCL INFO NET/IB : No device found.
main1:608322:608447 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
main1:608322:608447 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608322:608447 [1] NCCL INFO NET/IB : No device found.
main1:608322:608447 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608322:608447 [1] NCCL INFO NET/Socket : Using [0]eth0:10.63.0.2<0>
main1:608322:608447 [1] NCCL INFO Using non-device net plugin version 0
main1:608322:608447 [1] NCCL INFO Using network Socket
main1:608322:608447 [1] NCCL INFO comm 0xbbb1280 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d8000 commId 0x97ad33e05ea99c2c - Init START
main1:608321:608442 [0] NCCL INFO comm 0xd83b8a0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 86000 commId 0x97ad33e05ea99c2c - Init START
main1:608322:608447 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000
main1:608321:608442 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000
main1:608322:608447 [1] NCCL INFO comm 0xbbb1280 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
main1:608321:608442 [0] NCCL INFO comm 0xd83b8a0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
main1:608322:608447 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
main1:608321:608442 [0] NCCL INFO Channel 00/02 :    0   1
main1:608322:608447 [1] NCCL INFO P2P Chunksize set to 131072
main1:608321:608442 [0] NCCL INFO Channel 01/02 :    0   1
main1:608321:608442 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
main1:608321:608442 [0] NCCL INFO P2P Chunksize set to 131072
main1:608322:608447 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
main1:608321:608442 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
main1:608322:608447 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
main1:608321:608442 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
main1:608322:608447 [1] NCCL INFO Connected all rings
main1:608322:608447 [1] NCCL INFO Connected all trees
main1:608321:608442 [0] NCCL INFO Connected all rings
main1:608322:608447 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
main1:608321:608442 [0] NCCL INFO Connected all trees
main1:608322:608447 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
main1:608321:608442 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
main1:608321:608442 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
main1:608322:608447 [1] NCCL INFO comm 0xbbb1280 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d8000 commId 0x97ad33e05ea99c2c - Init COMPLETE
main1:608321:608442 [0] NCCL INFO comm 0xd83b8a0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 86000 commId 0x97ad33e05ea99c2c - Init COMPLETE
[2024-09-08 09:19:12,978] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 678, num_elems = 15.23B
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 4/4 [00:08<00:00,  2.19s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 4/4 [00:09<00:00,  2.37s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Rank 0:  Prompt version: qwen_1_5
Rank 0:  Loading vision tower: openai/clip-vit-large-patch14-336
[2024-09-08 09:19:23,971] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2024-09-08 09:19:24,105] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2024-09-08 09:19:24,405] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 1069, num_elems = 15.53B
/home/work/testdataset1/LLaVA-NeXT/llava/model/llava_arch.py:108: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  mm_projector_weights = torch.load(pretrain_mm_mlp_adapter, map_location="cpu")
/home/work/testdataset1/LLaVA-NeXT/llava/model/llava_arch.py:108: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  mm_projector_weights = torch.load(pretrain_mm_mlp_adapter, map_location="cpu")
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/work/testdataset1/LLaVA-NeXT/llava/train/train_mem.py", line 4, in <module>
[rank1]:     train()
[rank1]:   File "/home/work/testdataset1/LLaVA-NeXT/llava/train/train.py", line 1549, in train
[rank1]:     model.get_model().initialize_vision_modules(model_args=model_args, fsdp=training_args.fsdp)
[rank1]:   File "/home/work/testdataset1/LLaVA-NeXT/llava/model/llava_arch.py", line 113, in initialize_vision_modules
[rank1]:     incompatible_keys = self.mm_projector.load_state_dict(get_w(mm_projector_weights, "mm_projector"))
[rank1]:   File "/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
[rank1]:     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank1]: RuntimeError: Error(s) in loading state_dict for Sequential:
[rank1]:        size mismatch for 0.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([3584, 1024]).
[rank1]:        size mismatch for 0.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3584]).
[rank1]:        size mismatch for 2.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([3584, 3584]).
[rank1]:        size mismatch for 2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3584]).
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/work/testdataset1/LLaVA-NeXT/llava/train/train_mem.py", line 4, in <module>
[rank0]:     train()
[rank0]:   File "/home/work/testdataset1/LLaVA-NeXT/llava/train/train.py", line 1549, in train
[rank0]:     model.get_model().initialize_vision_modules(model_args=model_args, fsdp=training_args.fsdp)
[rank0]:   File "/home/work/testdataset1/LLaVA-NeXT/llava/model/llava_arch.py", line 113, in initialize_vision_modules
[rank0]:     incompatible_keys = self.mm_projector.load_state_dict(get_w(mm_projector_weights, "mm_projector"))
[rank0]:   File "/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
[rank0]:     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank0]: RuntimeError: Error(s) in loading state_dict for Sequential:
[rank0]:        size mismatch for 0.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([3584, 1024]).
[rank0]:        size mismatch for 0.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3584]).
[rank0]:        size mismatch for 2.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([3584, 3584]).
[rank0]:        size mismatch for 2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3584]).
W0908 09:19:26.526000 140475221702464 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 608322 closing signal SIGTERM
E0908 09:19:26.528000 140475221702464 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 608321) of binary: /home/work/anaconda3/envs/llava/bin/python
Traceback (most recent call last):
  File "/home/work/anaconda3/envs/llava/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
llava/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-08_09:19:26
  host      : main1
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 608321)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Thank you.

mylesgoose commented 2 months ago

I can give you one for the llama 3.1 version of the model https://drive.google.com/file/d/1TIBGKdDCX29YY11yDG5IafKz66jdcLq_/view?usp=sharing

Bleking commented 2 months ago

> I can give you one for the llama 3.1 version of the model https://drive.google.com/file/d/1TIBGKdDCX29YY11yDG5IafKz66jdcLq_/view?usp=sharing

Thank you so much. Did you pretrain it yourself?

However, I am still facing the same size mismatch issue, despite having also tried the 'finetune_siglip_a4.sh' script.

[rank0]: RuntimeError: Error(s) in loading state_dict for Sequential:
[rank0]:        size mismatch for 0.weight: copying a param with shape torch.Size([4096, 1152]) from checkpoint, the shape in current model is torch.Size([3584, 1152]).
[rank0]:        size mismatch for 0.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3584]).
[rank0]:        size mismatch for 2.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([3584, 3584]).
[rank0]:        size mismatch for 2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3584]).

I might have to check my settings for possible issues, but may I ask whether you know what decides the shape of the current model we use?
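For reference, my understanding is that the projector shape is decided by the vision tower and the language model, not by the dataset: mlp2x_gelu is roughly Linear(vision_hidden, llm_hidden), GELU, Linear(llm_hidden, llm_hidden). A rough sketch with hidden sizes taken from the public model configs (not the repo's actual builder):

```python
# Rough sketch of where the mlp2x_gelu projector shapes come from.
import torch.nn as nn

VISION_HIDDEN = {
    "openai/clip-vit-large-patch14-336": 1024,
    "google/siglip-so400m-patch14-384": 1152,
}
LLM_HIDDEN = {
    "meta-llama/Meta-Llama-3.1-8B-Instruct": 4096,
    "Qwen/Qwen2-7B-Instruct": 3584,
}

def mlp2x_gelu(vision_name, llm_name):
    v, h = VISION_HIDDEN[vision_name], LLM_HIDDEN[llm_name]
    return nn.Sequential(nn.Linear(v, h), nn.GELU(), nn.Linear(h, h))

# An adapter pretrained with SigLIP + Llama-3.1-8B has weights [4096, 1152] and
# [4096, 4096]; building the model on Qwen2-7B expects [3584, 1152] and
# [3584, 3584] instead, which matches the mismatch above.
print(mlp2x_gelu("google/siglip-so400m-patch14-384", "Qwen/Qwen2-7B-Instruct"))
```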

mylesgoose commented 2 months ago

try to load this model, see if it makes any difference:

LLM_VERSION="mylesgoose/Meta-Llama-3.1-8B-Instruct-goose-abliterated"
LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
VISION_MODEL_VERSION="google/siglip-so400m-patch14-384"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"

############### Pretrain ################

PROMPT_VERSION=plain

BASE_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain"
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"

CKPT_PATH=$LLM_VERSION

deepspeed llava/train/train_mem.py \
    --deepspeed scripts/zero3.json \
    --model_name_or_path ${CKPT_PATH} \
    --version ${PROMPT_VERSION} \
    --data_path ./data/llava_data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json \
    --image_folder ./data/llava_data/LLaVA-Pretrain/images \
    --pretrain_mm_mlp_adapter="./checkpoints/projectors/${BASE_RUN_NAME}/mm_projector.bin" \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --mm_vision_tower_lr=2e-6 \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --group_by_modality_length True \
    --image_aspect_ratio anyres \
    --image_grid_pinpoints "[(384, 768), (768, 384), (768, 768), (1152, 384), (384, 1152)]" \
    --mm_patch_merge_type spatial_unpad \
    --bf16 True \
    --output_dir "./checkpoints/Meta-Llama-3.1-8B-Instruct-goose-abliterated-pre" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 6 \
    --gradient_accumulation_steps 20 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 2 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 131072 \
    --gradient_checkpointing True \
    --dataloader_num_workers 6 \
    --lazy_preprocess True \
    --report_to wandb \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True \
    --attn_implementation flash_attention_2 \
    --run_name ${BASE_RUN_NAME}

Bleking commented 2 months ago

> try to load this model, see if it makes any difference (the pretrain script quoted above)

Sorry for the late reply. Unfortunately, this setting ended up giving me a CUDA out of memory error. I am trying several settings but I still can't find one that works.

For your information, my server's resource group is RTXQ, with two GPUs, 16 CPU cores, and 160.00 GiB of memory.

I will also share with you the 'nvidia-smi' output.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  CUDA GPU                       On  | 00000000:86:00.0 Off |                  Off |
| N/A   26C    P8              12W / 250W |     48MiB / 23040MiB |     N/A      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  CUDA GPU                       On  | 00000000:D8:00.0 Off |                  Off |
| N/A   25C    P8              13W / 250W |     48MiB / 23040MiB |     N/A      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
mylesgoose commented 2 months ago

I think this means you managed to load the model, and the problem is that your system resources are not enough to train. For example, I have 7 RTX 4090s, each with 24 GB of VRAM, and 256 GB of system RAM, and this is just enough to do the training with the settings I showed you. When I had 6 cards it would say CUDA out of memory. If I tried to offload to CPU RAM, that would use over 260 GB of RAM in addition to the GPU RAM. So you can probably only train with QLoRA or LoRA, not a full train.
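For what it's worth, a rough back-of-the-envelope estimate (assuming bf16 weights plus standard Adam states, and ignoring activations entirely) shows why an 8B full finetune does not fit into 2 x 24 GB without LoRA/QLoRA or heavy offloading:

```python
# Back-of-the-envelope memory estimate for a full finetune of an ~8B model,
# ignoring activations and any ZeRO sharding/offload. Numbers are approximate.
params = 8e9
weights_bf16 = params * 2        # bf16 parameters
grads_bf16 = params * 2          # bf16 gradients
adam_states_fp32 = params * 8    # exp_avg + exp_avg_sq in fp32
master_fp32 = params * 4         # fp32 master copy kept for mixed precision
total = weights_bf16 + grads_bf16 + adam_states_fp32 + master_fp32
print(f"~{total / 2**30:.0f} GiB before activations")  # roughly 119 GiB
```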

Bleking commented 2 months ago

> I think this means you managed to load the model, and the problem is that your system resources are not enough to train. [...] So you can probably only train with QLoRA or LoRA, not a full train.

Thank you for the advice. Let me add the following flags in 'finetune_onevision.sh': --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5

Also, do you know any way to disable FlashAttention for this, since my GPUs are not Ampere or newer? Should I just add the --attn_implementation sdpa argument and replace 'train_mem.py' with 'train_xformers.py' in the shell file?


export OMP_NUM_THREADS=8
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO

LLM_VERSION="Qwen/Qwen2-0.5B-Instruct" 
# for 7b model we recommend bs=1, accum=2, 16 nodes, 128 gpus, lr=1e-5, warmup=0.03
# for 72b model we recommend bs=1, accum=1, 32 nodes, 256 gpus, lr=1e-5, warmup=0.03
LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
VISION_MODEL_VERSION="google/siglip-so400m-patch14-384"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"

############### Pretrain ################

PROMPT_VERSION="qwen_1_5"

BASE_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain"
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"

NUM_GPUS=${NUM_GPUS:-2}
NNODES=${NNODES:-1}
RANK=${RANK:-0}
ADDR=${ADDR:-'localhost'}
PORT=${PORT:-'29500'}
MID_RUN_NAME=${MID_RUN_NAME:-'floorplan_vqa_1000_results'}

echo "NUM_GPUS: ${NUM_GPUS}"
echo "NNODES: ${NNODES}"
echo "RANK: ${RANK}"
echo "ADDR: ${ADDR}"
echo "PORT: ${PORT}"
echo "MID_RUN_NAME: ${MID_RUN_NAME}"

CKPT_PATH=$LLM_VERSION # this could also be the previous stage checkpoint

ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${NUM_GPUS}" --nnodes="${NNODES}" --node_rank="${RANK}" --master_addr="${ADDR}" --master_port="${PORT}" \
    llava/train/train_xformers.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed scripts/zero3.json \
    --model_name_or_path ${CKPT_PATH} \
    --version ${PROMPT_VERSION} \
    --data_path ./playground/floorplan_vqa_1000.json \
    --image_folder /home/work/testdataset1/LLaVA/playground/data/floorplan_data/ \
    --pretrain_mm_mlp_adapter="./checkpoints/llava-onevision-projectors/0.5b/mm_projector.bin" \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --mm_vision_tower_lr=2e-6 \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --group_by_modality_length True \
    --image_aspect_ratio anyres_max_9 \
    --image_grid_pinpoints  "(1x1),...,(6x6)" \
    --mm_patch_merge_type spatial_unpad \
    --fp16 True \
    --run_name $MID_RUN_NAME \
    --output_dir "./checkpoints/${MID_RUN_NAME}" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 32768 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True \
    --frames_upbound 32 \
    --attn_implementation sdpa

# You can delete the sdpa attn_implementation if you want to use flash attn
mylesgoose commented 2 months ago

Just download the flash attention source code and compile it.

sahilqure commented 2 months ago

@mylesgoose Can you share your pretraining and finetuning code with Llama 3.1? Please push it to your repo. Were you facing a tokenization mismatch error while writing the preprocess function?

mylesgoose commented 2 months ago

> @mylesgoose Can you share your pretraining and finetuning code with Llama 3.1? Please push it to your repo. Were you facing a tokenization mismatch error while writing the preprocess function?

I posted the pretrain code above. However, to run it I had to use local files only and clone the model repo into the actual LLaVA-NeXT directory: for example, I had to create a folder called mylesgoose and place the model there, and the script then automatically loads the model from there instead of the hub. However, it still tries to load the Llama 3.0 tokenizer if you set the prompt type / conversation template to plain or llava_llama_3. For the finetune run I think I modified conversation.py so that it calls the right tokenizer when the model is loaded:

conv_llava_llama_3 = Conversation(
    system="You are a helpful language and vision AI assistant. "
    "your prompt here!",
    roles=("user", "assistant"),
    version="llama_v3",
    messages=[],
    offset=0,
    sep="<|eot_id|>",
    sep_style=SeparatorStyle.LLAMA_3,
    tokenizer_id="mylesgoose/Meta-Llama-3.1-8B-Instruct-goose-abliterated-pre-llava",
    tokenizer=safe_load_tokenizer("mylesgoose/Meta-Llama-3.1-8B-Instruct-goose-abliterated-pre-llava"),
    stop_token_ids=[128009],
)

mylesgoose commented 2 months ago


You also have to ensure you have a transformers version higher than 4.42 or so. For the couple of pip packages that were having errors, I downloaded the source and compiled from source. The transformers version you have installed will be the older version, and that is what gives the tokenization error. For reference, here is my environment:

myles@ubuntu11:~/LLaVA-NeXT$ pip3 list
Package Version Editable project location


absl-py 2.1.0 accelerate 0.34.2 aiofiles 22.1.0 aiohappyeyeballs 2.4.0 aiohttp 3.10.5 aiosignal 1.3.1 aiosqlite 0.20.0 altair 5.4.1 annotated-types 0.7.0 anyio 4.4.0 appdirs 1.4.4 argon2-cffi 23.1.0 argon2-cffi-bindings 21.2.0 arrow 1.3.0 asttokens 2.4.1 astunparse 1.6.3 async-lru 2.0.4 async-timeout 4.0.3 attrs 24.2.0 audioread 3.0.1 av 13.0.0 babel 2.16.0 beartype 0.14.1 beautifulsoup4 4.12.3 better-abc 0.0.3 bidict 0.23.1 bitsandbytes 0.43.3 black 24.1.0 bleach 6.1.0 Brotli 1.1.0 cachetools 5.5.0 certifi 2024.8.30 cffi 1.17.1 cfgv 3.4.0 chardet 5.2.0 charset-normalizer 3.3.2 click 8.1.7 cmake 3.30.2 colorama 0.4.6 comm 0.2.2 contourpy 1.3.0 crcmod 1.7 cryptography 43.0.1 cuda-python 12.4.0 /home/myles/cuda-python-12.4.0 cycler 0.12.1 Cython 3.0.11 DataProperty 1.0.1 datasets 2.16.1 debugpy 1.8.5 decorator 5.1.1 decord 0.6.0 deepspeed 0.15.2+fc22d960 deepspeed-kernels 0.0.1.dev1698255861 defusedxml 0.7.1 Deprecated 1.2.14 dill 0.3.7 distlib 0.3.8 distro 1.9.0 dnspython 2.6.1 docker-pycreds 0.4.0 docopt 0.6.2 docstring_parser 0.16 e 1.4.5 einops 0.8.0 einops-exts 0.0.4 entrypoints 0.4 et-xmlfile 1.1.0 eval_type_backport 0.2.0 evaluate 0.4.2 exceptiongroup 1.2.2 executing 2.1.0 fancy-einsum 0.0.3 fastapi 0.112.4 fastjsonschema 2.20.0 ffmpeg-python 0.2.0 ffmpy 0.4.0 filelock 3.16.0 flash_attn 2.6.3 flatbuffers 24.3.25 fonttools 4.53.1 fqdn 1.5.1 frozenlist 1.4.1 fsspec 2023.10.0 ftfy 6.2.3 future 1.0.0 gast 0.6.0 gitdb 4.0.11 GitPython 3.1.43 google-pasta 0.2.0 gradio 4.43.0 gradio_client 1.3.0 graphviz 0.20.3 grpcio 1.66.1 h11 0.14.0 h5py 3.11.0 hf_transfer 0.1.8 hjson 3.1.0 httpcore 1.0.5 httpx 0.27.2 huggingface-hub 0.24.6 identify 2.6.0 idna 3.8 importlib_metadata 8.4.0 importlib_resources 6.4.4 iniconfig 2.0.0 ipaddress 1.0.23 ipykernel 6.29.5 ipython 8.27.0 ipython-genutils 0.2.0 ipywidgets 8.1.5 isoduration 20.11.0 isort 5.13.2 jaxtyping 0.2.34 jedi 0.19.1 Jinja2 3.1.4 jiter 0.5.0 joblib 1.4.2 json5 0.9.25 jsonlines 4.0.0 jsonpointer 3.0.0 jsonschema 4.23.0 jsonschema-specifications 2023.12.1 jupyter 1.1.1 jupyter_client 8.6.2 jupyter-console 6.6.3 jupyter_core 5.7.2 jupyter-events 0.10.0 jupyter-lsp 2.2.5 jupyter_server 2.14.2 jupyter_server_fileid 0.9.3 jupyter_server_terminals 0.5.3 jupyter_server_ydoc 0.8.0 jupyter-ydoc 0.3.4 jupyterlab 4.2.5 jupyterlab_pygments 0.3.0 jupyterlab_server 2.27.3 jupyterlab_widgets 3.0.13 keras 3.5.0 kiwisolver 1.4.7 latex2mathml 3.77.0 lazy_loader 0.4 Levenshtein 0.25.1 libclang 18.1.1 librosa 0.10.2.post1 linkify-it-py 2.0.3 llava 1.7.0.dev0 /home/myles/LLaVA-NeXT llvmlite 0.43.0 lmms_eval 0.2.3 /home/myles/lmms-eval loguru 0.7.2 lxml 5.3.0 Markdown 3.7 markdown-it-py 3.0.0 markdown2 2.5.0 MarkupSafe 2.1.5 matplotlib 3.9.2 matplotlib-inline 0.1.7 mbstrdecoder 1.1.3 mdit-py-plugins 0.4.1 mdurl 0.1.2 mistune 3.0.2 ml-dtypes 0.4.0 mpmath 1.3.0 msgpack 1.0.8 multidict 6.0.5 multiprocess 0.70.15 mutagen 1.47.0 mypy-extensions 1.0.0 namex 0.0.8 narwhals 1.6.2 nbclassic 1.1.0 nbclient 0.10.0 nbconvert 7.16.4 nbformat 5.10.4 nest-asyncio 1.6.0 networkx 3.3 ninja 1.11.1.1 nltk 3.9.1 nodeenv 1.9.1 notebook 7.2.2 notebook_shim 0.2.4 num2words 0.5.13 numba 0.60.0 numexpr 2.10.1 numpy 1.26.4 nvidia-cublas-cu12 12.4.5.8 nvidia-cuda-cupti-cu12 12.4.127 nvidia-cuda-nvrtc-cu12 12.4.127 nvidia-cuda-runtime-cu12 12.4.127 nvidia-cudnn-cu12 9.1.0.70 nvidia-cufft-cu12 11.2.1.3 nvidia-curand-cu12 10.3.5.147 nvidia-cusolver-cu12 11.6.1.9 nvidia-cusparse-cu12 12.3.1.170 nvidia-cutlass 3.5.1.0 /home/myles/cutlass nvidia-ml-py 12.560.30 nvidia-nccl-cu12 2.21.5 
nvidia-nvjitlink-cu12 12.4.127 nvidia-nvtx-cu12 12.4.127 nvidia-pyindex 1.0.9 open_clip_torch 2.26.1 openai 1.44.0 opencv-python 4.10.0.84 opencv-python-headless 4.10.0.84 openpyxl 3.1.5 opt-einsum 3.3.0 optree 0.12.1 orjson 3.10.7 overrides 7.7.0 packaging 24.1 pandas 2.2.2 pandocfilters 1.5.1 parso 0.8.4 pathlib2 2.3.7.post1 pathspec 0.12.1 pathvalidate 3.2.1 peft 0.12.0 pexpect 4.9.0 Pillow 10.1.0 pip 24.2 platformdirs 4.3.1 pluggy 1.5.0 ply 3.11 pooch 1.8.2 portalocker 2.10.1 pre-commit 3.8.0 prometheus_client 0.20.0 promise 2.3 prompt_toolkit 3.0.47 protobuf 4.25.4 psutil 6.0.0 ptyprocess 0.7.0 pure_eval 0.2.3 py 1.11.0 py-cpuinfo 9.0.0 py-spy 0.3.14 pyarrow 17.0.0 pyarrow-hotfix 0.6 pybind11 2.13.5 pycocoevalcap 1.2 pycocotools 2.0.8 pycparser 2.22 pycryptodomex 3.20.0 pydantic 2.9.0 pydantic_core 2.23.2 pydot 3.0.1 pydub 0.25.1 Pygments 2.18.0 PyJWT 2.9.0 pynndescent 0.5.13 pynvml 11.5.3 pyOpenSSL 24.2.1 pyparsing 3.1.4 pyproject-api 1.7.1 pytablewriter 1.2.0 pytest 8.3.2 python-consul 1.1.0 python-dateutil 2.9.0.post0 python-engineio 4.9.1 python-etcd 0.4.5 python-json-logger 2.0.7 python-multipart 0.0.9 python-socketio 5.11.4 pytorch-triton 3.0.0+757b6a61e7 pytz 2024.1 PyYAML 6.0.2 pyzmq 26.2.0 qtconsole 5.6.0 QtPy 2.4.1 rapidfuzz 3.9.7 referencing 0.35.1 regex 2024.7.24 requests 2.32.3 responses 0.25.3 rfc3339-validator 0.1.4 rfc3986-validator 0.1.1 rich 13.8.0 ring_flash_attn 0.1 /home/myles/ring-flash-attention rouge_score 0.1.2 rpds-py 0.20.0 ruff 0.6.4 sacrebleu 2.4.3 safetensors 0.4.5 schedule 1.2.2 scikit-learn 1.5.1 scipy 1.14.1 semantic-version 2.10.0 Send2Trash 1.8.3 sentencepiece 0.2.0 sentry-sdk 2.13.0 setproctitle 1.3.3 setuptools 70.2.0 shellingham 1.5.4 shortuuid 1.0.13 shtab 1.7.1 simple-websocket 1.0.0 six 1.16.0 smmap 5.0.1 sniffio 1.3.1 sounddevice 0.5.0 soundfile 0.12.1 soupsieve 2.6 soxr 0.5.0.post1 sqlitedict 2.1.0 stack-data 0.6.3 starlette 0.38.4 svgwrite 1.4.3 sympy 1.13.1 tabledata 1.3.3 tabulate 0.9.0 tcolorpy 0.1.6 tenacity 9.0.0 tensorboard 2.17.1 tensorboard-data-server 0.7.2 tensorflow 2.17.0 termcolor 2.4.0 terminado 0.18.1 threadpoolctl 3.5.0 thriftpy2 0.5.2 tiktoken 0.7.0 timm 1.0.9 tinycss2 1.3.0 tokenizers 0.19.1 toml 0.10.2 tomli 2.0.1 tomlkit 0.12.0 toolz 0.12.1 torch 2.5.0.dev20240907+cu124 torchaudio 2.5.0.dev20240907+cu124 torchvision 0.20.0.dev20240907+cu124 tornado 6.4.1 tox 4.18.1 tqdm 4.66.5 tqdm-multiprocess 0.0.11 traitlets 5.14.3 transformer-lens 2.4.1 transformers 4.45.0.dev0 /home/myles/transformers transformers-stream-generator 0.0.5 treelib 1.7.0 triton 3.0.0 typeguard 2.13.3 typepy 1.3.2 typer 0.12.5 types-python-dateutil 2.9.0.20240906 typing_extensions 4.12.2 tyro 0.8.10 tzdata 2024.1 uc-micro-py 1.0.3 umap-learn 0.5.6 Unidecode 1.3.8 uri-template 1.3.0 urllib3 2.2.2 uvicorn 0.30.6 virtualenv 20.26.4 wandb 0.17.9 watchdog 5.0.2 wavedrom 2.0.3.post3 wcwidth 0.2.13 webcolors 24.8.0 webencodings 0.5.1 websocket-client 1.8.0 websockets 12.0 Werkzeug 3.0.4 wheel 0.44.0 widgetsnbextension 4.0.13 wrapt 1.16.0 wsproto 1.2.0 xxhash 3.5.0 y-py 0.6.2 yarl 1.10.0 ypy-websocket 0.8.4 yt-dlp 2024.8.6 zipp 3.20.1 zss 1.2.0 zstandard 0.23.0
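For what it's worth, a quick way to check whether the installed transformers is new enough for Llama 3.1 checkpoints (my assumption is that 4.43 is the cutoff, the first release that understands the new rope_scaling block):

```python
# Quick version check; 4.43.0 as the minimum for Llama 3.1 is an assumption.
import transformers
from packaging import version

print("transformers", transformers.__version__)
if version.parse(transformers.__version__) < version.parse("4.43.0"):
    print("Probably too old for Llama 3.1; upgrade or install from source.")
```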

mylesgoose commented 2 months ago

> @mylesgoose Can you share your pretraining and finetuning code with Llama 3.1? Please push it to your repo. Were you facing a tokenization mismatch error while writing the preprocess function?

So, to try to help you out, I completely removed LLaVA-NeXT, downloaded the GitHub repo again, and installed the pip packages listed above (there is an issue with that version of tensorflow, so don't install it). I modified only that conversation.py to point to the pre-llava repo mentioned in my last message, and I ran the training with the synthdog_en JSON for only 5 steps. Because I used the model with the vision encoder already installed, it made things much simpler: Hugging Face loaded that abliterated-pre-llava repo directly from the hub, and it clearly used the correct tokenizer, because I tested the model at checkpoint 5 and it is producing coherent text.

LLM_VERSION="mylesgoose/Meta-Llama-3.1-8B-Instruct-goose-abliterated-pre-llava"
LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
VISION_MODEL_VERSION="google/siglip-so400m-patch14-384"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"

############### Pretrain ################

PROMPT_VERSION=llava_llama_3

BASE_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain"
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"
PRE_RUN_NAME="${BASE_RUN_NAME}-synthdog_en"
CKPT_PATH=$LLM_VERSION

accelerate launch llava/train/train_mem.py \
    --deepspeed scripts/zero3.json \
    --model_name_or_path ${CKPT_PATH} \
    --version ${PROMPT_VERSION} \
    --data_path ./data/synthdog_en/synthdog_en_processed.json \
    --image_folder ./data/synthdog_en \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --mm_vision_tower_lr=2e-6 \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --group_by_modality_length True \
    --image_aspect_ratio anyres \
    --image_grid_pinpoints "[(384, 768), (768, 384), (768, 768), (1152, 384), (384, 1152)]" \
    --mm_patch_merge_type spatial_unpad \
    --bf16 True \
    --output_dir "./checkpoints/${PRE_RUN_NAME}" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 6 \
    --per_device_eval_batch_size 0 \
    --gradient_accumulation_steps 6 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5 \
    --save_total_limit 2 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --gradient_checkpointing True \
    --dataloader_num_workers 2 \
    --lazy_preprocess True \
    --report_to wandb \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True \
    --attn_implementation flash_attention_2 \
    --run_name ${PRE_RUN_NAME}

Then I tested the model at checkpoint 5 with this quick script:

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

from PIL import Image
import requests
import copy
import torch

pretrained = "./checkpoints/llavanext-google_siglip-so400m-patch14-384-mylesgoose_Meta-Llama-3.1-8B-Instruct-goose-abliterated-pre-llava-mlp2x_gelu-pretrain_blip558k_plain-synthdog_en/checkpoint-5"
model_name = "llava_llama3"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Add any other thing you want to pass in llava_model_args

model.eval()
model.tie_weights()
image = Image.open("/home/myles/Desktop/extreme_ironing.jpg")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

conv_template = "llava_llama_3"
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image? Is there anything strange about this image? Is this normal behaviour."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]

cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=True,
    temperature=0.9,
    max_new_tokens=10000,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)

Output:

Loading checkpoint shards: 100%|████████████| 4/4 [00:08<00:00, 2.15s/it]
Model Class: LlavaLlamaForCausalLM
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Starting from v4.46, the logits model output will have the same type as the model (except at train time, where it will always be FP32)
['A man is ironing clothes on a yellow taxi cab, while it is parked in the middle of the road. This is the normal behavior, not the strange one.']
myles@ubuntu11:~/LLaVA-NeXT$

Bleking commented 2 months ago

Hi @mylesgoose, I hope you are doing well. I recently finetuned LLaVA-NeXT with your pretrained model. However, I was not able to get a good evaluation result.

python3 -m accelerate.commands.launch \
    --num_processes=4 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=/home/work/testdataset1/LLaVA-NeXT/checkpoints/results-NeXT/Meta-Llama-3.1-8B-Instruct-goose-abliterated-pre/checkpoint-7,conv_template=llava_llama_3,device=cuda \
    --tasks floorplan_test_wilder \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_next \
    --output_path ./logs/

This was my command for the evaluation, using "llava" as the 'model' and "llava_llama_3" for the 'conv_template' argument. I also had to copy the 'tokenizer.json' file into the "checkpoint-7" directory, since the evaluation process requires both the tokenizer and the config.
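Instead of copying tokenizer.json by hand, saving the base model's tokenizer into the checkpoint directory should also work (a sketch; the repo id below is only an example, use whichever base model the checkpoint was finetuned from):

```python
# Sketch: write the tokenizer files next to a checkpoint that only has weights,
# so lmms-eval can load everything from a single path.
from transformers import AutoTokenizer

base = "mylesgoose/Meta-Llama-3.1-8B-Instruct-goose-abliterated-pre-llava"  # example base repo
ckpt = "./checkpoints/results-NeXT/Meta-Llama-3.1-8B-Instruct-goose-abliterated-pre/checkpoint-7"
AutoTokenizer.from_pretrained(base).save_pretrained(ckpt)
```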

(screenshot of the evaluation results)

Since your pretrained model is based on Llama 3.1, I would like to check whether you used the same 'conv_template' as I did.

mylesgoose commented 2 months ago

I think you used the wrong tokenizer, the one for Llama 3.0. Do you want me to send you the Llama 3.1 version? If you look inside conversation.py, you can see it does not really load the tokenizer from the model directory; it loads it from a Hugging Face link and also sets parameters for the Llama 3.0 model (llava_llama_3 points to Meta Llama 3.0 8B). I also noticed you only did 7 steps, so that is very early in your run. I found a model yesterday that fits nicely into 24 GB of VRAM and 130 GB of CPU RAM for training, which might be fun for you as it will fit into your VRAM; I trained the vision part last night. I uploaded a chat template here: mylesgoose/Llama-3.1-Minitron-4B-Width-Base. I thought that model might be good for your RAM. The conversation.py is in the vision folder; it has llava_llama_3_1.
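If it helps, a small sketch to check which tokenizer and separator style a conversation template is actually wired to (attribute names follow the Conversation definitions shown in this thread):

```python
# Sketch: see what a conv template in llava/conversation.py actually points to.
from llava.conversation import conv_templates

for name in ("llava_llama_3", "qwen_1_5"):
    conv = conv_templates[name]
    print(name,
          "| sep_style:", conv.sep_style,
          "| tokenizer_id:", getattr(conv, "tokenizer_id", None))
# If tokenizer_id points at a Llama 3.0 repo, training and eval will silently
# use the 3.0 chat format even though the weights are Llama 3.1.
```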

Bleking commented 2 months ago

> I think you used the wrong tokenizer, the one for Llama 3.0. Do you want me to send you the Llama 3.1 version? [...] The conversation.py is in the vision folder; it has llava_llama_3_1.

Yes, please, that would be much appreciated.

Also, if you don't mind, may I get some advice on finetuning LLaVA-NeXT? I did not get similarly poor results when I evaluated my custom test data with the unfinetuned model, liuhaotian/llava-v1.5-7b, which LLaVA-NeXT uses as its default pretrained model according to llava.py in lmms-eval. So I still suspect my training script has some issues, and this might be why I constantly get a string of exclamation marks as the generated answers and a strange result at the end.

Thank you.

mylesgoose commented 2 months ago

I think you trained the model with the wrong tokenizer and chat prompt template: https://huggingface.co/mylesgoose/Llama-3.1-Minitron-4B-Width-Base/blob/main/vision/conversation.py

Bleking commented 2 months ago

> I think you trained the model with the wrong tokenizer and chat prompt template: https://huggingface.co/mylesgoose/Llama-3.1-Minitron-4B-Width-Base/blob/main/vision/conversation.py

Thank you for sharing this. What do I have to do with this file? I tried to read what you wrote above, but I had some difficulty understanding it. Should I just place the "Llama-3.1-Minitron-4B-Width-Base" folder in the same folder where I save my finetuning results? I guess I can simply use the 'mm_projector.bin' file you already gave me. For your information, my server now has four NVIDIA RTX 6000 GPUs, 320 GiB of RAM, and 32 CPU cores.

Let me share a picture of my checkpoint folder again to make things clear, so you can explain more easily. I am going to keep saving the results to "results-NeXT", and the current result is stored in "Meta-Llama-3.1-8B-Instruct-goose-abliterated-pre". (screenshot of the checkpoint folder)

mylesgoose commented 2 months ago

48 GB of VRAM each, times four: that's pretty good. I think you will be able to do a full train rather than LoRA. I had a lot of trouble with the train.py, the prompt template, and the conversation.py files that are in the llava repos. For example, if you select prompt version plain or llama_3, it uses the tokenizer from Meta Llama 3.0, not 3.1, and it also grabbed the chat template from that model, so I had to adjust the file I showed you above. With that file you can test whether the model is tokenizing properly with the chat template you have selected. Also, I trained a model today, that NVIDIA one, and it was outputting ### in front of its messages and ending its messages with ###, and I see in the train.py there is a preprocess step for each message in the JSON that adds those delimiters. It seems the model quickly learned to speak with those in its outputs. I think Llama is very sensitive to the inputs. For some reason it did this with one model but not another; something we need to figure out, I guess.
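One cheap sanity check before a long run is to render a single training conversation with the selected template and eyeball the separators (a rough sketch; the template names are the ones used earlier in this thread):

```python
# Sketch: print what one conversation looks like after templating, to spot
# unwanted "###"-style delimiters before they get baked into the model.
import copy
from llava.conversation import conv_templates

conv = copy.deepcopy(conv_templates["llava_llama_3"])  # or "llava_llama_3_1", "qwen_1_5", ...
conv.append_message(conv.roles[0], "<image>\nWhat rooms are in this floor plan?")
conv.append_message(conv.roles[1], "A kitchen, two bedrooms and a bathroom.")
print(repr(conv.get_prompt()))
# Expect <|start_header_id|>...<|eot_id|> blocks for Llama 3.x; if "###"
# separators show up here, the model will learn to emit them.
```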

mylesgoose commented 2 months ago

I think we need to modify the train.py to something like this:

def preprocess_llama_3_1(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
    conv = conversation_lib.default_conversation.copy()
    roles = {"human": conv.roles[0], "gpt": conv.roles[1]}

    # Apply prompt templates
    conversations = []
    for i, source in enumerate(sources):
        if roles[source[0]["from"]] != conv.roles[0]:
            # Skip the first one if it is not from human
            source = source[1:]

        conv.messages = []
        for j, sentence in enumerate(source):
            role = roles[sentence["from"]]
            assert role == conv.roles[j % 2], f"{i}"
            conv.append_message(role, sentence["value"])
        conversations.append(conv.get_prompt())

    # Tokenize conversations
    if has_image:
        input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations], dim=0)
    else:
        input_ids = tokenizer(
            conversations,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        ).input_ids

    # remove the first bos token
    if input_ids[0][0] == input_ids[0][1] == tokenizer.bos_token_id:
        input_ids = input_ids[:, 1:]
    targets = input_ids.clone()

    assert conv.sep_style == conversation_lib.SeparatorStyle.LLAMA_3_1

    # Mask targets
    sep = '<|start_header_id|>' + conv.roles[1] + '<|end_header_id|>' + '\n\n'
    # sep = conv.sep + conv.roles[1] + ": "
    for conversation, target in zip(conversations, targets):
        total_len = int(target.shape[0])

        rounds = conversation.split(conv.tokenizer.eos_token)
        rounds = [rounds[0]] + [rounds[idx] + rounds[idx+1] for idx in range(1, len(rounds)-1, 2)]

        cur_len = 1
        target[:cur_len] = IGNORE_INDEX
        for i, rou in enumerate(rounds):
            if rou == "":
                break

            parts = rou.split(sep)
            if len(parts) != 2 and i != 0:
                break

            if i == 0:
                round_len = len(tokenizer(rou, add_special_tokens=False).input_ids)
                instruction_len = len(tokenizer(rou, add_special_tokens=False).input_ids)
            else:
                parts[0] += sep
                if has_image:
                    round_len = len(tokenizer_image_token(rou, tokenizer)) + 1
                    instruction_len = len(tokenizer_image_token(parts[0], tokenizer))
                else:
                    round_len = len(tokenizer(rou).input_ids) + 1
                    instruction_len = len(tokenizer(parts[0]).input_ids)

            # if i > 0: round_len += 1
            target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
            cur_len += round_len

        target[cur_len:] = IGNORE_INDEX
        cur_len = cur_len + len(tokenizer(sep, add_special_tokens=False).input_ids)

        # if cur_len > tokenizer.model_max_length: print(f"WARNING: max length context")
        if cur_len < tokenizer.model_max_length:
            if cur_len != total_len:
                target[:] = IGNORE_INDEX
                print(
                    f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
                    f" (ignored)"
                )

    return dict(
        input_ids=input_ids,
        labels=targets,
    )
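One way to sanity-check a preprocess function like this is to decode only the tokens that stay unmasked and confirm that just the assistant turns are supervised (a rough sketch; the names mirror the snippet above and are not the repo's API):

```python
# Rough sketch of a label-masking check for the preprocess_* output above.
IGNORE_INDEX = -100  # same constant the training code uses for masked targets

def show_supervised_text(batch, tokenizer):
    input_ids, labels = batch["input_ids"][0], batch["labels"][0]
    kept = [tok for tok, lab in zip(input_ids.tolist(), labels.tolist())
            if lab != IGNORE_INDEX]
    print(tokenizer.decode(kept))
    # Only the assistant replies (plus their end-of-turn tokens) should print;
    # if user turns or header tokens appear, the round/offset math is off.
```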

mylesgoose commented 2 months ago

The issue is how the training applies the tokens etc. to the model we are training; it is confusing the model with ### and so on. Have a look here and here and see how the line delimiters are set up for Llama 3.1.

In your train.sh script, `export TOKENIZER_PATH=` your model name, i.e. mylesgoose/Meta.... In the conversation.py:

```python
elif self.sep_style == SeparatorStyle.LLAMA_3_1:
    chat_template_messages = [{"role": "system", "content": self.system}]
    for role, message in messages:
        if message:
            if type(message) is tuple:
                message, images = message
                message = "<image>" * len(images) + message
            chat_template_messages.append({"role": role, "content": message})

    return self.tokenizer.apply_chat_template(chat_template_messages, tokenize=False, add_generation_prompt=False)
```

```python
tokenizer_path = os.getenv("TOKENIZER_PATH")
llama_tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

conv_llava_llama_3_1 = Conversation(
    system="You are a helpful language and vision assistant. "
    "You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.",
    # roles=("<|start_header_id|>user", "<|start_header_id|>assistant"),
    roles=("user", "assistant"),
    version="llama_3_1",
    messages=[],
    offset=0,
    sep_style=SeparatorStyle.LLAMA_3_1,
    tokenizer=llama_tokenizer,
    stop_token_ids=[128009, 128008, 128001],
)
```

and in the train.py, add the preprocess_llama_3_1 function shown in my previous comment.

also in the train.py:

```python
def preprocess(sources: Sequence[str], tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
    """
    Given a list of sources, each is a conversation list. This transform:
    1. Add signal '### ' at the beginning each sentence, with end signal '\n';
    2. Concatenate conversations together;
    3. Tokenize the concatenated conversation;
    4. Make a deepcopy as the target. Mask human words with IGNORE_INDEX.
    """
    if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.PLAIN:
        return preprocess_plain(sources, tokenizer)
    if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.LLAMA_3:
        return preprocess_llama_3(sources, tokenizer, has_image=has_image)
    if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.LLAMA_3_1:
        return preprocess_llama_3_1(sources, tokenizer, has_image=has_image)
```
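
For context, those branches are only reached through the --version flag you pass in the training script; train.py sets the default conversation template from it, roughly like this (paraphrased from memory, not the exact repo code):

```python
# Paraphrase of how train.py selects the template from --version (not the exact repo code).
if model_args.version in conversation_lib.conv_templates:
    conversation_lib.default_conversation = conversation_lib.conv_templates[model_args.version]
else:
    conversation_lib.default_conversation = conversation_lib.conv_templates["vicuna_v1"]
```

So the llama_3_1 path above is only used if the template you pass resolves to the LLAMA_3_1 separator style.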

Bleking commented 2 months ago

```python
tokenizer_path = os.getenv("TOKENIZER_PATH")
llama_tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

conv_llava_llama_3_1 = Conversation(
    system="You are a helpful language and vision assistant. "
    "You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.",
    # roles=("<|start_header_id|>user", "<|start_header_id|>assistant"),
    roles=("user", "assistant"),
    version="llama_3_1",
    messages=[],
    offset=0,
    sep_style=SeparatorStyle.LLAMA_3_1,
    tokenizer=llama_tokenizer,
    stop_token_ids=[128009, 128008, 128001],
)
```

It looks like this code snippet you shared here is different from that on your HuggingFace page. Is this snippet here how you are asking me to edit the 'conversation.py' code?

mylesgoose commented 2 months ago

You have to choose which method you are going to use to correct the original conversation.py and the train.py to handle the Llama 3.1 model. If you do nothing and choose llama_v3 as the version string, it will use the Llama 3.0 tokenizer and separator. There are three more methods I have found to achieve that. Probably the simplest hack is to use the original conversation.py and modify this block:

```python
conv_llava_llama_3 = Conversation(
    system="You are a helpful language and vision assistant. "
    "You are able to understand the visual content that the user provides, "
    "and assist the user with a variety of tasks using natural language.",
    roles=("user", "assistant"),
    version="llama_v3",
    messages=[],
    offset=0,
    sep="<|eot_id|>",
    sep_style=SeparatorStyle.LLAMA_3,
    tokenizer_id="mylesgoose/Llama-3.1-Minitron-4B-Width-Base",
    tokenizer=safe_load_tokenizer("mylesgoose/Llama-3.1-Minitron-4B-Width-Base"),
    stop_token_ids=[128009],
)
```

Then in your training script you should train with prompt version llava_llama_3.

Bleking commented 2 months ago

You have to choose which method you are going to use to correct the original conversation.py and the train.py to handle the Llama 3.1 model. [...]

Alright. I have successfully finetuned LLaVA-NeXT with your suggested version of "conversation.py". However, I would like to get something confirmed before I start evaluating the model.

[screenshot: the config.json and tokenizer.json files saved in separate output directories]

Is it normal for the "config.json" and "tokenizer.json" files to be in different directories like this? I always have to manually move the "config.json" file to the "checkpoint-7" folder, because when I run the lmms-eval evaluation command, the process halts if either of them is missing from the designated directory.
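
For reference, the manual step I do boils down to something like this (paths are just my local layout):

```python
# Copy the top-level config.json into the checkpoint folder so lmms-eval finds both files.
import shutil

run_dir = "./checkpoints/Meta-Llama-3.1-8B-Instruct-goose-abliterated-pre"
shutil.copy(f"{run_dir}/config.json", f"{run_dir}/checkpoint-7/config.json")
```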

Also, this is what the config.json file produced by the finetuning looks like. Please confirm whether there is anything wrong, such as the '_name_or_path' value:

{
  "_name_or_path": "mylesgoose/Meta-Llama-3.1-8B-Instruct-goose-abliterated",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "hidden_act": "silu",
  "hidden_size": 4096,
  "image_aspect_ratio": "anyres",
  "image_crop_resolution": null,
  "image_grid_pinpoints": [
    [
      384,
      768
    ],
    [
      768,
      384
    ],
    [
      768,
      768
    ],
    [
      1152,
      384
    ],
    [
      384,
      1152
    ]
  ],
  "image_split_resolution": null,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "mm_hidden_size": 1152,
  "mm_newline_position": "one_token",
  "mm_patch_merge_type": "spatial_unpad",
  "mm_projector_lr": 2e-05,
  "mm_projector_type": "mlp2x_gelu",
  "mm_resampler_type": null,
  "mm_spatial_pool_mode": "bilinear",
  "mm_tunable_parts": "mm_vision_tower,mm_mlp_adapter,mm_language_model",
  "mm_use_im_patch_token": false,
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "google/siglip-so400m-patch14-384",
  "mm_vision_tower_lr": 2e-06,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pos_skipping_range": 4096,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "tokenizer_model_max_length": 512,
  "tokenizer_padding_side": "right",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.44.2",
  "use_cache": true,
  "use_mm_proj": true,
  "use_pos_skipping": false,
  "vision_tower_pretrained": null,
  "vocab_size": 128256
}

Was it supposed to be "mylesgoose/Llama-3.1-Minitron-4B-Width-Base", not "mylesgoose/Meta-Llama-3.1-8B-Instruct-goose-abliterated"?

I will also share with you my script for your understanding.

LLM_VERSION="mylesgoose/Meta-Llama-3.1-8B-Instruct-goose-abliterated"
LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
VISION_MODEL_VERSION="google/siglip-so400m-patch14-384"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"

############### Pretrain ################

PROMPT_VERSION=plain

BASE_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain"
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"

CKPT_PATH=$LLM_VERSION

deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 32 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed scripts/zero3_offload.json \
    --model_name_or_path ${CKPT_PATH} \
    --version ${PROMPT_VERSION} \
    --data_path ./playground/floorplan_data/floorplan_vqa_1000.json \
    --image_folder ./playground/floorplan_data \
    --pretrain_mm_mlp_adapter="./checkpoints/projectors/llavanext-google_siglip-so400m-patch14-384-mylesgoose_Meta-Llama-3.1-8B-Instruct-goose-abliterated-mlp2x_gelu-pretrain_blip558k_plain/checkpoint-1500/mm_projector.bin" \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --mm_vision_tower_lr=2e-6 \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --group_by_modality_length True \
    --image_aspect_ratio anyres \
    --image_grid_pinpoints "[(384, 768), (768, 384), (768, 768), (1152, 384), (384, 1152)]" \
    --mm_patch_merge_type spatial_unpad \
    --fp16 True \
    --output_dir "./checkpoints/Meta-Llama-3.1-8B-Instruct-goose-abliterated-pre" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 2 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 512 \
    --gradient_checkpointing True \
    --dataloader_num_workers 2 \
    --lazy_preprocess True \
    --report_to wandb \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True \
    --attn_implementation sdpa \
    --run_name llavanext-siglip-400m-Meta-Llama-3.1-8B-pretrain_blip558k_plain \

Plus, this is my command:

python3 -m accelerate.commands.launch --num_processes=4 -m lmms_eval --model llava --model_args pretrained=/home/work/testdataset1/LLaVA-NeXT/checkpoints/results-NeXT/Meta-Llama-3.1-8B-Instruct-goose-abliterated-pre/checkpoint-7,conv_template=llava_llama_3,device=cuda --tasks floorplan_test_wilder --batch_size 1 --log_samples --log_samples_suffix llava_next --output_path ./logs/

To summarise, I would like to confirm two things with you: 1) whether it is normal for the 'config.json' and 'tokenizer.json' files to be placed separately, and 2) whether I set the LLM_VERSION value correctly.

Thank you.

mylesgoose commented 2 months ago

There are quite a few problems there. I also forgot to mention that there are three stop token ids for Llama 3.1, not one, like this:

conv_llava_llama_3 = Conversation(
    system="You are a helpful language and vision assistant. " "You are able to understand the visual content that the user provides, " "and assist the user with a variety of tasks using natural language.",
    roles=("user", "assistant"),
    version="llama_v3",
    messages=[],
    offset=0,
    sep="<|eot_id|>",
    sep_style=SeparatorStyle.LLAMA_3,
    tokenizer_id="mylesgoose/Llama-3.1-Minitron-4B-Width-Base",
    tokenizer=safe_load_tokenizer("mylesgoose/Llama-3.1-Minitron-4B-Width-Base"),
    stop_token_ids=[128009, 128008, 128001],
)

The tokenizer id is adjusted depending on what model you are training as a base. For example, I am currently training that Minitron model, so I have it loaded there; you are training the mylesgoose abliterated model, so you have that model version string there. It's normal to have to move the files since you are doing a LoRA train, and sometimes it does not save the correct files. But I don't like doing LoRA training unless I want to test something; I prefer to train the entire model, which I am pretty sure you can do if you have 192 GB of VRAM.

What does your zero3.json look like? Why are you doing 32 gradient accumulation steps? That is a lot. I recommend changing this

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },

to this

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
and if you get VRAM crashes, also try offloading the parameters to the CPU by changing that device from "none" to "cpu". You are only training an 8B model; you have more than enough VRAM.
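
For context, the effective global batch size is per_device_train_batch_size x gradient_accumulation_steps x the number of GPUs, so 1 x 32 x 4 is 128 if you are running on all four cards, while the 2 x 6 x 4 in the script below comes out to 48.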

You are also training your model with anyres while your image grid pinpoints are the older ones. Try something like this:
``` bash
LLM_VERSION="mylesgoose/Meta-Llama-3.1-8B-Instruct-goose-abliterated"
LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
VISION_MODEL_VERSION="google/siglip-so400m-patch14-384"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"

############### Pretrain ################

PROMPT_VERSION=llava_llama_3

BASE_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain"
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"

deepspeed llava/train/train_mem.py \
     --deepspeed scripts/zero3.json \
    --model_name_or_path ${CKPT_PATH} \
    --version ${PROMPT_VERSION} \
    --data_path ./json/LLaVA-NeXT-Datap.json \
    --image_folder ./data/images \
    --video_folder ./data/videos \
    --pretrain_mm_mlp_adapter "PATH TO YOUR BIN FILE" \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --mm_vision_tower_lr=2e-6 \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --group_by_modality_length True \
    --image_aspect_ratio anyres_max_9 \
    --image_grid_pinpoints  "(1x1),...,(6x6)" \
    --mm_patch_merge_type spatial_unpad \
    --bf16 True \
    --run_name $MID_RUN_NAME \
    --output_dir "./checkpoints/${MID_RUN_NAME}" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 0 \
    --gradient_accumulation_steps 6 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50 \
    --save_total_limit 2 \
    --learning_rate 3e-5 \
    --weight_decay 0.01 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 31072 \
    --gradient_checkpointing True \
    --dataloader_num_workers 6 \
    --lazy_preprocess True \
    --report_to wandb \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True \
    --frames_upbound 32 \
    --attn_implementation flash_attention_2 \
```

and if you get cuda out of memory errors then reduce the max length to 8192 or something.

mylesgoose commented 2 months ago

For your second question: you can write whatever you want to call your model on Hugging Face in that config field, but leave it as the base model until you have uploaded the full model to your Hugging Face account and have a name to put there.

Bleking commented 2 months ago

There are quite a few problems there. I also forgot to mention that there are three stop token ids for Llama 3.1, not one. [...]

Well, I have been using zero3_offload so far, as I constantly got OOM errors whenever I used zero3. I guess I am just going to stick to LoRA finetuning because it is generally more efficient.

Let me show you what my zero3_offload looks like; I created a new file named "zero3_offload_new" in case I might have to use the original one.

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 5e7,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": 15099494, 
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "steps_per_print": 1e5,
    "wall_clock_breakdown": false
}
Bleking commented 2 months ago

For your second question: you can write whatever you want to call your model on Hugging Face in that config field, but leave it as the base model until you have uploaded the full model to your Hugging Face account and have a name to put there.

And whether I use "mylesgoose/Meta-Llama-3.1-8B-Instruct-goose-abliterated" or "mylesgoose/Llama-3.1-Minitron-4B-Width-Base" in the "conversation.py" code, I guess I have to set the 'mm_projector' to the matching one, right?

Well, since I already ran the evaluation about an hour ago and the GPUs are busy at the moment, I will let you know what I get with your suggestion later.

mylesgoose commented 2 months ago

For your second question: you can write whatever you want to call your model on Hugging Face in that config field, but leave it as the base model until you have uploaded the full model to your Hugging Face account and have a name to put there.

And whether I use "mylesgoose/Meta-Llama-3.1-8B-Instruct-goose-abliterated" or "mylesgoose/Llama-3.1-Minitron-4B-Width-Base" in the "conversation.py" code, I guess I have to set the 'mm_projector' to the matching one, right?

I think you should play around with that smaller model, the 4B one, but it's up to you. Use whatever tokenizer goes with your model; that conversation.py only loads the tokenizer and config from Hugging Face, and I think both those models use the same tokenizer, except I added some reflection/thinking tags into the smaller model's tokenizer so it can learn about those tags. You have to use the bin file depending on what model you are loading to train, unless it's a saved model, in which case it outputs with the bin file inside the model.

mylesgoose commented 2 months ago

That is ZeRO-3 offload: in "zero_optimization", the "offload_optimizer" block has "device": "cpu", so you can see it is already offloading the optimizer to the CPU ("pin_memory": true).

Bleking commented 2 months ago

Hi @mylesgoose. Thank you for sharing your code with me again; I really appreciate it. But I have another question. You recommended that I use mylesgoose/Llama-3.1-Minitron-4B-Width-Base. Did you mean that I have to set the tokenizer to Llama-3.1-Minitron-4B-Width-Base, or set the LLM_VERSION to this model?

I am asking because when I use Llama-3.1-Minitron-4B-Width-Base for the LLM_VERSION, and "mylesgoose/Llama-3.1-Minitron-4B-Width-Base" for the 'tokenizer' and 'tokenizer_id' in the conversation.py file, I keep getting a size mismatch error.

Were you telling me to use Llama-3.1-Minitron-4B-Width-Base as my tokenizer, not as the LLM_VERSION? Or have I set something wrong in either of them?

And for your information, I will have to use two GPUs with 20 CPU cores and 240GiB RAM for a while since there is another co-user of the server who needs to use two GPUs.

Script

LLM_VERSION="mylesgoose/Llama-3.1-Minitron-4B-Width-Base"
LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
VISION_MODEL_VERSION="google/siglip-so400m-patch14-384"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"

############### Pretrain ################

PROMPT_VERSION=llava_llama_3

BASE_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain"
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"

CKPT_PATH=$LLM_VERSION

deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 32 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed scripts/zero3_offload_new.json \
    --model_name_or_path ${CKPT_PATH} \
    --version ${PROMPT_VERSION} \
    --data_path /home/work/testdataset1/LLaVA-NeXT-old/playground/floorplan_data/floorplan_vqa_1000.json \
    --image_folder /home/work/testdataset1/LLaVA-NeXT-old/playground/floorplan_data \
    --pretrain_mm_mlp_adapter="./checkpoints/projectors/Llama-3.1-Minitron-4B-Width-Base/vision/mm_projector.bin" \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --mm_vision_tower_lr=2e-6 \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --group_by_modality_length True \
    --image_aspect_ratio anyres_max_9 \
    --image_grid_pinpoints "(1x1),...,(6x6)" \
    --mm_patch_merge_type spatial_unpad \
    --fp16 True \
    --output_dir "./checkpoints/Llama-3.1-Minitron-4B-Width-Base" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 2 \
    --learning_rate 3e-5 \
    --weight_decay 0.01 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 1024 \
    --gradient_checkpointing True \
    --dataloader_num_workers 2 \
    --lazy_preprocess True \
    --report_to wandb \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True \
    --frames_upbound 32 \
    --attn_implementation sdpa \
    --run_name llavanext-siglip-400m-Meta-Llama-3.1-8B-pretrain_blip558k_plain \

conversation.py snippet

conv_llava_llama_3 = Conversation(
    system="You are a helpful language and vision assistant. " "You are able to understand the visual content that the user provides, " "and assist the user with a variety of tasks using natural language.",
    roles=("user", "assistant"),
    version="llama_v3",
    messages=[],
    offset=0,
    sep="<|eot_id|>",
    sep_style=SeparatorStyle.LLAMA_3,
    tokenizer_id="mylesgoose/Llama-3.1-Minitron-4B-Width-Base", 
    tokenizer=safe_load_tokenizer("mylesgoose/Llama-3.1-Minitron-4B-Width-Base"), 
    stop_token_ids=[128009, 128008, 128001], 
)
Bleking commented 2 months ago

I just ran the following code with the original model in order to check if the model and the tokenizer are loaded properly but I still got the same size mismatch error.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nvidia/Llama-3.1-Minitron-4B-Width-Base")
model = AutoModel.from_pretrained("nvidia/Llama-3.1-Minitron-4B-Width-Base")

print("Tokenizer vocab size:", tokenizer.vocab_size)
print("Model embedding size:", model.config.vocab_size)

The 'model' variable itself had an error while being loaded, showing size mismatches, but the tokenizer loaded properly. I guess the model won't be able to be used as the LLM of LLaVA-NeXT.

But still thank you for the help! :D

mylesgoose commented 2 months ago

You should set the tokenizer to the model you're training.

mylesgoose commented 2 months ago

I was able to train that model: mylesgoose/Llama-3.1-Minitron-4B-Llava-Nvidia-siglip-ov

Bleking commented 2 months ago

I just upgraded transformers to 4.45.1 using pip install -U transformers and now I am able to use it as the model. Thank you.

Bleking commented 2 months ago

Hello again, @mylesgoose. I would like to ask for your help again as you are more likely to know more about the LLMs you suggested to me.

I think I figured out the reason for getting a string of exclamation marks as the generated response of this model, which is also the case for LLaVA-OneVision-Qwen-7B in my setup. During the finetuning process, including when I do not use LoRA, the loss always stays the same, which means the code is running on the GPUs while the model is not actually being trained.

Below is what happened during the full finetuning; I halted the process because the loss rarely seemed to decrease, and this has kept happening to me since LLaVA-NeXT. I did not have such an issue with LLaVA-v1.6-13B.

{'loss': 18.7233, 'grad_norm': 0.0, 'learning_rate': 0.0, 'epoch': 0.13}                                                                                                              
{'loss': 18.7233, 'grad_norm': 0.0, 'learning_rate': 0.0, 'epoch': 0.25}                                                                                                              
 29%|█████████████████████████████████████████▋                                                                                                        | 2/7 [11:25<28:30, 342.01s/it]^C[2024-09-29 09:58:10,606] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 662624
Traceback (most recent call last):
  File "/home/work/anaconda3/envs/llava-next/bin/deepspeed", line 6, in <module>
    main()
  File "/home/work/anaconda3/envs/llava-next/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 584, in main
    result.wait()
  File "/home/work/anaconda3/envs/llava-next/lib/python3.10/subprocess.py", line 1209, in wait
    return self._wait(timeout=timeout)
  File "/home/work/anaconda3/envs/llava-next/lib/python3.10/subprocess.py", line 1959, in _wait
    (pid, sts) = self._try_wait(0)
  File "/home/work/anaconda3/envs/llava-next/lib/python3.10/subprocess.py", line 1917, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

I know I have to debug the process to find out what prevents the model from being finetuned, but I am having difficulty figuring out exactly how to debug it. Therefore, I would like to ask you if there is anything missing in my setting that is likely to contribute to my current issue. I have to solve this training issue first so I can move on to training the next versions of the LLaVA series.

Let me share with you my script that uses your mylesgoose/Llama-3.1-Minitron-4B-Width-Base, and the corresponding snippets from 'conversation.py' and 'train.py'. I utilised the so-called simplest hack for using Llama 3.1 by editing the conv_llava_llama_3 variable, and I don't think I really did anything to 'train.py'.

You have to choose which method you are going to use to correct the original conversation.py and the train.py to handle the Llama 3.1 model. [...]

script

LLM_VERSION="mylesgoose/Llama-3.1-Minitron-4B-Width-Base"
LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
VISION_MODEL_VERSION="google/siglip-so400m-patch14-384"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"

############### Pretrain ################

PROMPT_VERSION=llava_llama_3

BASE_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain"
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"

CKPT_PATH=$LLM_VERSION

deepspeed llava/train/train_mem.py \ 
    --deepspeed scripts/zero3_new.json \
    --model_name_or_path ${CKPT_PATH} \
    --version ${PROMPT_VERSION} \
    --data_path ./playground/floorplan_data/floorplan_vqa_1000.json \
    --image_folder ./playground/floorplan_data \
    --pretrain_mm_mlp_adapter="./checkpoints/projectors/Llama-3.1-Minitron-4B-Width-Base/vision/mm_projector.bin" \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --mm_vision_tower_lr=2e-6 \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --group_by_modality_length True \
    --image_aspect_ratio anyres_max_9 \
    --image_grid_pinpoints "(1x1),...,(6x6)" \
    --mm_patch_merge_type spatial_unpad \
    --fp16 True \
    --output_dir "./checkpoints/Llama-3.1-Minitron-4B-Width-Base" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 2 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 512 \
    --gradient_checkpointing True \
    --dataloader_num_workers 2 \
    --lazy_preprocess True \
    --report_to wandb \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True \
    --frames_upbound 32 \
    --attn_implementation sdpa \
    --run_name llavanext-siglip-400m-Meta-Llama-3.1-Minitron-4B-pretrain_blip558k_plain \

zero3_new.json

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

conversation.py

conv_llava_llama_3 = Conversation(
    system="You are a helpful language and vision assistant. " "You are able to understand the visual content that the user provides, " "and assist the user with a variety of tasks using natural language.",
    roles=("user", "assistant"),
    version="llama_v3",
    messages=[],
    offset=0,
    sep="<|eot_id|>",
    sep_style=SeparatorStyle.LLAMA_3,
    tokenizer_id="mylesgoose/Llama-3.1-Minitron-4B-Width-Base",  # tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer=safe_load_tokenizer("mylesgoose/Llama-3.1-Minitron-4B-Width-Base"),  # tokenizer=safe_load_tokenizer("meta-llama/Meta-Llama-3-8B-Instruct"),
    stop_token_ids=[128009, 128008, 128001],  # stop_token_ids=[128009],
)

train.py

def preprocess_llama3(
    sources,
    tokenizer: transformers.PreTrainedTokenizer,
    has_image: bool = False,
    max_len=2048,
    system_message: str = "You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.",
) -> Dict:
    # roles = {"human": "<|start_header_id|>user<|end_header_id|>", "gpt": "<|start_header_id|>assistant<|end_header_id|>"}
    roles = {"human": "user", "gpt": "assistant"}

    # Add image tokens to tokenizer as a special tokens
    # Use a deepcopy of tokenizer so that we don't modify on the tokenizer
    tokenizer = copy.deepcopy(tokenizer)
    # When there is actually an image, we add the image tokens as a special token
    if has_image:
        tokenizer.add_tokens(["<image>"], special_tokens=True)
    image_token_index = tokenizer.convert_tokens_to_ids("<image>")
    bos_token_id = tokenizer.convert_tokens_to_ids("<|begin_of_text|>")
    start_header_id = tokenizer.convert_tokens_to_ids("<|start_header_id|>")
    end_header_id = tokenizer.convert_tokens_to_ids("<|end_header_id|>")
    eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")

    unmask_tokens = ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "\n\n"]
    unmask_tokens_idx = [tokenizer.convert_tokens_to_ids(tok) for tok in unmask_tokens]

    # After update, calling tokenizer of llama3 will
    # auto add bos id for the tokens. ヽ(`⌒´)ノ
    def safe_tokenizer_llama3(text):
        input_ids = tokenizer(text).input_ids
        if input_ids[0] == bos_token_id:
            input_ids = input_ids[1:]
        return input_ids

    nl_tokens = tokenizer.convert_tokens_to_ids("\n\n")
    # Apply prompt templates
    input_ids, targets = [], []
    for i, source in enumerate(sources):
        if roles[source[0]["from"]] != roles["human"]:
            source = source[1:]

        input_id, target = [], []

        # New version, use apply chat template
        # Build system message for each sentence
        input_id += tokenizer.apply_chat_template([{"role" : "system", "content" : system_message}])
        target += [IGNORE_INDEX] * len(input_id)

        for conv in source:
            # Make sure llava data can load
            try:
                role = conv["role"]
                content = conv["content"]
            except:
                role = conv["from"]
                content = conv["value"]

            role =  roles.get(role, role)

            conv = [{"role" : role, "content" : content}]
            # First is bos token we don't need here
            encode_id = tokenizer.apply_chat_template(conv)[1:]
            input_id += encode_id
            if role in ["user", "system"]:
                target += [IGNORE_INDEX] * len(encode_id)
            else:
                target += encode_id

        assert len(input_id) == len(target), f"{len(input_id)} != {len(target)}"
        for idx, encode_id in enumerate(input_id):
            if encode_id in unmask_tokens_idx:
                target[idx] = encode_id
            if encode_id == image_token_index:
                input_id[idx] = IMAGE_TOKEN_INDEX
        input_ids.append(input_id)
        targets.append(target)
    input_ids = torch.tensor(input_ids, dtype=torch.long)
    targets = torch.tensor(targets, dtype=torch.long)

    # print(f"Input IDs: {input_ids}")
    # print(f"Targets: {targets}")
    # print('----------')

    return dict(
        input_ids=input_ids,  # tensor(bs x seq_len)
        labels=targets,  # tensor(bs x seq_len)
    )

Again, for your information, I am currently using two NVIDIA RTX6000 GPUs with RAM size 240.00GiB and 20 CPU cores.

With respect to these files I have mentioned, please let me know what could possibly prevent the model from being trained (the loss staying at 18.7233) and whether I am missing anything. I am not sure if it is due to a tokenization mismatch or something else. I did solve the tokenization mismatch issue for LLaVA-v1.6-34B, but this case looks different.
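
In the meantime, this is the kind of sanity check I am planning to run on one preprocessed batch (IGNORE_INDEX = -100 and IMAGE_TOKEN_INDEX = -200 are my assumptions, taken from LLaVA's constants):

```python
# My own debugging sketch, not from the repo: check that the labels of one
# preprocessed sample are not fully masked, otherwise the loss cannot move.
IGNORE_INDEX = -100        # assumed, as in LLaVA's constants
IMAGE_TOKEN_INDEX = -200   # assumed, as in LLaVA's constants

def inspect_sample(batch, tokenizer):
    input_ids = batch["input_ids"][0]
    labels = batch["labels"][0]
    supervised = (labels != IGNORE_INDEX).sum().item()
    print(f"sequence length: {len(input_ids)}, supervised tokens: {supervised}")
    # Decode only real vocabulary ids (skip the image placeholder) to see the prompt text.
    print(tokenizer.decode([t for t in input_ids.tolist() if t != IMAGE_TOKEN_INDEX]))
```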

Thank you.

mylesgoose commented 1 month ago

I use flash attn and gradient accumulation steps of less than 6, but yes, it's not normal for the loss to be 18. It sounds like an issue with the stop token or padding. I will have a look when I'm at home.

Bleking commented 1 month ago

Yes, I do think it must be a problem with the token settings or padding. I tried gradient_accumulation_steps of 6 as well, but it only made the finetuning process longer without reducing the loss. Neither do I think it is an image size issue, because LLaVA-v1.5-13B and LLaVA-v1.6-34B train well with the dataset. Do you see anything in my code that could possibly contribute to my loss problem?

Bleking commented 1 month ago

Hi. I just tried it with 3 epochs and the loss started dropping from 18.7233 to 11.4435 at about epoch 1.19, and ended up at 7.0317. I am glad it is decreasing, but not very comfortable, as that is not a low value.

Were you fine with only 1 epoch for the Llama-3.1-Minitron-4B-Width-Base model? I would still be glad if you could suggest some other things to try as well, including the token or padding issues! Maybe I can try increasing the learning rate or training longer once the evaluation is over.

mylesgoose commented 1 month ago

Normally after about 100 steps the loss is about 0.7, so something with your setup is not right; I don't know exactly what. I would start with a fresh model and make the bin file, then do the full pretrain, which will only take perhaps one day. Set up the llava repo to standard and change that configuration .py file to point to your new repo. You could even clone a Hugging Face repo to your own account, and then you can change parameters on Hugging Face. Try setting the defaults for everything to the Llama 3 configuration, except pointing to that new repo location of yours. Then we can see what's going on.

Bleking commented 1 month ago

My custom dataset's size is 1000, and there was no issue finetuning LLaVA-v1.5-13B and LLaVA-v1.6-34B with it. However, with the finetuned LLaVA-NeXT we have been talking about, the generated answers are always a repetition of a string ("Cara"s in my case), so I constantly get "Sorry! We've encountered an issue with repetitive patterns in your prompt. Please try again with a different prompt." messages when I get the result of "response = requests.post(API_URL, headers=headers, json=payload, timeout=60)" in lmms-eval.

How big was your dataset? I am starting to wonder if LLaVA-NeXT and its successors are too complex for the data, causing overfitting.

Maybe I should just train longer or increase the learning rate, hoping to decrease the loss further below 7.

I decided not to increase the size of the VQA dataset; I need to train several LLaVA versions with it for a paper, and I have no time to enlarge it.

mylesgoose commented 1 month ago

Sounds like a chat template issue. Print the chat template for the model and the token IDs when running it, and ensure it is emitting <|start_header_id|>, <|eot_id|>, etc.
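
Something along these lines should be enough to check it (a sketch; swap in whatever base model you are actually training):

```python
# Quick template/token check (sketch; adjust the model id to your base model).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mylesgoose/Llama-3.1-Minitron-4B-Width-Base")
msgs = [
    {"role": "system", "content": "You are a helpful language and vision assistant."},
    {"role": "user", "content": "Describe the floor plan."},
    {"role": "assistant", "content": "It has three bedrooms."},
]
print(tok.apply_chat_template(msgs, tokenize=False))             # should show <|start_header_id|> ... <|eot_id|>
print(tok.convert_ids_to_tokens(tok.apply_chat_template(msgs)))  # the actual tokens being fed in
```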

mylesgoose commented 1 month ago

I trained one model with the 600,000-sample LLaVA-NeXT 1.6 dataset and one with the 1-million-sample Open-LLaVA-NeXT dataset, but it can be done with 1000, and quite quickly. Your issue sounds to me like the training JSON data is not being formatted into the appropriate format for the model at training time.
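
For reference, each record in the training json should look roughly like this (my sketch of the usual LLaVA layout; the id and paths are made up):

```python
# Rough shape of one training record (shown as a Python literal; the actual file is JSON).
record = {
    "id": "floorplan_0001",                      # made-up id
    "image": "SPA/floorplan_0001.png",           # made-up path, relative to --image_folder
    "conversations": [                           # five QA pairs = ten alternating turns
        {"from": "human", "value": "<image>\nHow many bedrooms does this floor plan have?"},
        {"from": "gpt", "value": "It has three bedrooms."},
        {"from": "human", "value": "Where is the kitchen?"},
        {"from": "gpt", "value": "The kitchen is next to the living room."},
    ],
}
```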

mylesgoose commented 1 month ago

@Bleking just thought I would tell you that there is a script to train the Llama 3.2 11B directly, and it's working on a 24 GB RTX, so it will certainly work on yours: git@github.com:meta-llama/llama-recipes.git


```bash
torchrun --nnodes 1 --nproc_per_node 8 recipes/quickstart/finetuning/finetuning.py \
    --enable_fsdp \
    --lr 1e-5 \
    --num_epochs 1 \
    --batch_size_training 1 \
    --model_name meta-llama/Llama-3.2-11B-Vision-Instruct \
    --dist_checkpoint_root_folder ./finetuned_model \
    --dist_checkpoint_folder fine-tuned \
    --use_fast_kernels \
    --dataset "custom_dataset" \
    --custom_dataset.test_split "test" \
    --custom_dataset.file "/home/myles/llama-recipes/recipes/quickstart/finetuning/datasets/ocrvqa_dataset.py" \
    --run_validation True \
    --batching_strategy padding \
    --use_wandb True
```

It's pretty simple.
Bleking commented 1 month ago

@Bleking just thought I would tell you that there is a script to train the Llama 3.2 11B directly, and it's working on a 24 GB RTX, so it will certainly work on yours: git@github.com:meta-llama/llama-recipes.git [...]

Hi. Thank you for sharing this with me. Let me give it a try soon.

And I just wanted to give you an update: by increasing the epochs to 8 (and even 10), the "Sorry! We've encountered an issue with repetitive patterns in your prompt. Please try again with a different prompt." message is gone; it was caused by the repetition of "Cara"s in the generated responses. However, I still get irrelevant generated responses from the finetuned model, which are a single string, "Cara". The loss decreased to around 0.3 with epoch 10.

{'loss': 0.3095, 'grad_norm': 17.053686141967773, 'learning_rate': 2.3125191111135387e-06, 'epoch': 8.96}                                                          
{'train_runtime': 12414.799, 'train_samples_per_second': 0.81, 'train_steps_per_second': 0.006, 'train_loss': 5.67327256160123, 'epoch': 8.96}                     
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 70/70 [3:26:53<00:00, 177.34s/it]

I have not tried printing the template and token ids like you told me, but what I can tell you now is that my VQA dataset (chat template) has five QA pairs per image, making a total of 5000 QA pairs. I used the same template as for LLaVA-v1.5 and v1.6; although I might have to edit the format, I have to keep the five conversations per image.

monologue1107 commented 12 hours ago


However, I keep running into size mismatch errors and I am stuck.

Could this be due to a size difference between my custom dataset and the pretrained model?
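
For context, the mlp2x_gelu projector appears to be a two-layer MLP roughly like the sketch below (an illustration, not the repository's exact builder code; the sizes are assumptions taken from the CLIP ViT-L/14-336 feature dimension and from the "current model" shapes in the error log below):

import torch.nn as nn

vision_hidden_size = 1024  # CLIP ViT-L/14-336 feature dimension
llm_hidden_size = 3584     # matches the "current model" shapes reported in the error below

# Rough sketch of an mlp2x_gelu projector: vision features -> language-model hidden size.
mm_projector = nn.Sequential(
    nn.Linear(vision_hidden_size, llm_hidden_size),
    nn.GELU(),
    nn.Linear(llm_hidden_size, llm_hidden_size),
)
print(mm_projector)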

I am sharing the error message below as well, for better understanding.

(llava) work@main1[s010-jiwon-thesis]:~/testdataset1/LLaVA-NeXT$ bash scripts/train/finetune_clip.sh 
BASE_RUN_NAME: llavanext-openai_clip-vit-large-patch14-336-Qwen_Qwen2-7B-Instruct-mlp2x_gelu-pretrain_blip558k_plain
NUM_GPUS: 2
NNODES: 1
RANK: 0
ADDR: localhost
PORT: 29500
MID_RUN_NAME: floorplan_vqa_1000_results
Please install pyav to use video processing functions.Please install pyav to use video processing functions.

OpenCLIP not installed
OpenCLIP not installed
[2024-09-08 09:19:06,010] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-08 09:19:06,010] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-09-08 09:19:09,817] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-08 09:19:09,817] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
The speedups for torchdynamo mostly come wih GPU Ampere or higher and which is not detected here.
/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Rank 0:  Overwriting config with {'use_pos_skipping': False, 'pos_skipping_range': 4096, 'mm_spatial_pool_mode': 'bilinear'}
[2024-09-08 09:19:10,116] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-09-08 09:19:10,212] [INFO] [comm.py:652:init_distributed] cdb=None
The speedups for torchdynamo mostly come wih GPU Ampere or higher and which is not detected here.
/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
main1:608321:608321 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608321:608321 [0] NCCL INFO Bootstrap : Using eth0:10.63.0.2<0>
main1:608321:608321 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
main1:608321:608321 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
main1:608321:608321 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
main1:608321:608321 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)

main1:608321:608321 [0] misc/cudawrap.cc:188 NCCL WARN Failed to find CUDA library /opt/kernel/libcuda.so (NCCL_CUDA_PATH='/opt/kernel') : /opt/kernel/libcuda.so: cannot open shared object file: No such file or directory
NCCL version 2.20.5+cuda12.4
[2024-09-08 09:19:10,599] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
main1:608321:608442 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
main1:608321:608442 [0] NCCL INFO P2P plugin IBext
main1:608321:608442 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608321:608442 [0] NCCL INFO NET/IB : No device found.
main1:608321:608442 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
main1:608321:608442 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608321:608442 [0] NCCL INFO NET/IB : No device found.
main1:608321:608442 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608321:608442 [0] NCCL INFO NET/Socket : Using [0]eth0:10.63.0.2<0>
main1:608321:608442 [0] NCCL INFO Using non-device net plugin version 0
main1:608321:608442 [0] NCCL INFO Using network Socket

main1:608322:608322 [1] misc/cudawrap.cc:188 NCCL WARN Failed to find CUDA library /opt/kernel/libcuda.so (NCCL_CUDA_PATH='/opt/kernel') : /opt/kernel/libcuda.so: cannot open shared object file: No such file or directory
main1:608322:608322 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608322:608322 [1] NCCL INFO Bootstrap : Using eth0:10.63.0.2<0>
main1:608322:608322 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
main1:608322:608322 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
main1:608322:608322 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
main1:608322:608322 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
main1:608322:608447 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
main1:608322:608447 [1] NCCL INFO P2P plugin IBext
main1:608322:608447 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608322:608447 [1] NCCL INFO NET/IB : No device found.
main1:608322:608447 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
main1:608322:608447 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608322:608447 [1] NCCL INFO NET/IB : No device found.
main1:608322:608447 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
main1:608322:608447 [1] NCCL INFO NET/Socket : Using [0]eth0:10.63.0.2<0>
main1:608322:608447 [1] NCCL INFO Using non-device net plugin version 0
main1:608322:608447 [1] NCCL INFO Using network Socket
main1:608322:608447 [1] NCCL INFO comm 0xbbb1280 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d8000 commId 0x97ad33e05ea99c2c - Init START
main1:608321:608442 [0] NCCL INFO comm 0xd83b8a0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 86000 commId 0x97ad33e05ea99c2c - Init START
main1:608322:608447 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000
main1:608321:608442 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000
main1:608322:608447 [1] NCCL INFO comm 0xbbb1280 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
main1:608321:608442 [0] NCCL INFO comm 0xd83b8a0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
main1:608322:608447 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
main1:608321:608442 [0] NCCL INFO Channel 00/02 :    0   1
main1:608322:608447 [1] NCCL INFO P2P Chunksize set to 131072
main1:608321:608442 [0] NCCL INFO Channel 01/02 :    0   1
main1:608321:608442 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
main1:608321:608442 [0] NCCL INFO P2P Chunksize set to 131072
main1:608322:608447 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
main1:608321:608442 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
main1:608322:608447 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
main1:608321:608442 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
main1:608322:608447 [1] NCCL INFO Connected all rings
main1:608322:608447 [1] NCCL INFO Connected all trees
main1:608321:608442 [0] NCCL INFO Connected all rings
main1:608322:608447 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
main1:608321:608442 [0] NCCL INFO Connected all trees
main1:608322:608447 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
main1:608321:608442 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
main1:608321:608442 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
main1:608322:608447 [1] NCCL INFO comm 0xbbb1280 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d8000 commId 0x97ad33e05ea99c2c - Init COMPLETE
main1:608321:608442 [0] NCCL INFO comm 0xd83b8a0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 86000 commId 0x97ad33e05ea99c2c - Init COMPLETE
[2024-09-08 09:19:12,978] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 678, num_elems = 15.23B
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 4/4 [00:08<00:00,  2.19s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 4/4 [00:09<00:00,  2.37s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Rank 0:  Prompt version: qwen_1_5
Rank 0:  Loading vision tower: openai/clip-vit-large-patch14-336
[2024-09-08 09:19:23,971] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2024-09-08 09:19:24,105] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2024-09-08 09:19:24,405] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 1069, num_elems = 15.53B
/home/work/testdataset1/LLaVA-NeXT/llava/model/llava_arch.py:108: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  mm_projector_weights = torch.load(pretrain_mm_mlp_adapter, map_location="cpu")
/home/work/testdataset1/LLaVA-NeXT/llava/model/llava_arch.py:108: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  mm_projector_weights = torch.load(pretrain_mm_mlp_adapter, map_location="cpu")
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/work/testdataset1/LLaVA-NeXT/llava/train/train_mem.py", line 4, in <module>
[rank1]:     train()
[rank1]:   File "/home/work/testdataset1/LLaVA-NeXT/llava/train/train.py", line 1549, in train
[rank1]:     model.get_model().initialize_vision_modules(model_args=model_args, fsdp=training_args.fsdp)
[rank1]:   File "/home/work/testdataset1/LLaVA-NeXT/llava/model/llava_arch.py", line 113, in initialize_vision_modules
[rank1]:     incompatible_keys = self.mm_projector.load_state_dict(get_w(mm_projector_weights, "mm_projector"))
[rank1]:   File "/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
[rank1]:     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank1]: RuntimeError: Error(s) in loading state_dict for Sequential:
[rank1]:        size mismatch for 0.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([3584, 1024]).
[rank1]:        size mismatch for 0.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3584]).
[rank1]:        size mismatch for 2.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([3584, 3584]).
[rank1]:        size mismatch for 2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3584]).
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/work/testdataset1/LLaVA-NeXT/llava/train/train_mem.py", line 4, in <module>
[rank0]:     train()
[rank0]:   File "/home/work/testdataset1/LLaVA-NeXT/llava/train/train.py", line 1549, in train
[rank0]:     model.get_model().initialize_vision_modules(model_args=model_args, fsdp=training_args.fsdp)
[rank0]:   File "/home/work/testdataset1/LLaVA-NeXT/llava/model/llava_arch.py", line 113, in initialize_vision_modules
[rank0]:     incompatible_keys = self.mm_projector.load_state_dict(get_w(mm_projector_weights, "mm_projector"))
[rank0]:   File "/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
[rank0]:     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank0]: RuntimeError: Error(s) in loading state_dict for Sequential:
[rank0]:        size mismatch for 0.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([3584, 1024]).
[rank0]:        size mismatch for 0.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3584]).
[rank0]:        size mismatch for 2.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([3584, 3584]).
[rank0]:        size mismatch for 2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3584]).
W0908 09:19:26.526000 140475221702464 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 608322 closing signal SIGTERM
E0908 09:19:26.528000 140475221702464 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 608321) of binary: /home/work/anaconda3/envs/llava/bin/python
Traceback (most recent call last):
  File "/home/work/anaconda3/envs/llava/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/work/anaconda3/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
llava/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-08_09:19:26
  host      : main1
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 608321)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Thank you.

I wonder why the log shows [2024-09-08 09:19:12,978] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 678, num_elems = 15.23B when loading a 7B model?