Open LiJiaqi96 opened 2 weeks ago
Thanks for trying it! I will fix it later~
@LiJiaqi96 Please give it a try; I have updated the code. train_it_ds adds deepspeed support and needs some changes.
Thanks! I tried "train_it_ds.py" without using deepspeed, but it doesn't work. Is it possible to train without deepspeed? For now I would prefer not to use it.
Yes! You can run it without deepspeed. BTW, show me your log so that I can fix the bug~
Sorry for the late reply. The log is here
train_log.txt
In "config_7b_hd_stage4.py", I set enable=False in the deepspeed settings and ran the code with:
torchrun --nnodes=${NNODE} --nproc_per_node=${NUM_GPUS} \
--rdzv_endpoint=${MASTER_NODE}:10068 \
--rdzv_backend=c10d \
tasks/train_it_ds.py \
$(dirname $0)/config_7b_hd_stage4.py \
output_dir ${OUTPUT_DIR}
I'm not sure whether it is caused by the deepspeed or pytorch versions. Here are my versions of the relevant packages:
torch 1.13.1+cu117
torchaudio 0.13.1+cu117
torchnet 0.0.4
torchvision 0.14.1+cu117
deepspeed 0.14.2
transformers 4.40.1
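When comparing environments like this, it can help to print the installed versions programmatically rather than copying them by hand. The following is a minimal sketch using only the standard library; the package list is an assumption based on the packages mentioned in this thread, and installed_versions is a hypothetical helper name.

```python
from importlib import metadata

# Packages discussed in this thread; adjust the list as needed.
PACKAGES = ["torch", "torchaudio", "torchvision", "deepspeed", "transformers", "flash-attn"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:<14}{version or 'not installed'}")
```

Running this in both environments and comparing the output makes version mismatches easier to spot.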
BTW, sometimes you can fix the bug by changing find_unused_parameters to True or False.
Thanks, I will create an environment with exactly the same packages and have a try.
Hi, I found that shared_utils_ds.py has a bug in line 58:
optimizer_params = create_optimizer(config.optimizer, model, return_group=True)
optimizer.py may need to be updated.
Thanks for your feedback. I have updated the code.
I used the new environment except for flash-attn: as I use CUDA 12.1, I can only use flash-attn==2.1.0. I ran "scripts/videochat_mistral/run_7b_stage4_hd.sh" with "tasks/train_it.py" and deepspeed enable=False, then got the error in train_log0618.txt. The error seems to be caused by flash-attn.
Is it possible to run videochat2_hd using the same environment as videochat2_mistral, without using deepspeed?
BTW, I tested running the code on a single GPU (like python train_it.py) and it iterates normally.
Yes, it's okay to use it without deepspeed. I use DeepSpeed ZeRO to reduce GPU memory usage~
I see. Is it ok for you to run on multiple GPUs without deepspeed, just as the model runs in videochat2_mistral?
Update: I managed to solve the previous issue by upgrading flash-attn to 2.5.9. When I use "train_it_ds.py" with deepspeed enable=True, I hit a new issue with the deepspeed config:
trainlog_0621.txt
Could you please help me solve that?
Hi! Please try again with the latest commit.
Thanks for your update! Now the code runs with deepspeed enabled.
BTW, is there any place to find the newly added datasets for VideoChat2_HD? I suppose the datasets are important for improving model performance.
Almost all the datasets can be directly downloaded from their repos or homepages~
Give me feedback if you don't find them.
In "instruction_data.py", there are some newly added image datasets from M3IT and some newly added video datasets. Is there any place to find those video datasets? Thanks!
These datasets are generated from ShareGPTVideo, VidLN, FAVD and TimeIT_didemo.
Thanks for sharing!
Another question: how can I obtain the checkpoint after VideoChat2_HD training? In "demo_mistral_hd.ipynb":
state_dict = torch.load("your_model_path/videochat2/videochat2_hd_mistral_stage4.pth", "cpu")
I noticed that there are several files in the "ckpt_latest.pth" folder. Should I choose one of them?
Thanks!
Hi, could you please help me find the instruction json files, such as f"{anno_root_it}/video/caption/sharegptvideo/train_300k.json"? I did not find the json files in the HF VideoChat2-IT repo.
Sorry for the late reply. For the checkpoint, you need to use the file named mp_xxx, which saves the weights. For the instruction data, I will upload it today.
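As a rough illustration of the checkpoint advice above, the sketch below searches a checkpoint folder for files whose names start with mp_ and loads the first one on CPU. The helper names are hypothetical, and the assumption that the weights sit under a "module" key reflects how DeepSpeed usually lays out its model-state files; please verify it against your own checkpoint.

```python
from pathlib import Path


def find_model_states(ckpt_dir):
    """Return all files named mp_* anywhere under ckpt_dir, sorted by path."""
    return sorted(p for p in Path(ckpt_dir).rglob("mp_*") if p.is_file())


def load_weights(ckpt_dir):
    """Load the first mp_* file found under ckpt_dir and return its state dict."""
    import torch  # imported lazily so the file search works without PyTorch

    files = find_model_states(ckpt_dir)
    if not files:
        raise FileNotFoundError(f"no mp_* file found under {ckpt_dir}")
    checkpoint = torch.load(files[0], map_location="cpu")
    # Assumption: the model weights are stored under the "module" key.
    if isinstance(checkpoint, dict) and "module" in checkpoint:
        return checkpoint["module"]
    return checkpoint
```

The returned dictionary could then be used in place of the state_dict loaded in "demo_mistral_hd.ipynb", provided the key names match what the model expects.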
@LiJiaqi96 Please check the data in HuggingFace~
Hi, thanks for your update of VideoChat2_HD! When trying the newly released code, I had some questions:
First, the MetaLoader_rs class in "train_it_ds.py" seems to be missing.
Second, the load_and_transform_media_data_image function does not have a dynamic_config setting, which is passed to it in "it_dataset_mistral.py". I created a pull request to modify this part.