dvlab-research / LISA

Project Page for "LISA: Reasoning Segmentation via Large Language Model"
Apache License 2.0
1.76k stars 126 forks

The model "llava-7b-llama-2-7b-chat" merged by myself had problems during training. #45

Open zhangyupeng123 opened 1 year ago

zhangyupeng123 commented 1 year ago

Hello, we merged the model "zhangyupeng/llava-7b-llama-2-7b-chat" ourselves. Training uses two 3090 GPUs with batch_size=2 and grad_accumulation_steps=40. The following error appears during training. Could this be caused by our own merged model?

Traceback (most recent call last):
  File "/home/zhangyupeng/anaconda3/envs/lisa/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zhangyupeng/anaconda3/envs/lisa/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/mnt/21T/zhangyupeng/code/LISA/utils/dataset.py", line 135, in collate_fn
    assert cur_len == total_len
AssertionError

[2023-09-05 20:58:14,118] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 77018
[2023-09-05 20:58:14,119] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 77019
[2023-09-05 20:58:15,023] [ERROR] [launch.py:321:sigkill_handler] ['/home/zhangyupeng/anaconda3/envs/lisa/bin/python', '-u', 'train_ds.py', '--local_rank=1'] exits with return code = 1
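For what it's worth, the failing line in collate_fn is a bare assert, so the traceback gives no numbers to work with. A small local tweak (a hypothetical helper, not part of LISA's code) is to replace it with a check that reports both counts, which makes the mismatch easier to diagnose:

```python
def check_token_lengths(cur_len, total_len):
    # Drop-in replacement for the bare `assert cur_len == total_len`
    # in collate_fn: same failure condition, but the error message
    # shows both counts so the offending sample can be identified.
    if cur_len != total_len:
        raise ValueError(
            f"token-count mismatch: cur_len={cur_len}, total_len={total_len}"
        )

check_token_lengths(10, 10)  # consistent counts: passes silently
try:
    check_token_lengths(9, 10)
except ValueError as e:
    print(e)  # prints: token-count mismatch: cur_len=9, total_len=10
```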

X-Lai commented 1 year ago

I think this is caused by datasets. Can you check whether the datasets are correctly organized?

zhangyupeng123 commented 1 year ago

Hi @X-Lai, after downloading and unzipping the dataset, we renamed the files as you requested and uploaded them to the server. Are you saying there are other changes that need to be made?

dddraxxx commented 11 months ago

I feel like this is a model problem: I can run llama-13b but not the merged llama-7b, and I don't know how to solve it.

dddraxxx commented 11 months ago

I solved this. Just add legacy=True when creating the tokenizer:

tokenizer = transformers.AutoTokenizer.from_pretrained(
        args.version,
        cache_dir=None,
        model_max_length=args.model_max_length,
        padding_side="right",
        use_fast=False,
        legacy=True
    )

Refer to link
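A note on why this flag can matter: in transformers, the Llama tokenizer's legacy mode changes how text immediately after a special token is tokenized (the legacy path keeps an extra prefix-space piece), so the same string can yield different token counts under the two modes. The toy tokenizer below (purely illustrative, not the real transformers API) mimics that effect and shows how it can break length bookkeeping like the cur_len == total_len assertion above:

```python
# Toy stand-in for a SentencePiece-style tokenizer: in "legacy" mode,
# an extra prefix-space piece appears after each special token.
SPECIALS = {"[SEG]", "</s>"}

def toy_tokenize(text, legacy):
    out = []
    for piece in text.split():
        out.append(piece)
        if legacy and piece in SPECIALS:
            out.append("\u2581")  # extra prefix-space piece in legacy mode
    return out

turn = "It is [SEG] . </s>"
print(len(toy_tokenize(turn, legacy=True)))   # 7 pieces
print(len(toy_tokenize(turn, legacy=False)))  # 5 pieces

# If per-turn lengths were computed under one mode but the loaded
# tokenizer runs in the other, sums like cur_len no longer match
# total_len, which is consistent with the AssertionError above.
```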

AmrinKareem commented 10 months ago

Worked for me, thanks!

Amazingren commented 1 month ago

> Worked for me, thanks!

Dear @AmrinKareem,

I met the same issue, and this solution also works for me. May I ask whether it will affect the results of training the LISA model?

It would be super helpful for me.

Best regards and many thanks,