2U1 / Llama3.2-Vision-Finetune

An open-source implementation for fine-tuning Meta's Llama3.2-Vision series.
Apache License 2.0

training hangs when <image> token is not first in the input sequence #11

Open rothfels opened 22 hours ago

rothfels commented 22 hours ago

I tried running full fine-tuning (bash scripts/finetune.sh) with this dataset: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K

(The first row from the dataset exactly matches the example you provided in the README)

While that training example works correctly, other rows from the dataset, where the <image> token is not first in the input sequence, cause the Trainer to hang indefinitely at the start of .train(). I assume this happens while the program is loading/processing the training data.

Here is a minimal training data.json which includes only the first two rows of the LLaVA-Instruct-150K dataset, and causes the problem I'm describing:

[
    {
      "id": "000000033471",
      "image": "000000033471.jpg",
      "conversations": [
        {
            "from": "human",
            "value": "<image>\nWhat are the colors of the bus in the image?"
          },
          {
            "from": "gpt",
            "value": "The bus in the image is white and red."
          },
          {
            "from": "human",
            "value": "What feature can be seen on the back of the bus?"
          },
          {
            "from": "gpt",
            "value": "The back of the bus features an advertisement."
          },
          {
            "from": "human",
            "value": "Is the bus driving down the street or pulled off to the side?"
          },
          {
            "from": "gpt",
            "value": "The bus is driving down the street, which is crowded with people and other vehicles."
          }
      ]
    },
    {
      "id": "000000052846",
      "image": "000000052846.jpg",
      "conversations": [
        {
          "from": "human",
          "value": "Where is the cat positioned in the image?\n<image>"
        },
        {
          "from": "gpt",
          "value": "The cat is positioned on top of the back of the couch in the living room."
        },
        {
          "from": "human",
          "value": "What is the cat doing in the image?"
        },
        {
          "from": "gpt",
          "value": "The cat is coming out from some curtains onto the couch and is sitting or standing on top of it."
        },
        {
          "from": "human",
          "value": "What is the color of the curtains near the cat?"
        },
        {
          "from": "gpt",
          "value": "The color of the curtains near the cat is red."
        },
        {
          "from": "human",
          "value": "Besides the cat, what else can be seen on the couch?"
        },
        {
          "from": "gpt",
          "value": "There are lots of pillows on the couch besides the cat."
        },
        {
          "from": "human",
          "value": "What kind of room does the couch appear to be in?"
        },
        {
          "from": "gpt",
          "value": "The couch appears to be in a living room setting."
        }
      ]
    }
]

When I move the <image> token from the second example to the start of the sequence, training no longer hangs.

While training hangs, top keeps showing output like this indefinitely:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3559980 ubuntu    20   0   95.2g   6.1g 110700 R 100.3   0.3   2:06.22 python
3560169 ubuntu    20   0   88.3g   6.1g  97324 R 100.0   0.3   2:06.15 python
3560359 ubuntu    20   0   88.6g   6.1g  97348 R 100.0   0.3   2:06.55 python
3562894 ubuntu    20   0   88.6g   6.1g  97348 R 100.0   0.3   2:06.08 python
3564172 ubuntu    20   0   94.8g   6.1g 114024 R 100.3   0.3   2:02.80 python
3556400 ubuntu    20   0   89.8g  24.0g  18.2g R 100.0   1.4  24:02.30 python
3559091 ubuntu    20   0   88.6g   6.1g  98928 R 100.0   0.3   2:06.13 python
3559092 ubuntu    20   0   95.2g   6.1g 112376 R 100.0   0.3   2:06.25 python
3559180 ubuntu    20   0   88.3g   6.1g  97324 R 100.0   0.3   2:06.13 python
3559406 ubuntu    20   0   88.5g   6.1g  97360 R 100.0   0.3   2:06.11 python
3559539 ubuntu    20   0   95.2g   6.1g 112376 R 100.0   0.3   2:06.85 python
3559540 ubuntu    20   0   95.2g   6.1g 112368 R 100.0   0.3   2:07.24 python
3559665 ubuntu    20   0   88.3g   6.1g  98988 R 100.0   0.3   2:06.14 python
3559917 ubuntu    20   0   88.6g   6.1g  97320 R 100.0   0.3   2:06.05 python
3560043 ubuntu    20   0   88.6g   6.1g  99016 R 100.0   0.3   2:06.47 python
3560232 ubuntu    20   0   88.7g   6.1g  97328 R 100.0   0.3   2:06.10 python
3560358 ubuntu    20   0   88.6g   6.1g  95656 R 100.0   0.3   2:06.99 python
3560876 ubuntu    20   0   88.7g   6.1g  97328 R 100.0   0.3   2:06.09 python
3560936 ubuntu    20   0   88.5g   6.1g  97360 R 100.0   0.3   2:06.11 python
3561993 ubuntu    20   0   88.6g   6.1g  97320 R 100.0   0.3   2:06.09 python
3563981 ubuntu    20   0   88.7g   6.1g  97360 R 100.0   0.3   2:02.72 python
3559218 ubuntu    20   0   88.7g   6.1g  97328 R  99.7   0.3   2:06.12 python
3559321 ubuntu    20   0   88.6g   6.1g  99012 R  99.7   0.3   2:06.06 python

Note: the dataset/processor implementation from https://github.com/2U1/Phi3-Vision-Finetune doesn't seem to have this problem in my testing.
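
For reference, here's a quick sketch (separate from the repo code, and assuming the data.json layout above) that lists the rows triggering the hang, i.e. rows where an <image> tag appears anywhere other than the start of its turn:

import json

def find_offending_rows(path="./datasets/data.json"):
    with open(path) as f:
        rows = json.load(f)
    offending = []
    for row in rows:
        for turn in row["conversations"]:
            value = turn["value"]
            # A row hangs when some turn contains an <image> tag that is not
            # the first thing in that turn.
            if "<image>" in value and not value.startswith("<image>"):
                offending.append(row["id"])
                break
    return offending

if __name__ == "__main__":
    print(find_offending_rows())  # for the data.json above: ['000000052846']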

rothfels commented 21 hours ago

Here's a simplified script demonstrating the problem (just create ./datasets/data.json from the example above and put two appropriately named images in ./images):

from transformers import AutoProcessor
from torch.utils.data import DataLoader
from src.training.data import make_supervised_data_module
from src.training.params import DataArguments

def main():
    model_id_for_processor = "meta-llama/Llama-3.2-11B-Vision-Instruct"
    processor = AutoProcessor.from_pretrained(model_id_for_processor, device='cuda')
    # Use right padding with Llama 3.2's <|finetune_right_pad_id|> token (id 128004).
    processor.padding_side = 'right'
    processor.pad_token = '<|finetune_right_pad_id|>'
    processor.pad_token_id = processor.tokenizer.convert_tokens_to_ids(processor.pad_token)
    assert processor.pad_token_id == 128004

    data_args = DataArguments(data_path="./datasets/data.json", image_folder="./images")

    data_module = make_supervised_data_module(processor=processor, data_args=data_args)

    dataloader = DataLoader(
        data_module['train_dataset'],
        batch_size=2,
        shuffle=False,
        num_workers=0,
        collate_fn=data_module['data_collator'],
        drop_last=True
    )

    print("reading batch...")
    batch = next(iter(dataloader))
    print(batch)

if __name__ == "__main__":
    main()
2U1 commented 14 hours ago

Actually, you should add \n after the image token. My code replaces the exact pattern <image>\n. Also, I'm not sure Llama can use an image placed in the middle or at the end of the sequence.
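
If your data has the tag somewhere else, a rough (untested) sketch like this should rewrite each turn so the exact <image>\n pattern sits at the start; the field names follow the data.json example above:

import json

def normalize_image_tokens(in_path, out_path):
    with open(in_path) as f:
        rows = json.load(f)
    for row in rows:
        for turn in row["conversations"]:
            value = turn["value"]
            if "<image>" in value and not value.startswith("<image>\n"):
                # Strip the tag wherever it is and re-insert it at the front,
                # followed by a newline.
                text = value.replace("<image>", "").strip()
                turn["value"] = "<image>\n" + text
    with open(out_path, "w") as f:
        json.dump(rows, f, indent=2)

normalize_image_tokens("./datasets/data.json", "./datasets/data_fixed.json")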

rothfels commented 13 hours ago

You're right, thanks.

(Screenshot from the Llama image prompting docs.)

The cross-attention layers won't attend to an image that comes after the text tokens.