2U1 / Llama3.2-Vision-Finetune

An open-source implementation for fine-tuning Meta's Llama3.2-Vision series.
Apache License 2.0

training hangs when <image> token is not first in the input sequence #11

Open rothfels opened 22 hours ago

rothfels commented 22 hours ago

I tried running full fine-tuning (bash scripts/finetune.sh) with this dataset: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K

(The first row from the dataset exactly matches the example you provided in the README)

While that training example works correctly, other rows from the dataset, where the <image> token is not first in the input sequence, cause the Trainer to hang indefinitely at the start of .train(). I assume this happens while the program is loading/processing the training data.

Here is a minimal training data.json which includes only the first two rows of the LLaVA-Instruct-150K dataset, and causes the problem I'm describing:

[
    {
      "id": "000000033471",
      "image": "000000033471.jpg",
      "conversations": [
        {
            "from": "human",
            "value": "<image>\nWhat are the colors of the bus in the image?"
          },
          {
            "from": "gpt",
            "value": "The bus in the image is white and red."
          },
          {
            "from": "human",
            "value": "What feature can be seen on the back of the bus?"
          },
          {
            "from": "gpt",
            "value": "The back of the bus features an advertisement."
          },
          {
            "from": "human",
            "value": "Is the bus driving down the street or pulled off to the side?"
          },
          {
            "from": "gpt",
            "value": "The bus is driving down the street, which is crowded with people and other vehicles."
          }
      ]
    },
    {
      "id": "000000052846",
      "image": "000000052846.jpg",
      "conversations": [
        {
          "from": "human",
          "value": "Where is the cat positioned in the image?\n<image>"
        },
        {
          "from": "gpt",
          "value": "The cat is positioned on top of the back of the couch in the living room."
        },
        {
          "from": "human",
          "value": "What is the cat doing in the image?"
        },
        {
          "from": "gpt",
          "value": "The cat is coming out from some curtains onto the couch and is sitting or standing on top of it."
        },
        {
          "from": "human",
          "value": "What is the color of the curtains near the cat?"
        },
        {
          "from": "gpt",
          "value": "The color of the curtains near the cat is red."
        },
        {
          "from": "human",
          "value": "Besides the cat, what else can be seen on the couch?"
        },
        {
          "from": "gpt",
          "value": "There are lots of pillows on the couch besides the cat."
        },
        {
          "from": "human",
          "value": "What kind of room does the couch appear to be in?"
        },
        {
          "from": "gpt",
          "value": "The couch appears to be in a living room setting."
        }
      ]
    }
]

When I move the <image> token from the second example to the start of the sequence, training no longer hangs.

While training hangs, top keeps showing output like this indefinitely:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3559980 ubuntu    20   0   95.2g   6.1g 110700 R 100.3   0.3   2:06.22 python
3560169 ubuntu    20   0   88.3g   6.1g  97324 R 100.0   0.3   2:06.15 python
3560359 ubuntu    20   0   88.6g   6.1g  97348 R 100.0   0.3   2:06.55 python
3562894 ubuntu    20   0   88.6g   6.1g  97348 R 100.0   0.3   2:06.08 python
3564172 ubuntu    20   0   94.8g   6.1g 114024 R 100.3   0.3   2:02.80 python
3556400 ubuntu    20   0   89.8g  24.0g  18.2g R 100.0   1.4  24:02.30 python
3559091 ubuntu    20   0   88.6g   6.1g  98928 R 100.0   0.3   2:06.13 python
3559092 ubuntu    20   0   95.2g   6.1g 112376 R 100.0   0.3   2:06.25 python
3559180 ubuntu    20   0   88.3g   6.1g  97324 R 100.0   0.3   2:06.13 python
3559406 ubuntu    20   0   88.5g   6.1g  97360 R 100.0   0.3   2:06.11 python
3559539 ubuntu    20   0   95.2g   6.1g 112376 R 100.0   0.3   2:06.85 python
3559540 ubuntu    20   0   95.2g   6.1g 112368 R 100.0   0.3   2:07.24 python
3559665 ubuntu    20   0   88.3g   6.1g  98988 R 100.0   0.3   2:06.14 python
3559917 ubuntu    20   0   88.6g   6.1g  97320 R 100.0   0.3   2:06.05 python
3560043 ubuntu    20   0   88.6g   6.1g  99016 R 100.0   0.3   2:06.47 python
3560232 ubuntu    20   0   88.7g   6.1g  97328 R 100.0   0.3   2:06.10 python
3560358 ubuntu    20   0   88.6g   6.1g  95656 R 100.0   0.3   2:06.99 python
3560876 ubuntu    20   0   88.7g   6.1g  97328 R 100.0   0.3   2:06.09 python
3560936 ubuntu    20   0   88.5g   6.1g  97360 R 100.0   0.3   2:06.11 python
3561993 ubuntu    20   0   88.6g   6.1g  97320 R 100.0   0.3   2:06.09 python
3563981 ubuntu    20   0   88.7g   6.1g  97360 R 100.0   0.3   2:02.72 python
3559218 ubuntu    20   0   88.7g   6.1g  97328 R  99.7   0.3   2:06.12 python
3559321 ubuntu    20   0   88.6g   6.1g  99012 R  99.7   0.3   2:06.06 python

Note: the dataset/processor implementation from https://github.com/2U1/Phi3-Vision-Finetune doesn't seem to have this problem in my testing.
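
For reference, here's a quick sketch (separate from the repo code, and assuming the data.json layout above) that lists the rows triggering the hang, i.e. rows where an <image> tag appears anywhere other than the start of its turn:

import json

def find_offending_rows(path="./datasets/data.json"):
    with open(path) as f:
        rows = json.load(f)
    offending = []
    for row in rows:
        for turn in row["conversations"]:
            value = turn["value"]
            # A row hangs when some turn contains an <image> tag that is not
            # the first thing in that turn.
            if "<image>" in value and not value.startswith("<image>"):
                offending.append(row["id"])
                break
    return offending

if __name__ == "__main__":
    print(find_offending_rows())  # for the data.json above: ['000000052846']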

rothfels commented 21 hours ago

Here's a simplified script demonstrating the problem (just create ./datasets/data.json from the example above and put two appropriately named images in ./images):

from transformers import AutoProcessor
from torch.utils.data import DataLoader
from src.training.data import make_supervised_data_module
from src.training.params import DataArguments

def main():
    model_id_for_processor = "meta-llama/Llama-3.2-11B-Vision-Instruct"
    processor = AutoProcessor.from_pretrained(model_id_for_processor, device='cuda')
    # Use right padding with Llama 3.2's <|finetune_right_pad_id|> token (id 128004).
    processor.padding_side = 'right'
    processor.pad_token = '<|finetune_right_pad_id|>'
    processor.pad_token_id = processor.tokenizer.convert_tokens_to_ids(processor.pad_token)
    assert processor.pad_token_id == 128004

    data_args = DataArguments(data_path="./datasets/data.json", image_folder="./images")

    data_module = make_supervised_data_module(processor=processor, data_args=data_args)

    dataloader = DataLoader(
        data_module['train_dataset'],
        batch_size=2,
        shuffle=False,
        num_workers=0,
        collate_fn=data_module['data_collator'],
        drop_last=True
    )

    print("reading batch...")
    batch = next(iter(dataloader))
    print(batch)

if __name__ == "__main__":
    main()
2U1 commented 14 hours ago

Actually, you should add \n after the image token. My code replaces the exact pattern <image>\n. Also, I'm not sure Llama can use an image placed in the middle or at the end of the sequence.
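
If your data has the tag somewhere else, a rough (untested) sketch like this should rewrite each turn so the exact <image>\n pattern sits at the start; the field names follow the data.json example above:

import json

def normalize_image_tokens(in_path, out_path):
    with open(in_path) as f:
        rows = json.load(f)
    for row in rows:
        for turn in row["conversations"]:
            value = turn["value"]
            if "<image>" in value and not value.startswith("<image>\n"):
                # Strip the tag wherever it is and re-insert it at the front,
                # followed by a newline.
                text = value.replace("<image>", "").strip()
                turn["value"] = "<image>\n" + text
    with open(out_path, "w") as f:
        json.dump(rows, f, indent=2)

normalize_image_tokens("./datasets/data.json", "./datasets/data_fixed.json")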

rothfels commented 13 hours ago

You're right, thanks.

(Screenshot from the Llama image prompting docs.)

The cross-attention layers won't attend to an image that comes after the text tokens.