Here's a simplified script demonstrating the problem (just create `./datasets/data.json` using the example above and put two random images with appropriate names in `./images`):
```python
from transformers import AutoProcessor
from torch.utils.data import DataLoader

from src.training.data import make_supervised_data_module
from src.training.params import DataArguments


def main():
    model_id_for_processor = "meta-llama/Llama-3.2-11B-Vision-Instruct"
    processor = AutoProcessor.from_pretrained(model_id_for_processor, device='cuda')

    # Right-pad with the Llama 3.2 dedicated pad token.
    processor.padding_side = 'right'
    processor.pad_token = '<|finetune_right_pad_id|>'
    processor.pad_token_id = processor.tokenizer.convert_tokens_to_ids(processor.pad_token)
    assert processor.pad_token_id == 128004

    data_args = DataArguments(data_path="./datasets/data.json", image_folder="./images")
    data_module = make_supervised_data_module(processor=processor, data_args=data_args)

    dataloader = DataLoader(
        data_module['train_dataset'],
        batch_size=2,
        shuffle=False,
        num_workers=0,
        collate_fn=data_module['data_collator'],
        drop_last=True
    )

    print("reading batch...")
    batch = next(iter(dataloader))  # with the problematic data.json, this call hangs
    print(batch)


if __name__ == "__main__":
    main()
```
Actually you should add `\n` after the image token. My code replaces the exact pattern `<image>\n`.

Also, I'm not sure Llama can use an image placed in the middle or at the end of the sequence.
You're right, thanks.
(From the Llama image prompting docs:)

> The cross attention layer won't attend to an image that comes after the text tokens.
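Given that, here is a minimal sketch of what normalizing the data could look like, assuming the LLaVA-style `data.json` layout (the `normalize_conversation` / `normalize_dataset` helpers are illustrative, not part of this repo): it strips `<image>` from wherever it appears and re-inserts it at the start of the first human turn, followed by `\n`.

```python
import json

IMAGE_TOKEN = "<image>"


def normalize_conversation(turns):
    """Strip the <image> token from wherever it appears and re-insert it at
    the very start of the first human turn, followed by a newline, so the
    text matches the exact '<image>' + newline pattern and the image
    precedes all text tokens."""
    if not any(IMAGE_TOKEN in t.get("value", "") for t in turns):
        return turns
    for t in turns:
        t["value"] = t.get("value", "").replace(IMAGE_TOKEN, "").strip()
    for t in turns:
        if t.get("from") == "human":
            t["value"] = f"{IMAGE_TOKEN}\n{t['value']}"
            break
    return turns


def normalize_dataset(in_path, out_path):
    with open(in_path) as f:
        rows = json.load(f)
    for row in rows:
        if "conversations" in row:
            row["conversations"] = normalize_conversation(row["conversations"])
    with open(out_path, "w") as f:
        json.dump(rows, f, indent=2)


if __name__ == "__main__":
    normalize_dataset("./datasets/data.json", "./datasets/data_normalized.json")
```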
I tried running full fine-tuning (`bash scripts/finetune.sh`) with this dataset: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K (the first row of the dataset exactly matches the example you provided in the README).

While that training example works correctly, other rows from the dataset where the `<image>` token is not the first token in the input sequence cause the `Trainer` to hang indefinitely at the beginning of `.train()`. I assume this happens while the program is loading/processing the training data.

Here is a minimal training `data.json` which includes only the first two rows of the LLaVA-Instruct-150K dataset and causes the problem I'm describing:

When I move the `<image>` token from the second example to the start of the sequence, training no longer hangs.

While the training hangs, `top` outputs something like this forever:

Note: the dataset/processor implementation from https://github.com/2U1/Phi3-Vision-Finetune doesn't seem to have this problem in my testing.
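For anyone hitting the same hang, a quick sanity check before launching training is to flag rows whose first human turn does not start with the exact `<image>` + `\n` pattern. A minimal sketch, assuming the LLaVA-style `data.json` layout (`find_suspect_rows` is an illustrative helper, not part of this repo):

```python
import json

IMAGE_TOKEN = "<image>"


def find_suspect_rows(data_path):
    """Return ids of samples that contain <image> somewhere in the
    conversation but whose first human turn does not start with
    '<image>' followed by a newline."""
    with open(data_path) as f:
        rows = json.load(f)
    suspects = []
    for row in rows:
        turns = row.get("conversations", [])
        if not any(IMAGE_TOKEN in t.get("value", "") for t in turns):
            continue
        human_turns = [t.get("value", "") for t in turns if t.get("from") == "human"]
        if not human_turns or not human_turns[0].startswith(IMAGE_TOKEN + "\n"):
            suspects.append(row.get("id"))
    return suspects


if __name__ == "__main__":
    print(find_suspect_rows("./datasets/data.json"))
```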