NVlabs / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

Potential bug in mm_utils.py process_image function #54

Open hubenjm opened 4 months ago

hubenjm commented 4 months ago

When data_args.image_aspect_ratio = 'resize', it seems that mm_utils.process_image returns the image as a PIL.Image.Image data type, which has no shape attribute. See https://github.com/Efficient-Large-Model/VILA/blob/main/llava/mm_utils.py#L168

During stage 1 alignment training, we use the LazySupervisedDataset class from llava/data/dataset.py, whose __getitem__ method tries to read image.shape here: https://github.com/Efficient-Large-Model/VILA/blob/main/llava/data/dataset.py#L834

Because a PIL image has no shape attribute, that access raises an AttributeError and crashes training. Should we simply add the line image = processor.preprocess(image, return_tensors="pt")["pixel_values"][0] below line 168 of mm_utils.py (https://github.com/Efficient-Large-Model/VILA/blob/main/llava/mm_utils.py#L168)?
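
To make the failure mode concrete, here is a minimal, dependency-free sketch of the control flow as I understand it. It is not the actual VILA code: StubImage, StubTensor, StubProcessor, process_image_sketch, the apply_fix flag, and the 336x336 crop size are all hypothetical stand-ins for PIL.Image.Image, a torch tensor, the image processor, and mm_utils.process_image. The only line taken from the real proposal is the processor.preprocess(...)["pixel_values"][0] call.

```python
from types import SimpleNamespace


class StubTensor:
    """Hypothetical stand-in for a torch.Tensor; only exposes .shape."""

    def __init__(self, shape):
        self.shape = shape


class StubImage:
    """Hypothetical stand-in for PIL.Image.Image: has .size and .resize(), no .shape."""

    def __init__(self, size):
        self.size = size  # (width, height), as in PIL

    def resize(self, size):
        return StubImage(size)


class StubProcessor:
    """Hypothetical stand-in for the image processor held in data_args."""

    crop_size = {"height": 336, "width": 336}  # assumed value, for illustration only

    def preprocess(self, image, return_tensors="pt"):
        width, height = image.size
        return {"pixel_values": [StubTensor((3, height, width))]}


def process_image_sketch(image, data_args, apply_fix):
    """Simplified model of the 'resize' branch, with and without the proposed line."""
    processor = data_args.image_processor
    if data_args.image_aspect_ratio == "resize":
        crop = processor.crop_size
        image = image.resize((crop["width"], crop["height"]))
        if apply_fix:
            # The line proposed in this issue: convert the resized image to a tensor.
            image = processor.preprocess(image, return_tensors="pt")["pixel_values"][0]
        return image
    return processor.preprocess(image, return_tensors="pt")["pixel_values"][0]


data_args = SimpleNamespace(image_aspect_ratio="resize", image_processor=StubProcessor())
raw = StubImage((640, 480))

unfixed = process_image_sketch(raw, data_args, apply_fix=False)
fixed = process_image_sketch(raw, data_args, apply_fix=True)

print(hasattr(unfixed, "shape"))  # False: the caller's image.shape access would fail
print(fixed.shape)                # (3, 336, 336)
```

Without the extra line, the caller receives an image object and the image.shape access fails; with it, the caller receives a tensor-like object with a shape, which is what the dataset code seems to expect.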

Efficient-Large-Language-Model commented 4 months ago

This seems valid. We will verify on our end and make the changes.