Efficient-Large-Model / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

Potential bug in mm_utils.py process_image function #54

Open hubenjm opened 1 month ago

hubenjm commented 1 month ago

When data_args.image_aspect_ratio = 'resize', mm_utils.process_image appears to return the image as a PIL.Image.Image, which has no shape attribute. See https://github.com/Efficient-Large-Model/VILA/blob/main/llava/mm_utils.py#L168

Stage 1 alignment training uses the LazySupervisedDataset class (llava/data/dataset.py), whose __getitem__ method tries to access image.shape here: https://github.com/Efficient-Large-Model/VILA/blob/main/llava/data/dataset.py#L834

This crashes training with an AttributeError, because the PIL image was never converted to a tensor. Should we simply add the line image = processor.preprocess(image, return_tensors="pt")["pixel_values"][0] below line 168 of mm_utils.py? https://github.com/Efficient-Large-Model/VILA/blob/main/llava/mm_utils.py#L168
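
To make the proposal concrete, here is a minimal sketch of the 'resize' branch with the extra preprocess call. It is not the code in the repository. StubImage, StubTensor, StubProcessor and process_image_resize are hypothetical stand-ins written for illustration, so the snippet runs without PIL, torch or transformers; the crop_size value of 336 is also an assumption. Only the preprocess(...)["pixel_values"][0] line is taken from the suggestion above.

```python
class StubTensor:
    """Hypothetical stand-in for a torch.Tensor: it only carries a shape."""

    def __init__(self, shape):
        self.shape = shape


class StubImage:
    """Hypothetical stand-in for PIL.Image.Image: has size and resize(), no shape."""

    def __init__(self, size):
        self.size = size  # (width, height), the order PIL uses

    def resize(self, size):
        return StubImage(size)


class StubProcessor:
    """Hypothetical stand-in for the image processor held in data_args."""

    crop_size = {"height": 336, "width": 336}  # assumed value, for illustration

    def preprocess(self, image, return_tensors="pt"):
        width, height = image.size
        # Mimic the return structure: a dict holding a batch of (C, H, W) tensors.
        return {"pixel_values": [StubTensor((3, height, width))]}


def process_image_resize(image, processor):
    """Sketch of the 'resize' branch with the proposed fix applied."""
    size = processor.crop_size
    image = image.resize((size["width"], size["height"]))
    # Proposed extra line: convert the resized image to a tensor so that
    # downstream code can read image.shape.
    image = processor.preprocess(image, return_tensors="pt")["pixel_values"][0]
    return image


raw = StubImage((640, 480))
print(hasattr(raw, "shape"))  # the raw image has no shape attribute
out = process_image_resize(raw, StubProcessor())
print(out.shape)
```

Without the preprocess line, the function would hand back the resized image object itself, and any later access to image.shape would raise AttributeError.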

Efficient-Large-Language-Model commented 1 month ago

This seems valid. We will verify it on our end and make the changes.