MaverickRen / PixelLM

PixelLM is an effective and efficient LMM for pixel-level reasoning and understanding. PixelLM was accepted to CVPR 2024.
Apache License 2.0

Worse performance compared with LISA, which is inconsistent with the results shown in the paper #13

Open shupinghu opened 3 months ago

shupinghu commented 3 months ago

Thank you for the released code and weights! I'm trying to deploy the released model on a V100, but the segmentation performance is much worse than LISA's, which is inconsistent with the results shown in the paper. Could you please help me find the reason?

For example, I use the prompt "find the person in the picture" to segment the same picture. The output of PixelLM: [attached screenshot]. The output of LISA: [attached screenshot]. Note:

  1. For PixelLM, we use the "PixelLM-7B" float32 model. If we use the bf16 model, the error `"upsample_nearest2d_out_frame" not implemented for 'BFloat16'` occurs; if we use the fp16 model, the error `'LlamaAttention' object has no attribute 'rope_theta'` occurs (a possible workaround for the bf16 error is sketched after this list).
  2. For LISA, we use the 7B bf16 model.
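
A minimal workaround sketch for the bf16 error, assuming it comes from calling nearest-neighbor upsampling on bfloat16 tensors, for which this PyTorch build has no kernel; `upsample_bf16_safe` is a hypothetical helper, not part of the PixelLM code:

```python
import torch
import torch.nn.functional as F

def upsample_bf16_safe(x: torch.Tensor, size) -> torch.Tensor:
    # Some PyTorch builds lack a bfloat16 kernel for nearest upsampling,
    # so run the interpolation in float32 and cast back to the input dtype.
    if x.dtype == torch.bfloat16:
        return F.interpolate(x.float(), size=size, mode="nearest").to(torch.bfloat16)
    return F.interpolate(x, size=size, mode="nearest")
```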

Versions: Python 3.8.0, PyTorch 2.0.0+cu117, transformers 4.31.0, tokenizers 0.13.3, deepspeed 0.14.0, openai 0.27.8

shupinghu commented 3 months ago

Another comparison, in PixelLM: [attached screenshot], in LISA: [attached screenshot]

MaverickRen commented 3 months ago

Although PixelLM uses a lightweight decoder, it can match or even outperform SAM-based methods such as LISA on datasets like refCOCO. However, SAM is still more stable for fine segmentation of complex scenes. This is related to SAM's high-resolution input, and we are also working on solving this problem.

shupinghu commented 3 months ago

> Although PixelLM uses a lightweight decoder, it can match or even outperform SAM-based methods such as LISA on datasets like refCOCO. However, SAM is still more stable for fine segmentation of complex scenes. This is related to SAM's high-resolution input, and we are also working on solving this problem.

So does this mean that PixelLM can achieve better performance on refCOCO, as shown in the paper, but not on COCO?

Since the reason is SAM's high-resolution input, what if we use a high-resolution input as well? According to "chat.py", "image_size" is set to 1024 and "resize_vision_tower_size" is set to 448; when I set "image_size=1920" and "resize_vision_tower_size=1920", the inference results show no difference.

MaverickRen commented 3 months ago

The openai-CLIP encoder we use is trained at 336-pixel resolution. Even if you increase the input to 1920, it will not be further helpful.
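
For context, one common reason that raising image_size alone changes nothing is that a ViT keeps a positional-embedding grid fixed at its training resolution (24x24 patches for CLIP ViT-L/14-336), so larger inputs only take effect if those embeddings are also resized. Below is a rough sketch of the usual interpolation trick, as an illustration of the general technique rather than PixelLM's actual implementation:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    # pos_embed: (1, 1 + g*g, dim), with a leading class-token position.
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    g = int(patch_pos.shape[1] ** 0.5)
    dim = patch_pos.shape[-1]
    # Reshape to a 2D grid, resize with bicubic interpolation, flatten back.
    patch_pos = patch_pos.reshape(1, g, g, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=1)
```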

MaverickRen commented 3 months ago

In our experience, scenes in the COCO dataset are more cluttered, and training at a higher resolution helps with this. But that also brings a higher training load.

shupinghu commented 3 months ago

Aside from the input size of openai-CLIP, do you think there are other reasons that may cause the poor performance on segmentation edges? For example, less training data?

MaverickRen commented 3 months ago

Yes, the volume of segmentation data we train on (~200k) is nowhere near that of SAM. You might look into ViT-Adapter, which can mitigate this problem.

shupinghu commented 3 months ago

As far as I know, PixelLM uses the same training dataset as LISA, but LISA uses SAM as its vision backbone (with frozen parameters), and SAM was trained on far more data. Is that the other reason why LISA's performance is better, am I right?

As for ViT-Adapter, that is a kind of fine-tuning method. Do you mean I can use it to fine-tune the released PixelLM-7B model at a larger input size on my own dataset (or on the whole training dataset as in your paper)? If so, maybe I also need to change "--vision-tower='openai/clip-vit-large-patch14-336'" and "--resize_vision_tower_size=448" to larger values during training?