shupinghu opened this issue 3 months ago
Another comparison:
In PixelLM:
In LISA:
Although PixelLM uses a lightweight decoder, it can match or even outperform SAM-based methods such as LISA on datasets like refCOCO. However, SAM is still more stable for fine segmentation in complex scenes. This is related to SAM's high-resolution input, and we are also working on solving this problem.
So does this mean that PixelLM can perform better on refCOCO, as shown in the paper, but not on COCO?
Since the reason is that SAM uses a high-resolution input, what if we use a high-resolution input as well? According to "chat.py", "image_size" is set to 1024 and "resize_vision_tower_size" is set to 448. When I set "image_size=1920" and "resize_vision_tower_size=1920", the inference results show no difference.
The OpenAI CLIP encoder we use was trained at 336 px resolution. Even if you increase the input to 1920, it will not help any further.
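One concrete detail behind this: CLIP ViT-L/14-336 was pretrained on a 24×24 grid of 14 px patches, so a larger input only changes anything if it maps to a larger patch grid, and 1920 is not even divisible by the 14 px patch size. A minimal sketch of the arithmetic (the patch size comes from the model name "clip-vit-large-patch14-336"; the helper function itself is hypothetical):

```python
# Patch-token arithmetic for a ViT encoder (illustrative sketch only).

def vit_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens a square ViT input produces."""
    if image_size % patch_size != 0:
        raise ValueError(f"{image_size} is not divisible by patch size {patch_size}")
    side = image_size // patch_size
    return side * side

print(vit_patch_tokens(336))  # 576: the 24x24 grid CLIP's positional embeddings were trained on
print(vit_patch_tokens(448))  # 1024: a 32x32 grid, which already requires resized positional embeddings
print(1920 % 14)              # nonzero: 1920 does not align with the 14 px patch size at all
```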
In our experience, scenes in the COCO dataset are more cluttered, and training at a higher resolution helps with that. But it also brings a higher training load.
Setting aside the input size of OpenAI CLIP, do you think there are other reasons that might cause the poor performance on segmentation edges? For example, a smaller training dataset?
Yes, our volume of segmentation training data (~200k) is nowhere near that of SAM. You might look into ViT-Adapter, which can mitigate this problem.
As far as I know, PixelLM uses the same training dataset as LISA, but LISA uses SAM as its vision backbone (with frozen parameters), and SAM was trained on far more data. Is this another reason that LISA performs better?
As for ViT-Adapter, that is a kind of finetuning method. Do you mean I can use it to finetune the released PixelLM-7B model at a larger input size on my own dataset (or on the full training dataset from your paper)? If so, should I also change "--vision-tower='openai/clip-vit-large-patch14-336'" and "--resize_vision_tower_size=448" to larger values during training?
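One thing training at a larger "resize_vision_tower_size" would involve: the CLIP positional embeddings have to be resized from the pretrained 24×24 grid (336/14) to the new grid, e.g. 32×32 for 448. A toy sketch of the idea, using nearest-neighbour on plain Python lists for illustration (a real implementation would interpolate the torch tensor, typically bicubically):

```python
# Sketch: resizing a ViT positional-embedding grid to match a larger input.
# Nearest-neighbour on nested lists -- purely illustrative.

def resize_pos_grid(grid, new_side):
    """grid: old_side x old_side list of embeddings; returns new_side x new_side."""
    old_side = len(grid)
    out = []
    for r in range(new_side):
        row = []
        for c in range(new_side):
            # map each target cell back to its nearest source cell
            src_r = min(old_side - 1, r * old_side // new_side)
            src_c = min(old_side - 1, c * old_side // new_side)
            row.append(grid[src_r][src_c])
        out.append(row)
    return out

old = [[(r, c) for c in range(24)] for r in range(24)]  # 336 / 14 = 24
new = resize_pos_grid(old, 32)                          # 448 / 14 = 32
print(len(new), len(new[0]))
```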
Thank you for the released code and weights! I have recently been trying to deploy the released model on a V100, but the segmentation performance is much worse than LISA's, which is inconsistent with the results shown in the paper. Could you please help me find the reason?
For example, I use the prompt "find the person in the picture" to segment the same picture. The output of PixelLM is:
The output of LISA is:
Note:
versions: Python 3.8.0, PyTorch 2.0.0+cu17, transformers 4.31.0, tokenizers 0.13.3, deepspeed 0.14.0, openai 0.27.8