bfshi / scaling_on_scales

When do we not need larger vision models?
MIT License

About Vstar Evaluation #14

Open jungle-gym-ac opened 5 months ago

jungle-gym-ac commented 5 months ago

Hello, may I ask how you evaluate your models on V*? Did you directly use the multiple_choices_inference function provided by V* to calculate the model's log-likelihood of each option and select the option with the highest likelihood?

bfshi commented 5 months ago

Yes, that's right.
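For readers following along: the evaluation strategy confirmed above scores each answer option by the model's log-likelihood and picks the argmax. A minimal sketch of that selection logic is below; `score_option` is a hypothetical stand-in for the actual model call made inside V*'s multiple_choices_inference, not the real API.

```python
# Sketch of log-likelihood-based multiple-choice evaluation, in the
# spirit of V*'s multiple_choices_inference: score each option by the
# model's log-likelihood of that option given the question, then pick
# the highest-scoring one. `score_option(question, option)` is a
# hypothetical callable standing in for the real model forward pass.

def select_option(question, options, score_option):
    """Return the index of the option with the highest log-likelihood."""
    scores = [score_option(question, opt) for opt in options]
    return max(range(len(options)), key=lambda i: scores[i])

# Toy scorer for illustration only: favors options closest in length
# to "red", so option 0 wins here.
toy_scorer = lambda q, opt: -abs(len(opt) - len("red"))
print(select_option("What color is the cup?", ["red", "blue", "green"], toy_scorer))  # → 0
```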

nikitha0808 commented 1 month ago

Hi, could you share the evaluation code if possible? I'm not able to reproduce the reported scores on V* with the provided checkpoint when I use the multiple_choices_inference function. Thanks.

bfshi commented 1 month ago

Hi! Can you share what results you get with the multiple_choices_inference function? V* results can fluctuate a bit because of the small size of the dataset.
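As a rough guide to how much fluctuation a small benchmark allows, one can compute the binomial standard error of an accuracy estimate. The sample size below is a hypothetical placeholder, not the actual V* subset size:

```python
import math

def accuracy_std_err(p, n):
    """Binomial standard error of an accuracy estimate p measured on n samples."""
    return math.sqrt(p * (1 - p) / n)

# With a hypothetical n = 100 samples and accuracy near 0.5,
# one standard error is about 5 accuracy points:
print(round(accuracy_std_err(0.5, 100) * 100, 1))  # → 5.0
```

So on a benchmark of only around a hundred examples, score differences of a few points can fall within normal statistical noise.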

nikitha0808 commented 1 month ago

I'm getting direct_attributes: 46.1 and relative_position: 57.9 for the llava-s2 checkpoint. But with the same code I'm able to reproduce the llava checkpoint's scores, i.e., direct_attributes: 43.5 and relative_position: 56.6. Can I conclude that these values are within the fluctuation limit?

bfshi commented 1 month ago

Interesting. I think this is beyond the range of reasonable fluctuation. Could you test the 13B checkpoint of llava-s2 to see if there's a regression there as well?

nikitha0808 commented 1 month ago

For the 13B checkpoint I'm getting the following: direct_attributes: 46.1, relative_position: 63.2. I'm attaching the evaluation code I used as well: eval_script