Open jungle-gym-ac opened 5 months ago
Yes that's right.
Hi, could you share the evaluation code if possible? I'm not able to reproduce the reported scores on V* with the provided checkpoint when I use the multiple_choices_inference function. Thanks.
Hi! Can you share what results you get using the multiple_choices_inference function? V* results can fluctuate a bit because of the small scale of the dataset.
I'm getting direct_attributes: 46.1 and relative_position: 57.9 for the llava-s2 checkpoint. With the same code I am able to reproduce the llava checkpoint's scores, i.e., direct_attributes: 43.5 and relative_position: 56.6. Can I conclude that these values are within the fluctuation limit?
Interesting. I think this is beyond the range of reasonable fluctuation. Could you test the 13B checkpoint of llava-s2 to see if there's a regression there as well?
For the 13B checkpoint I'm getting the following: direct_attributes: 46.1, relative_position: 63.2. I'm attaching the evaluation code I used as well. eval_script
Hello, may I ask how you evaluate your models on V*? Did you directly use the multiple_choices_inference function provided by V* to compute the model's log-likelihood for each option and select the option with the highest likelihood?
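For reference, the evaluation procedure described above (score each candidate answer by log-likelihood, pick the argmax) can be sketched as follows. This is a minimal illustration, not the actual V* evaluation code: score_fn is a hypothetical stand-in for whatever multiple_choices_inference returns per option, and the sample format is assumed.

```python
from typing import Callable, Sequence

def select_option(question: str,
                  options: Sequence[str],
                  score_fn: Callable[[str, str], float]) -> int:
    """Return the index of the option with the highest score.

    score_fn(question, option) is assumed to return the model's
    log-likelihood of `option` given `question` (a hypothetical
    stand-in for V*'s multiple_choices_inference).
    """
    scores = [score_fn(question, opt) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)

def accuracy(samples, score_fn) -> float:
    """samples: iterable of (question, options, answer_index) tuples."""
    samples = list(samples)
    correct = sum(
        select_option(q, opts, score_fn) == ans
        for q, opts, ans in samples
    )
    return correct / len(samples)
```

With this setup, a per-category score such as direct_attributes is just `accuracy` computed over that category's samples.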