Open hccngu opened 1 year ago
Hi, thanks for the interesting analysis! The GSM8K and SVAMP datasets are indeed used in Flan-T5 training, but we are not sure why performance gets worse as model size increases. This definitely deserves a closer look, so please let us know what you find!
Overall, the more parameters the model has, the worse it performs. What do you think is the reason for this? Also, did you use the test sets of the three datasets mentioned above to train the model? If so, could it be that the smaller models overfit to the test data? Thank you~
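For anyone who wants to test the overlap hypothesis, here is a minimal sketch of an exact-match contamination check. It assumes you have already extracted the training and test questions as plain strings. The function names `normalize` and `contamination_rate` and the toy data are hypothetical, not part of the Flan-T5 training pipeline.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def contamination_rate(train_questions, test_questions):
    """Fraction of test questions whose normalized text also appears in the training set."""
    if not test_questions:
        return 0.0
    train_set = {normalize(q) for q in train_questions}
    hits = sum(normalize(q) in train_set for q in test_questions)
    return hits / len(test_questions)


# Hypothetical toy data, not taken from GSM8K or SVAMP.
train = ["Tom has 3 apples and buys 2 more. How many apples does he have?"]
test = [
    "Tom has 3 apples, and buys 2 more.  How many apples does he have?",
    "A train travels 60 miles in 2 hours. What is its speed?",
]
print(contamination_rate(train, test))  # 0.5
```

Exact matching after normalization only catches near-verbatim copies. Paraphrased or templated overlap would need a fuzzier measure, such as n-gram overlap.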