felipemaiapolo / tinyBenchmarks

Evaluating LLMs with fewer examples
MIT License

Creating my own tiny benchmark on a different dataset #6

Closed show981111 closed 1 day ago

show981111 commented 2 months ago

Hi, thank you for the awesome work!

I am planning to create a tiny version of the new dataset (https://huggingface.co/datasets/HuggingFaceM4/VQAv2) for large multi-modal models (LMMs). I read how you created the IRT model, and it seems like we need a lot of correctness data produced by existing LLMs (from the paper, that is L_tr, I think?), which you downloaded from Hugging Face. Similarly, if I wanted to do this for LMMs, would I need correctness data from different LMMs? If so, how many would be a good number?
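For context, I'm imagining the correctness data as a binary matrix, roughly like this sketch (the model names and outcomes here are just placeholders):

```python
# Minimal sketch of assembling a correctness matrix for IRT training.
# `lmm_results` and its contents are hypothetical placeholders.
import numpy as np

# Suppose each of M LMMs has been run on the same N VQAv2 questions,
# producing a binary correct/incorrect outcome per question.
lmm_results = {
    "lmm_a": [1, 0, 1, 1],  # 1 = answered correctly, 0 = incorrect
    "lmm_b": [1, 1, 0, 1],
    "lmm_c": [0, 0, 1, 1],
}

# Stack into a matrix of shape (num_models, num_questions), analogous to L_tr.
Y_tr = np.array([lmm_results[m] for m in sorted(lmm_results)])
print(Y_tr.shape)  # (3, 4)
```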

felipemaiapolo commented 2 months ago

Hi there,

Good to know you're thinking of creating that dataset!

In some of our tests, we used 15 LLMs for training, and that worked for HELM. But I guess it might depend on the data... I think you would have to try with a small number of LMMs and test how well it works! If it does not work well, consider increasing the amount of data. Please let me know if you need anything else!

show981111 commented 2 months ago

Hi again,

Thank you for your answer! I was testing with 5 LMMs and it seems pretty promising, so I decided to write code that pipelines the whole process, making it easy to create a tiny version of a benchmark given the results from various models and the target benchmark.

I have one question regarding the scenario (benchmark). As I understand it, a scenario represents a benchmark and a subscenario is like a category of questions within it. I noticed that when you train the IRT model, the training data Y is stacked across subscenarios (the shape of Y is number_of_models * number_of_questions from all benchmarks), which means it doesn't really matter which scenario or subscenario a question came from.

Would the accuracy (performance) of the anchor points be the same whether I train the IRT model on the results from all benchmarks and then perform clustering to get the anchor points, or train a separate IRT model on the results from each benchmark and get the anchor points from each corresponding IRT?

I find the bookkeeping of scenario positions and subscenario positions a bit confusing, so I'm wondering if I can train scenario by scenario without merging all scenarios.
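For concreteness, the bookkeeping I have in mind looks roughly like this (the subscenario names, counts, and random data are just placeholders):

```python
# Sketch of the column bookkeeping when concatenating subscenarios into one Y.
# `subscenario_Ys` is a hypothetical dict of per-subscenario correctness
# matrices, each with shape (num_models, num_questions_in_subscenario).
import numpy as np

rng = np.random.default_rng(0)
subscenario_Ys = {
    ("vqa", "counting"): rng.integers(0, 2, size=(5, 100)),
    ("vqa", "color"):    rng.integers(0, 2, size=(5, 80)),
    ("ocr", "printed"):  rng.integers(0, 2, size=(5, 120)),
}

positions, start, blocks = {}, 0, []
for key, block in subscenario_Ys.items():
    blocks.append(block)
    positions[key] = (start, start + block.shape[1])  # column range of this subscenario
    start += block.shape[1]

Y = np.hstack(blocks)  # shape (num_models, total_questions)
# `positions` maps any global column index back to its scenario/subscenario.
```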

felipemaiapolo commented 2 months ago

Good to hear you have promising results!

If you want, you can train one IRT for each scenario. There is no problem with that approach.

At test time, would people necessarily be interested in testing all scenarios at once, or are there cases in which they are interested in individual scenarios as well? If people are only interested in testing all scenarios at once, using one big IRT model can be beneficial; in our work, that leads to better performance. However, even in that case, we still pick anchor points for each scenario.
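To illustrate what I mean by per-scenario anchors with one big IRT model, here is a rough sketch, not our exact implementation: `item_params` stands for per-question representations derived from the fitted IRT (e.g., difficulty and discrimination), and the clustering details may differ from ours.

```python
# Hedged sketch: cluster one scenario's IRT item representations and take
# the item nearest each cluster center as an anchor question.
import numpy as np
from sklearn.cluster import KMeans

def anchors_for_scenario(item_params, col_range, n_anchors=10):
    start, end = col_range
    scenario_items = item_params[start:end]  # items of this scenario only
    km = KMeans(n_clusters=n_anchors, n_init=10).fit(scenario_items)
    anchors = []
    for center in km.cluster_centers_:
        idx = np.argmin(np.linalg.norm(scenario_items - center, axis=1))
        anchors.append(start + idx)  # map back to the global column index
    return sorted(set(anchors))

# Illustrative usage: 300 items with 2-D representations, anchors for columns 0-119.
item_params = np.random.randn(300, 2)
print(anchors_for_scenario(item_params, col_range=(0, 120), n_anchors=5))
```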

Does this make sense?

show981111 commented 2 months ago

Thank you for the answer, and that does make sense! I ended up choosing to train on multiple scenarios at the same time since it is more flexible.

Also, one thing I noticed was that the performance is affected if the order of questions in the training data changes. For example, when I initially ordered the questions in the training data randomly, without stacking them by subscenario, the accuracy was lower than when the data was ordered. Is this expected?

felipemaiapolo commented 2 months ago

Hello,

The ordering of the questions should not affect your results. However, there is some randomness in the training process, so there could be small differences. How are things going?

Please remember to always use the same ordering in training and test.
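For example, you could freeze the ordering once and reuse it for every run (a minimal sketch with illustrative names):

```python
# Save a fixed question ordering once, then reload it so train and test
# columns always line up; the IDs and data here are illustrative.
import json
import numpy as np

rng = np.random.default_rng(0)
question_ids = [f"q{i}" for i in range(300)]           # illustrative question IDs
order = rng.permutation(len(question_ids)).tolist()

with open("question_order.json", "w") as f:            # write the ordering once
    json.dump(order, f)

with open("question_order.json") as f:                 # reload it in every later run
    order = json.load(f)

Y = rng.integers(0, 2, size=(5, len(question_ids)))    # dummy correctness matrix
Y_ordered = Y[:, order]                                # identical ordering everywhere
```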

show981111 commented 2 months ago

Things are working great. I am increasing the number of models and supported datasets, and the results look really good. If you don't mind, I would love to clean up my code and release it as a Python package or a sub-project of tinyBenchmarks, so that people can easily create a tiny benchmark with their own data. What do you think?

felipemaiapolo commented 2 months ago

Hi there,

Good to hear!! About releasing your code/results, it's up to you. If you decide to make it available through tinyBenchmarks, that'd be amazing! Count on us!