Closed: jswift24 closed this issue 8 months ago
Thanks for these questions. There's no easy way to measure the quality of the synthetic data generated by Bonito directly; instead, we measure it indirectly by training the end model on the synthetic data for the target task and reporting the F1 score on that task. Next, we have not evaluated the quantized model, but we expect the original model to perform best. Finally, we have experiments where we replace Bonito with existing LLMs such as GPT-4 and Mistral-Instruct-v2 (see Section 7 and Appendices B and C in the paper). We found that these LLMs can often improve the performance of the target LLM, but not by as much as Bonito, and sometimes they even hurt performance when the target LLM is already instruction tuned. Hope this answers all your questions.
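For reference, here is a minimal sketch of the generation step being evaluated, following the `Bonito` / `generate_tasks` interface shown in this repo's README; the dataset slice and sampling parameters below are illustrative rather than the exact settings used in the paper's experiments:

```python
from bonito import Bonito
from vllm import SamplingParams
from datasets import load_dataset

# Load the released Bonito checkpoint
bonito = Bonito("BatsResearch/bonito-v1")

# A small slice of unannotated text; any dataset with a text column works
unannotated_text = load_dataset(
    "BatsResearch/bonito-experiment", "unannotated_contract_nli"
)["train"].select(range(10))

# Convert the unannotated text into synthetic instruction-tuning examples (here: NLI)
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    unannotated_text,
    context_col="input",
    task_type="nli",
    sampling_params=sampling_params,
)
```

The quality check is then indirect: fine-tune the target model on `synthetic_dataset` and report F1 on the held-out target task, as described above.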
@nihalnayak and team: Thanks for a really interesting paper! How do you think about the quality of synthetic data produced by Bonito?
For example, if I use the quantized model in Colab, are the outputs any worse than in the original? How much worse?
What if I skip the Bonito pipeline and just ask some LLM "Create a question and answer pair from <some unannotated text>" -- will I get output that is just as good as what Bonito produces?

Thanks! Alon