zwcolin opened this issue 2 months ago
Yes, I think we should do this, but we are currently out of bandwidth for it.
We would be very grateful if anyone could try to implement it.
Thanks for the message, Bo! I raised this issue because I'm trying to integrate our recently released benchmark, which already includes results for 35 models (https://charxiv.github.io/), into your repo, but I could not test my integration because model generation and response grading cannot be run as two separate steps. I wouldn't mind helping implement this functionality, but it'd be great if someone from your team could help coordinate. I have also fixed some bugs in your process_docs
in my fork. Let me know if we can make this happen, and thanks for all the effort on this awesome project!
Hi @zwcolin, do you think the from_log model can help you with this? You can first generate output logs using the --predict-only
flag, and then use the from_log
model to aggregate and process the results.
You can check the last section of this blog for more detail.
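The two-phase workflow described above might look like the following sketch. The exact entry point, model names, task name, and output-path flag below are assumptions for illustration based on this thread, not confirmed lmms-eval interface; only `--predict-only` and the `from_log` model are taken from the comment itself:

```shell
# Phase 1 -- on a compute node WITHOUT Internet access:
# run inference only and write the raw model outputs to a log directory.
# (entry point, --model, --tasks, and --output_path are illustrative)
python -m lmms_eval \
    --model llava \
    --tasks mmvet \
    --predict-only \
    --output_path ./logs/

# Phase 2 -- on a node WITH Internet access (for GPT-assisted grading):
# point the from_log model at the saved logs to aggregate and score them.
# (how from_log locates the logs is assumed here; check the docs)
python -m lmms_eval \
    --model from_log \
    --tasks mmvet \
    --output_path ./logs/
```

This keeps the GPU-heavy generation step offline and defers everything that needs the OpenAI API to the second invocation.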
@zwcolin Great! You can join our Discord for a closer discussion with our team.
You can ping me there; I am BobaGPT on that channel.
Hi Kaichen, thanks for the pointer! I'll try this later today and see if it works.
I joined your Discord a few days ago and I'll ping you later :)
Thanks for the awesome codebase! I wonder if you have a sample script that would allow one to break down model evaluation into two separate procedures: generating responses and validating responses. Many benchmarks, e.g., MMBench, MM-Vet, HallusionBench, and MathVista, use GPT-assisted response validation. However, many compute clusters do not grant Internet access to compute nodes, and conversely, the nodes with Internet access have no compute. Therefore, it would be helpful to split model evaluation into a response-generation phase and a response-validation phase when that better fits users' workflows. If such functionality is already supported, could you provide an example script? Otherwise, could you implement it? Thank you so much!