THUDM / LongBench

[ACL 2024] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
MIT License
603 stars 43 forks source link

Code for evaluation with GPT-3.5? #69

Open RuskinManku opened 1 month ago

RuskinManku commented 1 month ago

The results mention the scores of GPT-3.5 but I don't see how I can evaluate GPT using the code as it doesn't have that model.

bys0318 commented 1 month ago

The GPT-3.5-Turbo-16k model evaluated in our paper has already been deprecated. You can try gpt-3.5-turbo-0125 (16k), or the most recent gpt-4o-mini (128k), according to OpenAI (https://platform.openai.com/docs/models).

RuskinManku commented 1 month ago

Thanks for responding. Yes I can evaluate those, but I didn't find code where I can just change the open ai model and evaluate different ones.

bys0318 commented 1 month ago

Right. We didn't provide code for evaluating API models. You can modify the get_pred() fucntion to do so.