databricks / dbrx

Code examples and resources for DBRX, a large language model developed by Databricks
https://www.databricks.com/

Bad performance on PrOntoQA benchmark #28

Open huskydoge opened 2 months ago

huskydoge commented 2 months ago

PrOntoQA is a question-answering dataset that generates examples with chains-of-thought that describe the reasoning required to answer the questions correctly. The sentences in the examples are syntactically simple and amenable to semantic parsing. It can be used to formally analyze the predicted chain-of-thought from large language models.

I have tested DBRX-Base on the GSM8K, AQuA, and StrategyQA datasets using 4-shot CoT prompting, and its performance is satisfactory compared to other models (GPT-4, Claude Opus, Llama 70B, etc.).

Nevertheless, when I test the models on PrOntoQA, the results are much weaker: dbrx-instruct achieves only 24.2% accuracy, and dbrx-base does worse. Some of the dbrx-base errors may come from output processing, but dbrx-instruct does not suffer from endless generation and still fails to perform well.
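For anyone trying to reproduce this, here is a minimal sketch of the kind of output processing involved. It is not the exact script behind the numbers above. It assumes the gold labels are the strings "True" and "False", that each few-shot example begins with "Q:", and that the final answer is the last true/false token in the completion. The names `STOP_MARKER`, `extract_answer`, and `accuracy` are hypothetical.

```python
import re
from typing import Optional, Sequence

# Hypothetical stop marker: assumes every few-shot example starts with "Q:",
# so a "Q:" on a new line means the model has started inventing another example.
STOP_MARKER = "\nQ:"


def extract_answer(generation: str) -> Optional[str]:
    """Return "True" or "False" from a chain-of-thought completion, or None."""
    # Drop everything after the model begins a new question (runaway generation).
    first = generation.split(STOP_MARKER, 1)[0]
    # Use the last true/false token, assuming the answer follows the reasoning.
    matches = re.findall(r"\b(true|false)\b", first, flags=re.IGNORECASE)
    return matches[-1].capitalize() if matches else None


def accuracy(generations: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of generations whose extracted answer equals the gold label."""
    if not labels:
        return 0.0
    correct = sum(extract_answer(g) == y for g, y in zip(generations, labels))
    return correct / len(labels)
```

Completions with no recognizable answer count as wrong here, so a low score from a base model can reflect parsing failures rather than reasoning failures. Logging the fraction of `None` results separately would help tell the two apart.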

Therefore, I want to know whether there is an official test result on PrOntoQA for others to take as a reference.

Thanks!

hanlint commented 2 months ago

Hello @huskydoge, we have not tried PrOntoQA yet, but we will let you know if we do!