Bad performance on PrOntoQA benchmark

PrOntoQA is a question-answering dataset that generates examples with chains-of-thought that describe the reasoning required to answer the questions correctly. The sentences in the examples are syntactically simple and amenable to semantic parsing. It can be used to formally analyze the predicted chain-of-thought from large language models.

I have tested DBRX-Base on the GSM8K, AQuA, and StrategyQA datasets using 4-shot CoT prompting, and its performance is satisfactory compared to other models (GPT-4, Claude Opus, Llama 70B, etc.).

However, when I test the models on PrOntoQA, the results are much weaker: dbrx-instruct achieves only 24.2% accuracy, and dbrx-base is worse. Some of the dbrx-base errors might come from output processing, since it can run into endless generation. dbrx-instruct does not have the endless-generation problem, but it still fails to achieve good performance.
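To make the output-processing concern concrete, here is a minimal sketch of the kind of post-processing I mean. It is not the exact evaluation script. It assumes a few-shot prompt where each example starts with `Q:` and the final answer is the last "True" or "False" in the generation; the helper names `extract_answer` and `accuracy` and the `"\nQ:"` stop marker are illustrative assumptions, not part of PrOntoQA or DBRX.

```python
import re
from typing import List, Optional


def extract_answer(generation: str, stop_marker: str = "\nQ:") -> Optional[str]:
    """Return 'True' or 'False' from a chain-of-thought generation, or None."""
    # Assumption: a base model that keeps generating will start a new few-shot
    # example with the stop marker, so everything after it is discarded.
    first_example = generation.split(stop_marker, 1)[0]
    # Assumption: the final answer is the last True/False token in the reasoning.
    matches = re.findall(r"\b(true|false)\b", first_example, flags=re.IGNORECASE)
    return matches[-1].capitalize() if matches else None


def accuracy(generations: List[str], labels: List[str]) -> float:
    """Fraction of generations whose extracted answer equals the gold label."""
    correct = sum(extract_answer(g) == y for g, y in zip(generations, labels))
    return correct / len(labels)
```

A generation with no recognizable answer counts as wrong here, so unparseable outputs lower the score rather than being silently dropped. If the parsing differs from this (for example, taking the first match instead of the last), the reported accuracy can shift noticeably.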

Therefore, I would like to know whether there are official DBRX results on PrOntoQA that others can use as a reference.

Thanks!

databricks/dbrx, issue #28