Evaluation of PaLM models via bAbI tasks

Do PaLM models really ‘understand’ language, especially its logical-reasoning aspects?

Folks at Facebook AI Research came up with a set of seemingly easy, toy-ish, childish tasks (called ‘bAbI’) to gauge the ‘understanding’ of a question-answering system, such as a chatbot. I tried them on PaLM today, and here is the result:

Prompt:

Mary went to the bathroom.
John moved to the hallway.
Mary travelled to the office.
Where is Mary?

Bard Response:

Mary is in the office.

The passage states that Mary went to the bathroom, then John moved to the hallway, and then Mary travelled to the office. This means that Mary is currently in the office. The bathroom and the hallway are both in the same building, but the office is a separate location. Therefore, Mary is not in the bathroom or the hallway, she is in the office.

Can we set up a benchmark test set to evaluate the efficacy of PaLM models on these tasks?
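To make the ask concrete, here is a minimal sketch of what such a benchmark loop could look like. It assumes the plain-text bAbI file format (numbered story lines, with question lines carrying a tab-separated answer and supporting-fact IDs). The names `parse_babi`, `is_correct`, `evaluate` and the `ask_model` callable are hypothetical placeholders of mine, not part of any existing script or PaLM API. Scoring only looks for the expected word in the first line of the response, since a verbose explanation (like the one above) tends to mention every location in the story.

```python
import re
from typing import Callable, Iterator

# One tiny story in bAbI file format: "<id> <sentence>" for facts,
# "<id> <question>\t<answer>\t<supporting fact ids>" for questions.
SAMPLE = (
    "1 Mary went to the bathroom.\n"
    "2 John moved to the hallway.\n"
    "3 Mary travelled to the office.\n"
    "4 Where is Mary? \toffice\t3\n"
)


def parse_babi(text: str) -> Iterator[tuple[str, str, str]]:
    """Yield (story_so_far, question, answer) triples from bAbI-format text."""
    story: list[str] = []
    for raw in text.strip().splitlines():
        if not raw.strip():
            continue  # tolerate blank lines
        line_id, _, body = raw.strip().partition(" ")
        if int(line_id) == 1:
            story = []  # an id of 1 starts a new story
        if "\t" in body:
            question, answer, *_ = body.split("\t")
            yield "\n".join(story), question.strip(), answer.strip()
        else:
            story.append(body)


def is_correct(response: str, answer: str) -> bool:
    """Lenient check: expected answer appears as a word in the first line."""
    lines = response.strip().splitlines()
    first_line = lines[0] if lines else ""
    pattern = rf"\b{re.escape(answer)}\b"
    return re.search(pattern, first_line, re.IGNORECASE) is not None


def evaluate(text: str, ask_model: Callable[[str], str]) -> float:
    """Return the fraction of questions answered correctly.

    `ask_model` is a hypothetical stand-in for whatever calls the LLM.
    """
    examples = list(parse_babi(text))
    if not examples:
        return 0.0
    hits = sum(
        is_correct(ask_model(f"{story}\n{question}"), answer)
        for story, question, answer in examples
    )
    return hits / len(examples)
```

Plugging in a real model would mean replacing `ask_model` with a function that sends the prompt to the model and returns its text; the first-line heuristic is deliberately crude and would likely need tightening for tasks with multi-word or list answers.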

Here is an evaluation of ChatGPT against the same tasks.

Here is a script to evaluate any (publicly available) LLM against the bAbI tasks. Would you want to incorporate the script? @polong-lin @rominirani

Evaluation of PaLM models via bAbI tasks #110