GoogleCloudPlatform / generative-ai

Sample code and notebooks for Generative AI on Google Cloud, with Gemini on Vertex AI
https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview
Apache License 2.0

Evaluation of PaLM models via bAbI tasks #110

Closed yogeshhk closed 1 year ago

yogeshhk commented 1 year ago

Do PaLM models really ‘understand’ language, especially its logical-reasoning aspect?

Folks at Facebook AI Research came up with a set of seemingly easy, toy-like tasks (called ‘bAbI’) to gauge the ‘understanding’ of a question-answering system such as a chatbot. I tried one of them on PaLM today, and here is the result:

Prompt:

Mary went to the bathroom.
John moved to the hallway.
Mary travelled to the office.
Where is Mary?

Bard Response:

Mary is in the office.

The passage states that Mary went to the bathroom, then John moved to the hallway, and then Mary travelled to the office. This means that Mary is currently in the office. The bathroom and the hallway are both in the same building, but the office is a separate location. Therefore, Mary is not in the bathroom or the hallway, she is in the office.

Can we set up a benchmark test set to evaluate the efficacy of PaLM models on these tasks?

Here is an evaluation of ChatGPT against the same tasks.

Here is a script to evaluate any (publicly available) LLM against the bAbI tasks. Would you want to incorporate the script? @polong-lin @rominirani
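To make the proposal concrete, a benchmark harness of this kind could look roughly like the sketch below. It is not the script linked above, just a minimal illustration. It assumes the standard bAbI text format (each line starts with an ID that restarts at 1 for a new story, and question lines carry the question, answer, and supporting-fact IDs separated by tabs). The names `parse_babi`, `evaluate`, and `ask_model` are hypothetical; `ask_model` stands in for whatever function sends a prompt to the model under test and returns its reply.

```python
from typing import Callable, Iterable, List, Tuple

# (story so far, question, expected answer)
Example = Tuple[str, str, str]


def parse_babi(lines: Iterable[str]) -> List[Example]:
    """Parse bAbI-format lines into (story, question, answer) examples."""
    examples: List[Example] = []
    story: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        idx, text = line.split(" ", 1)
        if int(idx) == 1:
            # Line IDs restart at 1 whenever a new story begins.
            story = []
        if "\t" in text:
            # Question line: question<TAB>answer<TAB>supporting fact IDs.
            question, answer = text.split("\t")[:2]
            examples.append(("\n".join(story), question.strip(), answer.strip()))
        else:
            story.append(text)
    return examples


def evaluate(examples: List[Example], ask_model: Callable[[str], str]) -> float:
    """Return the fraction of examples whose reply mentions the expected answer.

    Assumption: a reply counts as correct if it contains the expected answer
    as a case-insensitive substring. This is lenient; a verbose reply that
    names several locations could be scored as correct by accident.
    """
    if not examples:
        return 0.0
    correct = 0
    for story, question, answer in examples:
        reply = ask_model(f"{story}\n{question}")
        if answer.lower() in reply.lower():
            correct += 1
    return correct / len(examples)


# The example from this issue, written in bAbI file format.
SAMPLE = [
    "1 Mary went to the bathroom.",
    "2 John moved to the hallway.",
    "3 Mary travelled to the office.",
    "4 Where is Mary?\toffice\t3",
]


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always gives one reply."""
    return "Mary is in the office."


examples = parse_babi(SAMPLE)
accuracy = evaluate(examples, fake_model)
```

Swapping `fake_model` for a function that calls a real model would give a per-task accuracy figure. A stricter scorer (for example, asking the model for a one-word answer and comparing exactly) would be a reasonable refinement, since long explanatory replies tend to mention more than one location.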

polong-lin commented 1 year ago

Hi @yogeshhk, that sounds like an interesting thing to see, but comparing LLMs from different providers is not something we'll be looking to incorporate in this GitHub repo. Thanks for the suggestion though!