JBoRu / StructGPT

The code and data for "StructGPT: A general framework for Large Language Model to Reason on Structured Data"
Apache License 2.0

The experimental results of ChatGPT on TabFact #1

Closed SivilTaram closed 1 year ago

SivilTaram commented 1 year ago

Thanks for your work on integrating large language models with natural language interfaces! I'm a little shocked by the performance of ChatGPT on TabFact, as it is much higher than what I got from my own testing in March; the same goes for the WikiSQL-Weak dataset.


So I'm wondering about the following details, and would much appreciate it if you could share more information:

Thanks in advance for sharing!

Best, Qian

JBoRu commented 1 year ago

Hi Dr. Liu, thanks for your attention. We queried ChatGPT in May to test its zero-shot performance. For the hyperparameters, we only set the temperature to 0 to ensure reproducibility across multiple experiments; all other parameters were left at their default values:

{
    temperature=0,
    max_tokens=350,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
}
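Concretely, the call looks roughly like this (a minimal sketch using the legacy openai Python client, version < 1.0; the model name and the wrapper function are illustrative, not our exact code):

import openai  # legacy client (assumption)

# openai.api_key = "..."  # set your API key first

def query_chatgpt(prompt, model="gpt-3.5-turbo"):
    # Issue a single zero-shot query with the settings listed above.
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,           # fixed for reproducibility
        max_tokens=350,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response["choices"][0]["message"]["content"]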

To ensure a fair comparison between the ChatGPT baseline and our proposed StructGPT, we use the same prompt for the final question answering. On TabFact, the prompt is:

The ordered list below shows the rows of a table, with each line displaying a different row along with its corresponding column name and value pairs. The format for each pair is (column name, value). The table contains: {table_content} According to the table, is the statement "{question}" true? If you think the statement is true, output "Yes". If you think the statement is not true, output "No".

For the table_content, we directly serialize each row and concatenate the rows with "\n", as follows:

item 1: (date, january 2); (visitor, detroit); (score, 4 - 1); (home, carolina); (decision, joseph); (attendance, 17053); (record, 24 - 12 - 4 - 1)
item 2: (date, january 3); (visitor, anaheim); (score, 1 - 3); (home, detroit); (decision, legace); (attendance, 20066); (record, 25 - 12 - 4 - 1)
...
item 14: (date, january 31); (visitor, carolina); (score, 4 - 4); (home, detroit); (decision, legace); (attendance, 20066); (record, 30 - 15 - 8 - 2)
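For clarity, the serialization can be produced roughly like this (an illustrative sketch with an assumed input format, not our released code):

def serialize_table(header, rows):
    # header: list of column names; rows: list of cell-value lists (assumed input format).
    items = []
    for i, row in enumerate(rows, start=1):
        pairs = "; ".join(f"({col}, {val})" for col, val in zip(header, row))
        items.append(f"item {i}: {pairs}")
    # Concatenate the serialized rows with "\n", as described above.
    return "\n".join(items)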

We think the prompt design may be an important factor affecting ChatGPT's results. Besides, the label verbalizer used to normalize ChatGPT's predictions may also be a factor. We require ChatGPT to generate only "Yes" or "No" for each query, but it sometimes generates additional explanations. Since manually checking every prediction is impractical, we designed a label verbalizer function to normalize ChatGPT's predictions before evaluation. The label verbalizer function and the final evaluation function are shown below:

def label_verbalizer(prediction_response):
    # Map a raw ChatGPT response to one of: 'entailed', 'refuted', 'unknown'.
    response = prediction_response.lower()
    if 'yes' in response and 'no' in response:
        # Ambiguous response that contains both answers.
        final_answer = 'unknown'
    elif 'yes' in response:
        final_answer = 'entailed'
    elif 'no' in response:
        final_answer = 'refuted'
    else:
        final_answer = 'unknown'
    return final_answer

def judge_prediction(prediction, gold):
    # prediction may be one of 'entailed', 'refuted', or 'unknown'
    # gold may be either 'entailed' or 'refuted'
    if prediction.strip() == gold.strip():
        return True
    else:
        return False
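
For reference, the overall accuracy is then computed roughly as follows (an illustrative sketch, not our exact evaluation script):

def accuracy(responses, golds):
    # responses: raw ChatGPT outputs; golds: gold labels ('entailed' or 'refuted').
    correct = sum(
        judge_prediction(label_verbalizer(resp), gold)
        for resp, gold in zip(responses, golds)
    )
    return correct / len(golds)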

For the result on WikiSQL, we think the prompt design and the verbalizer function are also important factors affecting ChatGPT's results. We will release the full data and code as soon as possible. We hope this resolves your confusion!

Best, jinhao

SivilTaram commented 1 year ago

@JBoRu Thanks Jinhao for the detailed explanation! Yes, I also ran into the additional-explanation issue, so I chose to prompt the model to strictly follow my instruction and output the answer without any explanation. I think one key difference may be the table representation: you use the (header, value) form, while our paper uses a straightforward markdown-like format. I will try the same prompt to see whether the model's performance has improved from March to May (which is also possible). Attached is some info from my side for your reference:

The following table is about {caption}.

Table: {table}

Statement: {question}

And a table example is as below:

date | result | score | brazil scorers | competition
may 11 , 1919 | w | 6 - 0 | friedenreich (3) , neco (2) , haroldo | south american championship
may 18 , 1919 | w | 6 - 1 | heitor , amílcar (4), millon | south american championship
may 26 , 1919 | w | 5 - 2 | neco (5) | south american championship
may 30 , 1919 | l | 1 - 2 | jesus (1) | south american championship
june 2nd , 1919 | l | 0 - 2 | - | south american championship
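
I build this linearization roughly as follows (a quick sketch; the helper name is just illustrative, not my exact code):

def linearize_table(header, rows):
    # One line for the header, then one line per row; cells are joined with " | ".
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)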



BTW, do you employ a similar table representation for WikiSQL & WikiTableQuestions as well? Thanks again for your quick response!

JBoRu commented 1 year ago

Hi Dr. Liu, the design of the table serialization is inspired by KG triple serialization, and we keep it the same across all three TableQA datasets. So you can test whether it is a useful prompt format.