THUDM / LongBench

[ACL 2024] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

The "anwser" for some examples in "qasper.jsonl" is strange #67

Zcchill opened this issue 4 months ago

Zcchill commented 4 months ago

I downloaded the data from the official URL and found that the "answers" of several examples in "qasper.jsonl" are confusing. Here are several examples:

{"pred": "No", "answers": ["Yes", "No"], "all_classes": null, "length": 2317, "input": "Does this method help in sentiment classification task improvement?", "_id": "bcfe56efad9715cc714ffd2e523eaa9ad796a453e7da77a6"}
{"pred": "unanswerable", "answers": ["Yes", "Unanswerable"], "all_classes": null, "length": 2284, "actual_length": 3533, "input": "Is jiant compatible with models in any programming language?", "_id": "e5d1d589ddb30f43547012f04b06ac2924a1f4fdcf56daab"}
{"pred": "BERTBase", "answers": ["BERTbase", "BERTbase"], "all_classes": null, "length": 3852, "actual_length": 5701, "input": "What BERT model do they test?", "_id": "2a51c07e65a9214ed2cd3c04303afa205e005f4e1ccb172a"}
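For context, contradictory references like ["Yes", "No"] also affect scoring: as I understand eval.py, the score for each example is the maximum metric value over all reference answers, so any yes/no prediction gets full credit on the first example. A minimal sketch of that behavior (my own simplified stand-in for the repo's QA F1, not the actual implementation):

```python
# Simplified stand-in for LongBench's token-level QA F1; the real metric lives
# in metrics.py, this only illustrates the max-over-references behavior.
def qa_f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = set(pred_tokens) & set(gt_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

def score_example(prediction: str, answers: list[str]) -> float:
    # The example score is the best match against ANY reference answer,
    # so references ["Yes", "No"] let both "yes" and "no" score 1.0.
    return max(qa_f1_score(prediction, ans) for ans in answers)

print(score_example("No", ["Yes", "No"]))  # 1.0
```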

Zcchill commented 4 months ago

Another example: "_id": "d1aa1132439bd292965634095bf1c9943e062bb6645ff78c". The query is "how many tags do they look at?" The given answer seems to be sourced from "We employ two sources of e-book annotation data: (i) editor tags, and (ii) Amazon search terms. For editor tags, we collect data of 48,705 e-books from 13 publishers, namely Kunstmann, Delius-Klasnig, VUR, HJR, Diogenes, Campus, Kiwi, Beltz, Chbeck, Rowohlt, Droemer, Fischer and Neopubli." But I think "30 tags" is a better answer, based on "As shown in Table TABREF3 , we collect Amazon review keywords for 2,896 e-books (publishers: Kiwi, Rowohlt, Fischer, and Droemer), which leads to 33,663 distinct review keywords and on average 30 keyword assignments per e-book."

bys0318 commented 4 months ago

Thanks for your keen observation. We sample the data directly from the test set of Qasper, so we suggest asking the authors of Qasper about these annotations.

Zcchill commented 4 months ago

Besides, I would like to replicate the results of "GPT-3.5-Turbo-16k" in the paper, but the results I get are not very close to those reported. I wonder what the possible reasons are, since there is no official code for the API method. The results I get are as follows:

{
  "2wikimqa": { "0-4k": 57.09, "4-8k": 42.82, "8k+": 32.71 },
  "hotpotqa": { "0-4k": 68.44, "4-8k": 57.25, "8k+": 55.38 },
  "multi_news": { "0-4k": 28.57, "4-8k": 23.34, "8k+": 22.31 },
  "qasper": { "0-4k": 47.3, "4-8k": 43.97, "8k+": 28.35 },
  "multifieldqa_en": { "0-4k": 57.15, "4-8k": 51.67, "8k+": 57.52 },
  "gov_report": { "0-4k": 31.79, "4-8k": 28.82, "8k+": 27.34 }
}

Experiment setting:

  1. I use the API provided by AzureOpenAI.
  2. The system prompt is empty: [{"role":"system","content":''}, {"role":"user","content":prompt}]
  3. Inference hyper-parameters (a fuller, self-contained sketch follows below this list):
     completion = client.chat.completions.create(
         model="gpt-35-turbo-16k",
         messages=input,
         temperature=0.0,
         max_tokens=max_tokens,
         stop=stop_token,
     )
     response = completion.choices[0].message.content
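For completeness, here is a minimal, self-contained version of my setup (the endpoint, API version, and deployment name are placeholders for my own configuration, please adjust them to yours):

```python
# Minimal sketch of my Azure OpenAI call; endpoint, key, api_version, and the
# deployment name are placeholders standing in for my own configuration.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<your-api-key>",                                   # placeholder
    api_version="2023-07-01-preview",                           # placeholder
)

prompt = "<prompt built from dataset2prompt and the example>"   # placeholder
max_tokens = 128          # per-dataset generation budget
stop_token = None         # optional stop sequence

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": prompt},
]

completion = client.chat.completions.create(
    model="gpt-35-turbo-16k",  # Azure deployment name
    messages=messages,
    temperature=0.0,
    max_tokens=max_tokens,
    stop=stop_token,
)
response = completion.choices[0].message.content
print(response)
```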
bys0318 commented 4 months ago

This might be due to model iteration. We tested GPT-3.5-Turbo-16k in August 2023. I think it has a different version now.
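If that drift matters for your comparison, one option (an assumption on my part, since availability depends on your Azure region and deployment setup) is to call a deployment backed by a dated model snapshot rather than the rolling alias, for example:

```python
# Hypothetical: a deployment pinned to a dated snapshot (e.g. version 0613)
# should change less between runs than the rolling "gpt-35-turbo-16k" alias.
completion = client.chat.completions.create(
    model="gpt-35-turbo-16k-0613",  # placeholder deployment name
    messages=messages,
    temperature=0.0,
    max_tokens=max_tokens,
)
```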

Zcchill commented 4 months ago

"You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write \"unanswerable\". If the question is a yes/no question, answer \"yes\", \"no\", or \"unanswerable\". Do not provide any explanation.\n\nArticle: {context}\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write \"unanswerable\". If the question is a yes/no question, answer \"yes\", \"no\", or \"unanswerable\". Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:" The instruction for qasper tasks in dataset2prompt seems redundent, is this a mistake or a deliberate strategy to emphasize the task at both the beginning and the end of a long text (due to position bias)?

bys0318 commented 4 months ago

You're right. We want to emphasize the task instruction, so we insert the instruction at both the start and the end of the input.
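For reference, this is roughly how the prompt gets assembled in pred.py (a simplified sketch; I'm assuming the config path and the {context}/{input} keys from the repo's config files, and abbreviating the surrounding loop):

```python
import json

# Simplified sketch: dataset2prompt.json stores one template per dataset, and
# the qasper template repeats the instruction before the article and again
# right before the question, so the model sees the task at both ends.
with open("config/dataset2prompt.json") as f:
    dataset2prompt = json.load(f)

example = {
    "context": "<full article text>",           # placeholder
    "input": "What BERT model do they test?",   # placeholder question
}
prompt = dataset2prompt["qasper"].format(**example)
```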