THUDM / LongBench

[ACL 2024] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
MIT License

Chinese Examples in MultiFieldQA-en #64

Open wendywangwwt opened 6 months ago

wendywangwwt commented 6 months ago

Hi! I'm working on a long document QA problem and looked into the MultiFieldQA-en dataset recently.

I downloaded the dataset using the following code snippet:

from datasets import load_dataset

# Load the MultiFieldQA-en subset of LongBench from the Hugging Face Hub
dataset = load_dataset("THUDM/LongBench", "multifieldqa_en")

While examining the content, I noticed that 2 of the 150 entries are in Chinese rather than English. [Screenshot 2024-05-05 at 4 27 36 PM]

Can you please take a look? Thank you!

bys0318 commented 6 months ago

Hi! They are classified as English samples as they contain more English characters (a-zA-Z) than Chinese characters.
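
For readers following along, here is a minimal sketch of the kind of character-count heuristic described above. It is not the actual LongBench code. The function names `count_chars` and `classify_language` are hypothetical, and it assumes "Chinese characters" means the CJK Unified Ideographs block (U+4E00 to U+9FFF). It shows how a sample with Chinese text can still be labeled English when it contains more Latin letters.

```python
import re

# Assumption: "English characters" are ASCII letters (a-zA-Z) and
# "Chinese characters" are code points in U+4E00..U+9FFF.
ENGLISH_CHARS = re.compile(r"[a-zA-Z]")
CHINESE_CHARS = re.compile(r"[\u4e00-\u9fff]")


def count_chars(text):
    """Return (english_count, chinese_count) for a string."""
    english = len(ENGLISH_CHARS.findall(text))
    chinese = len(CHINESE_CHARS.findall(text))
    return english, chinese


def classify_language(text):
    """Label text "en" if ASCII letters outnumber Chinese characters, else "zh"."""
    english, chinese = count_chars(text)
    return "en" if english > chinese else "zh"


# A mixed sample: 2 Chinese characters but many more Latin letters.
mixed = "模型 training uses PyTorch"
print(count_chars(mixed), classify_language(mixed))
```

Under a rule like this, a document written in Chinese that embeds a lot of Latin text (URLs, code, product names) can land in the English split, which would be consistent with the 2 entries reported above.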