StonyBrookNLP / ircot

Repository for Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions, ACL23
https://arxiv.org/abs/2212.10509
Apache License 2.0

Dataset encoding format #20

Open foreverlove944 opened 6 months ago

foreverlove944 commented 6 months ago

What encoding is used for the dataset you provided? I opened it as UTF-8. English characters display correctly, but Russian and other non-English text comes out garbled.

[Screenshot 2024-04-03 205121]

HarshTrivedi commented 3 months ago

The contexts/paragraphs were taken as-is from the original source datasets, so any encoding problems come from there. However, I did apply ftfy at runtime; see commaqa/inference/dataset_readers.py for an example. You might want to give it a try.
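
For anyone landing here with the same problem, below is a minimal sketch of what "applying ftfy at runtime" could look like. It is not the code from commaqa/inference/dataset_readers.py. The helper names `repair_mojibake` and `clean_paragraph` are hypothetical, and the assumption that the garbling comes from UTF-8 bytes being decoded as Latin-1 is mine, not something confirmed for this dataset. `ftfy.fix_text` is ftfy's real entry point; the stdlib fallback only handles that one simple case.

```python
def repair_mojibake(text: str, wrong_encoding: str = "latin-1") -> str:
    """Undo one round of UTF-8 bytes that were mis-decoded as `wrong_encoding`.

    Hypothetical helper: covers only the simplest kind of mojibake.
    Text that does not fit this pattern is returned unchanged.
    """
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside `wrong_encoding` (so it was
        # probably never mis-decoded), or the bytes are not valid UTF-8.
        return text


def clean_paragraph(text: str) -> str:
    """Clean one paragraph at load time, preferring ftfy when it is installed."""
    try:
        import ftfy  # third-party: pip install ftfy
    except ImportError:
        return repair_mojibake(text)
    return ftfy.fix_text(text)


# Simulate the assumed failure: Russian text stored as UTF-8, read as Latin-1.
original = "Москва"
garbled = original.encode("utf-8").decode("latin-1")
repaired = repair_mojibake(garbled)
```

When reading the files yourself, it is also worth passing `encoding="utf-8"` to `open()` explicitly, since the platform default (especially on Windows) may be a different code page and can produce this kind of garbling on its own.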