Encoding problem for certain datasets

JonathanJao commented 2 years ago

Hi, I tried to follow the README, but building the gym fails on certain datasets. As the README suggests, I ran some of the task Python scripts individually, and quite a few of them fail with an encoding error. For instance, when I run python mocha.py in the tasks directory, I get this error message:

Traceback (most recent call last):
  File "mocha.py", line 34, in <module>
    main()
  File "mocha.py", line 31, in main
    train, dev, test = dataset.generate_k_shot_data(k=16, seed=seed, path="../data/")
  File "/scratch/jj3706/CrossFit/tasks/fewshot_gym_dataset.py", line 59, in generate_k_shot_data
    self.write_to_tsv(k_shot_train, prefix + "_train.tsv")
  File "/scratch/jj3706/CrossFit/tasks/fewshot_gym_dataset.py", line 8, in write_to_tsv
    fout.write("{}\t{}\n".format(line[0], line[1]))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 599: ordinal not in range(128)

Or, when I run python anli.py, I get:

Traceback (most recent call last):
  File "anli.py", line 40, in <module>
    main()
  File "anli.py", line 37, in main
    train, dev, test = dataset.generate_k_shot_data(k=16, seed=seed, path="../data/")
  File "/scratch/jj3706/CrossFit/tasks/fewshot_gym_dataset.py", line 59, in generate_k_shot_data
    self.write_to_tsv(k_shot_train, prefix + "_train.tsv")
  File "/scratch/jj3706/CrossFit/tasks/fewshot_gym_dataset.py", line 8, in write_to_tsv
    fout.write("{}\t{}\n".format(line[0], line[1]))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 47-54: ordinal not in range(128)

From what I can tell, this occurs for about half of the available datasets; the other half give no errors and are marked as successes when building the gym. Any details on how to fix this would be appreciated!
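The tracebacks above can be reproduced in isolation. The following is a minimal sketch, assuming the output file ends up being opened with an ASCII codec (the thread does not confirm how the codec was selected on the affected machine); try_write is a hypothetical helper, not part of CrossFit.

```python
import os
import tempfile


def try_write(text, encoding):
    """Write text to a temporary file with the given encoding.

    Returns None on success, or the UnicodeEncodeError that was raised.
    (Hypothetical helper for illustration; not part of CrossFit.)
    """
    fd, path = tempfile.mkstemp(suffix=".tsv")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as fout:
            fout.write(text)
        return None
    except UnicodeEncodeError as err:
        return err
    finally:
        os.remove(path)


# U+2014 (em dash) is the character named in the mocha.py traceback.
sample = "question \u2014 answer\n"

print(try_write(sample, "ascii"))  # prints the UnicodeEncodeError message
print(try_write(sample, "utf-8"))  # prints None: the write succeeded
```

Datasets whose text is pure ASCII would pass through either codec unchanged, which may explain why only some of the scripts fail.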

cherry979988 commented 2 years ago

Thanks @JonathanJao for bringing this to our attention. Let me look into this!

cherry979988 commented 2 years ago

Hi @JonathanJao

While I am working on this, could you please provide the following information?

Operating system
Python version
Output of running python -c 'import sys; print(sys.getdefaultencoding())'

Also, could you please try this fix on your side? Go to line 6 of tasks/fewshot_gym_dataset.py: https://github.com/INK-USC/CrossFit/blob/08e6381e967c065c0d10b99e89dcb9ec2a583d86/tasks/fewshot_gym_dataset.py#L6

Change this line to

    with open(out_file, "w", encoding="utf-8") as fout:

and rerun the dataset building scripts. Let me know if this works.
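For reference, the suggested change could look roughly like the sketch below. It is written as a standalone function rather than the method in fewshot_gym_dataset.py, and the example row and output path are hypothetical.

```python
import os
import tempfile


def write_to_tsv(lst, out_file):
    """Write (input, output) pairs as tab-separated lines, always encoded as UTF-8."""
    # Passing encoding explicitly avoids depending on the machine's locale.
    with open(out_file, "w", encoding="utf-8") as fout:
        for line in lst:
            fout.write("{}\t{}\n".format(line[0], line[1]))


# Hypothetical example row; the real rows come from the dataset scripts.
rows = [("premise \u2014 hypothesis", "entailment")]

out_path = os.path.join(tempfile.mkdtemp(), "example_train.tsv")
write_to_tsv(rows, out_path)

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, encoding="utf-8") as fin:
    assert fin.read() == "premise \u2014 hypothesis\tentailment\n"
```

Any code that later reads these files would need to open them with encoding="utf-8" as well, for the same reason.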

JonathanJao commented 2 years ago

Hi, thanks for replying! Here are the details you requested:

> python -c 'import sys; print(sys.getdefaultencoding())'
utf-8
> python --version
Python 3.6.9 :: Anaconda, Inc.
> lsb_release -a
LSB Version:    core-9.20170808ubuntu1-noarch:security-9.20170808ubuntu1-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.5 LTS
Release:    18.04
Codename:   bionic
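One caveat about the output above: on Python 3, sys.getdefaultencoding() always reports utf-8, so it does not show which codec open() picks when no encoding argument is given. On Python versions like the 3.6 used here, that default comes from the locale. A minimal sketch for inspecting the relevant values (report_encodings is a hypothetical helper name):

```python
import locale
import sys


def report_encodings():
    """Collect the encodings relevant to open() and print()."""
    return {
        # Used for implicit str/bytes conversions; "utf-8" on Python 3.
        "sys.getdefaultencoding": sys.getdefaultencoding(),
        # Locale-derived encoding; on Python 3.6, open() uses it in text
        # mode when `encoding` is omitted.
        "locale.getpreferredencoding": locale.getpreferredencoding(False),
        # Encoding used by print() for standard output (may be None).
        "sys.stdout.encoding": sys.stdout.encoding,
    }


for name, value in report_encodings().items():
    print("{}: {}".format(name, value))
```

If the locale-derived value is an ASCII variant, for example under a C/POSIX locale, that would be consistent with the 'ascii' codec named in the tracebacks.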

I haven't tried the line you suggested yet, but the following change seemed to work for me:

> git diff .
diff --git a/tasks/fewshot_gym_dataset.py b/tasks/fewshot_gym_dataset.py
index 5b21d85..75013cc 100644
--- a/tasks/fewshot_gym_dataset.py
+++ b/tasks/fewshot_gym_dataset.py
@@ -5,7 +5,7 @@ class FewshotGymDataset():
     def write_to_tsv(self, lst, out_file):
         with open(out_file, "w") as fout:
             for line in lst:
-                fout.write("{}\t{}\n".format(line[0], line[1]))
+                fout.write("{}\t{}\n".format(str(line[0]).encode('utf-8'), str(line[1]).encode('utf-8')))

 class FewshotGymClassificationDataset(FewshotGymDataset):

@@ -104,4 +104,4 @@ class FewshotGymTextToTextDataset(FewshotGymDataset):
             self.write_to_tsv(k_shot_dev, prefix + "_dev.tsv")
             self.write_to_tsv(k_shot_test, prefix + "_test.tsv")

-        return k_shot_train, k_shot_dev, k_shot_test
\ No newline at end of file
+        return k_shot_train, k_shot_dev, k_shot_test
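A note on the workaround in this diff, offered as an observation about Python 3 rather than a result confirmed in the thread: formatting a bytes object into a str inserts its repr, so the error goes away because the output becomes pure ASCII, but the TSV would contain b'...' literals with escape sequences instead of the original text. A small sketch of the difference:

```python
text = "\u2014"  # the em dash from the mocha.py traceback

# Formatting bytes into a str yields the bytes repr, not the decoded text.
as_bytes_repr = "{}".format(str(text).encode("utf-8"))

# Formatting the str directly keeps the original character.
as_text = "{}".format(text)

# ascii() keeps the printed output safe even on an ASCII-only terminal.
print(ascii(as_bytes_repr))
print(ascii(as_text))

assert as_bytes_repr == "b'\\xe2\\x80\\x94'"
assert as_text == "\u2014"
```

Opening the file with an explicit encoding="utf-8" avoids the error while keeping the text intact.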
