UnicodeDecodeError: 'ascii' codec can't decode byte ...

zszazi commented 5 years ago

❓ Questions and Help

I am facing an issue: the VQA Colab notebook runs perfectly well on Colab, but when I run the same notebook in a cloud service provider's JupyterLab environment, it throws the error below. The Pythia Captioning notebook runs without error in the same JupyterLab environment.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1496: ordinal not in range(128)

Python version: 3.6.8
Environment: JupyterLab

The full error message:

`/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:15: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  from ipykernel import kernelapp as app

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-13-747420086974> in <module>
----> 1 greek_god = PythiaDemo()

<ipython-input-12-b27d4b9a37c8> in __init__(self)
      6 
      7     def __init__(self):
----> 8         self._init_processors()
      9         self.pythia_model = self._build_pythia_model()
     10         self.detection_model = self._build_detection_model()

<ipython-input-12-b27d4b9a37c8> in _init_processors(self)
     36 
     37         self.text_processor = \
---> 38             VocabProcessor(text_processor_config.params)
     39         self.answer_processor = \
     40             VQAAnswerProcessor(answer_processor_config.params)

/content/pythia/pythia/tasks/processors.py in __init__(self, config, *args, **kwargs)
    210             )
    211 
--> 212         self.vocab = Vocab(*args, **config.vocab, **kwargs)
    213         self._init_extras(config)
    214 

/content/pythia/pythia/utils/vocab.py in __init__(self, *args, **params)
     36                 raise ValueError("No vocab path or embedding_name passed for vocab")
     37 
---> 38             self.vocab = IntersectedVocab(*args, **params)
     39 
     40         elif vocab_type == "extracted":

/content/pythia/pythia/utils/vocab.py in __init__(self, vocab_file, embedding_name, *args, **kwargs)
    268             mentioned above
    269         """
--> 270         super(IntersectedVocab, self).__init__(vocab_file, *args, **kwargs)
    271 
    272         self.type = "intersected"

/content/pythia/pythia/utils/vocab.py in __init__(self, vocab_file, embedding_dim, data_root_dir, *args, **kwargs)
    122 
    123             with open(vocab_file, "r") as f:
--> 124                 for line in f:
    125                     self.itos[index] = line.strip()
    126                     self.word_dict[line.strip()] = index

/opt/conda/lib/python3.6/encodings/ascii.py in decode(self, input, final)
     24 class IncrementalDecoder(codecs.IncrementalDecoder):
     25     def decode(self, input, final=False):
---> 26         return codecs.ascii_decode(input, self.errors)[0]
     27 
     28 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1496: ordinal not in range(128)`

This occurs in the cell where I try to create an instance of the PythiaDemo class.

@apsdehal Would love to know your thoughts too!!

apsdehal commented 5 years ago

@zszazi Can you try the suggestion here: https://stackoverflow.com/a/40346898/2428889?

apsdehal commented 5 years ago

Alternatively, this can also be done: https://stackoverflow.com/a/46851339/2428889

zszazi commented 5 years ago

@apsdehal I tried both of these before posting the issue here and neither worked, although when I checked, the default encoding in JupyterLab appears to be 'utf-8'.
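One possible explanation for that mismatch, offered as an assumption rather than a confirmed diagnosis: on Python 3, `sys.getdefaultencoding()` always reports 'utf-8', whereas on Python 3.6 `open()` without an `encoding` argument falls back to `locale.getpreferredencoding(False)`, which can be ASCII when the kernel starts under a C/POSIX locale. The sketch below prints both values so they can be compared from inside the notebook kernel. The helper name `report_encodings` is made up for illustration.

```python
import locale
import os
import sys


def report_encodings():
    """Collect the settings that influence text-mode open() in this process."""
    return {
        # Default for str.encode()/bytes.decode(); always 'utf-8' on Python 3.
        "sys.getdefaultencoding()": sys.getdefaultencoding(),
        # What open(path, "r") uses on Python 3.6 when no encoding= is passed.
        "locale.getpreferredencoding(False)": locale.getpreferredencoding(False),
        # Locale-related environment variables visible to this process.
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(f"{name}: {value}")
```

If the second value is something like 'ANSI_X3.4-1968' (an alias for ASCII), that would be consistent with the 'ascii' codec in the traceback.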

apsdehal commented 5 years ago

I am unable to reproduce it but can you change line 124 in vocab.py to be something like:

lines = f.readlines()
lines = [line.decode("utf-8").strip("\n") for line in lines]
for line in lines:

This is definitely an issue related to UTF-8 decoding, and it would be great if you could isolate the main cause. To confirm that the problem is with decoding, try directly loading vocabulary_100k.txt, the file that triggers the error:

with open("vocabulary_100k.txt", "r") as f:
    for line in f:
        continue

Does this cause any issues?
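A complementary check, sketched here under the assumption that the file is meant to be UTF-8, is to read it in binary mode so that no decoding happens at all. The two helpers below are hypothetical and not part of Pythia. One locates the first non-ASCII byte, the other verifies that the whole file decodes as UTF-8.

```python
def first_non_ascii_byte(path):
    """Return (offset, byte_value) of the first byte >= 0x80, or None if pure ASCII."""
    # Binary mode performs no decoding, so this cannot raise UnicodeDecodeError.
    with open(path, "rb") as f:
        data = f.read()
    for offset, value in enumerate(data):
        if value >= 0x80:
            return offset, value
    return None


def is_valid_utf8(path):
    """Return True if the entire file decodes as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

If `first_non_ascii_byte("vocabulary_100k.txt")` reports a byte such as 0xc2 and `is_valid_utf8` returns True, the file itself is fine and the encoding chosen by the reader is the likely culprit.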

zszazi commented 5 years ago

@apsdehal I don't know what's happening, but this is an interesting problem. I downloaded vocabulary_100k.txt and tried to open it on its own; it works perfectly when I specify the encoding, but the same error pops up when I do it in Pythia:

`/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:14: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-33-747420086974> in <module>
----> 1 greek_god = PythiaDemo()

<ipython-input-32-c3e13d364a8e> in __init__(self)
      5 
      6   def __init__(self):
----> 7     self._init_processors()
      8     self.pythia_model = self._build_pythia_model()
      9     self.detection_model = self._build_detection_model()

<ipython-input-32-c3e13d364a8e> in _init_processors(self)
     28     answer_processor_config.params.vocab_file = "/content/model_data/answers_vqa.txt"
     29     # Add preprocessor as that will needed when we are getting questions from user
---> 30     self.text_processor = VocabProcessor(text_processor_config.params)
     31     self.answer_processor = VQAAnswerProcessor(answer_processor_config.params)
     32 

/content/pythia/pythia/tasks/processors.py in __init__(self, config, *args, **kwargs)
    210             )
    211 
--> 212         self.vocab = Vocab(*args, **config.vocab, **kwargs)
    213         self._init_extras(config)
    214 

/content/pythia/pythia/utils/vocab.py in __init__(self, *args, **params)
     36                 raise ValueError("No vocab path or embedding_name passed for vocab")
     37 
---> 38             self.vocab = IntersectedVocab(*args, **params)
     39 
     40         elif vocab_type == "extracted":

/content/pythia/pythia/utils/vocab.py in __init__(self, vocab_file, embedding_name, *args, **kwargs)
    268             Embedding name picked up from the list of the pretrained aliases
    269             mentioned above
--> 270         """
    271         super(IntersectedVocab, self).__init__(vocab_file, *args, **kwargs)
    272 

/content/pythia/pythia/utils/vocab.py in __init__(self, vocab_file, embedding_dim, data_root_dir, *args, **kwargs)
    122 
    123             with open("vocabulary_100k.txt", "r",encoding="utf-8", errors='ignore') as f:
--> 124                 for line in f:
    125                     line = [line.decode("utf-8").strip("\n") for line in lines]
    126                     self.itos[index] = line.strip()

/opt/conda/lib/python3.6/encodings/ascii.py in decode(self, input, final)
     24 class IncrementalDecoder(codecs.IncrementalDecoder):
     25     def decode(self, input, final=False):
---> 26         return codecs.ascii_decode(input, self.errors)[0]
     27 
     28 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1496: ordinal not in range(128)`

Is there any way to fix this other than manually removing the non-ASCII symbols from vocabulary_100k.txt?

zszazi commented 5 years ago

I removed all the non-ASCII characters from vocabulary_100k.txt using the command tr -dc '\0-\177' <vocabulary_100k.txt > newfile.txt and then loaded newfile.txt in vocab.py.

apsdehal commented 5 years ago

This will mess up the initial embedding and won't work. Can you look at the replacement I mentioned above? This needs to be fixed properly for #105 to be fixed. Please make exactly that replacement, changing

for line in f:
    line = [line.decode("utf-8").strip("\n") for line in lines]

to

lines = f.readlines()
lines = [line.decode("utf-8").strip("\n") for line in lines]
for line in lines:
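To illustrate why stripping non-ASCII bytes is lossy rather than a fix, here is a small sketch. The token list is hypothetical, and the exact way Pythia's embeddings would be affected is an assumption. The point is only that deleting bytes changes tokens, can produce empty lines, and can make distinct tokens collide.

```python
def strip_non_ascii(text):
    """Drop every byte outside 0-127 from the UTF-8 form of text, like the tr command in the thread."""
    raw = text.encode("utf-8")
    return bytes(b for b in raw if b < 0x80).decode("ascii")


def build_word_dict(tokens):
    """Map token -> line index; later duplicates overwrite earlier ones."""
    return {token: index for index, token in enumerate(tokens)}


original = ["caf", "café", "°", "dog"]  # hypothetical vocabulary lines
stripped = [strip_non_ascii(t) for t in original]

print(stripped)  # ['caf', 'caf', '', 'dog']
print(len(build_word_dict(original)), len(build_word_dict(stripped)))  # 4 3
```

After stripping, two lines map to the same token and one line is empty, so tokens no longer correspond one-to-one with whatever was trained or looked up against the original file.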

zszazi commented 5 years ago

Thanks, that solved the UnicodeDecodeError, but it produced a new error: 'str' object has no attribute 'decode'

/content/pythia/pythia/utils/vocab.py in __init__(self, vocab_file, embedding_dim, data_root_dir, *args, **kwargs)
    123             with open(vocab_file, "r",encoding="utf-8") as f:
    124                 lines = f.readlines()
--> 125                 lines = [line.decode("utf-8").strip("\n") for line in lines]
    126                 for line in lines:
    127                     self.itos[index] = line.strip()

/content/pythia/pythia/utils/vocab.py in <listcomp>(.0)
    123             with open(vocab_file, "r",encoding="utf-8") as f:
    124                 lines = f.readlines()
--> 125                 lines = [line.decode("utf-8").strip("\n") for line in lines]
    126                 for line in lines:
    127                     self.itos[index] = line.strip()

AttributeError: 'str' object has no attribute 'decode'
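For context on this AttributeError: in Python 3, a file opened in text mode already yields decoded `str` lines, and `str` has no `.decode()` method; only `bytes` read in binary mode need decoding. The sketch below uses a throwaway sample file (hypothetical data) to show the two equivalent ways of reading.

```python
import os
import tempfile

# Write a small UTF-8 file containing one non-ASCII line (sample data only).
path = os.path.join(tempfile.mkdtemp(), "sample_vocab.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("hello\n°\n")

# Text mode: open() decodes the bytes, so each line is already a str.
with open(path, "r", encoding="utf-8") as f:
    text_lines = [line.strip("\n") for line in f]

# Binary mode: each line is bytes, so .decode() is required (and valid).
with open(path, "rb") as f:
    bytes_lines = [line.decode("utf-8").strip("\n") for line in f]

assert text_lines == bytes_lines == ["hello", "°"]
```

Either style works, but mixing them (text mode plus `.decode()`) raises the error shown above.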

apsdehal commented 5 years ago

Can you change .decode("utf-8").strip("\n") to just .strip("\n") and check if it works?

zszazi commented 5 years ago

I tried it, and it doesn't work. Maybe it's an issue with JupyterLab internals.

apsdehal commented 5 years ago

Can you post the error here? Also, can you post the output of python -m torch.utils.collect_env from where you run your jupyter notebook?

zszazi commented 5 years ago

python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: GeForce GTX 1080 Ti
Nvidia driver version: 418.40.04
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.5.0

Versions of relevant libraries:
[pip] msgpack-numpy==0.4.3.2
[pip] numpy==1.16.3
[pip] pytorch-ignite==0.2.0
[pip] torch==1.1.0
[pip] torchvision==0.2.2
[conda] blas 1.0 mkl
[conda] ignite 0.2.0 py36_0 pytorch
[conda] mkl 2019.3 199
[conda] mkl_fft 1.0.12 py36ha843d7b_0
[conda] mkl_random 1.0.2 py36hd81dba3_0
[conda] pytorch 1.1.0 py3.6_cuda10.0.130_cudnn7.5.1_0 pytorch
[conda] torchvision 0.2.2 py_3 pytorch

I get the same error if I remove the part you mentioned.

apsdehal commented 5 years ago

Ok let's try this, it seems to work:

Replace

with open(vocab_file, "r") as f:
    lines = f.readlines()
    lines = [line.strip("\n") for line in lines]
    for line in lines:

with

with open(vocab_file, "r", encoding="utf-8", errors="ignore") as f:
    lines = f.readlines()
    lines = [line.strip("\n") for line in lines]
    for line in lines:
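Put together, the loading step might look like the sketch below. `load_vocab` is a hypothetical stand-alone helper, not Pythia's actual `Vocab` class, which does more than this. It passes `encoding="utf-8"` explicitly so the result no longer depends on the process locale, and it deliberately omits `errors="ignore"` on the assumption that the vocabulary file is valid UTF-8, so a malformed file raises instead of silently losing bytes.

```python
def load_vocab(vocab_file, start_index=0):
    """Read one token per line into index->token and token->index maps."""
    itos = {}
    word_dict = {}
    # Explicit encoding: independent of LANG/LC_ALL in the Jupyter kernel.
    with open(vocab_file, "r", encoding="utf-8") as f:
        for index, line in enumerate(f, start=start_index):
            token = line.strip()
            itos[index] = token
            word_dict[token] = index
    return itos, word_dict
```

With `errors="ignore"`, as in the snippet above, undecodable bytes are dropped silently; on a valid UTF-8 file the two variants should produce the same result.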

facebookresearch / mmf

UnicodeDecodeError: 'ascii' codec can't decode byte ... #101
