Tencent / NeuralNLP-NeuralClassifier

An Open-source Neural Hierarchical Multi-label Text Classification Toolkit

use glove.6B.50d.txt failed : UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2273: character maps to <undefined> #91

Closed SeekPoint closed 3 years ago

SeekPoint commented 3 years ago

python train.py conf/mul_RCNN_train.json
Use dataset to generate dict.
Size of doc_label dict is 4
Size of doc_token dict is 46121
Size of doc_char dict is 77
Size of doc_token_ngram dict is 0
Size of doc_keyword dict is 0
Size of doc_topic dict is 0
Shrink dict over.
Size of doc_label dict is 4
Size of doc_token dict is 36938
Size of doc_char dict is 77
Size of doc_token_ngram dict is 0
Size of doc_keyword dict is 0
Size of doc_topic dict is 0
Load doc_token embedding from data/glove.6B.50d.txt
Traceback (most recent call last):
  File "train.py", line 261, in <module>
    train(config)
  File "train.py", line 214, in train
    model = get_classification_model(model_name, empty_dataset, conf)
  File "train.py", line 82, in get_classification_model
    model = globals()[model_name](dataset, conf)
  File "D:\ghprj\NeuralNLP-NeuralClassifier\model\classification\textrcnn.py", line 27, in __init__
    super(TextRCNN, self).__init__(dataset, config)
  File "D:\ghprj\NeuralNLP-NeuralClassifier\model\classification\classifier.py", line 46, in __init__
    model_mode=dataset.model_mode)
  File "D:\ghprj\NeuralNLP-NeuralClassifier\model\embedding.py", line 89, in __init__
    pretrained_embedding_file)
  File "D:\ghprj\NeuralNLP-NeuralClassifier\model\embedding.py", line 108, in load_pretrained_embedding
    for line in fin:
  File "C:\Program Files\Python37\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2273: character maps to <undefined>

coderbyr commented 3 years ago

Please check whether the tokens in glove.6B.50d.txt are encoded as UTF-8. If they are not, filter out the illegal characters and try again.
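
The traceback above passes through encodings\cp1252.py, which suggests the embedding file was decoded with the Windows default codec rather than UTF-8. Byte 0x9d has no mapping in cp1252, but it does occur inside valid UTF-8 sequences (for example U+201D is encoded as E2 80 9D), so the file itself may be fine. A minimal sketch of reading a GloVe-style file with an explicit encoding is below. The helper name load_glove_vectors and the sample file are hypothetical and only for illustration; this is not the toolkit's own loader, and it assumes one token followed by space-separated floats per line.

```python
import os
import tempfile


def load_glove_vectors(path, encoding="utf-8"):
    """Read a GloVe-style text file into a dict of token -> list of floats.

    Hypothetical helper for illustration; not the toolkit's own loader.
    """
    vectors = {}
    # Passing encoding explicitly avoids falling back to the platform
    # default (cp1252 on many Windows setups), which cannot decode 0x9d.
    with open(path, "r", encoding=encoding) as fin:
        for line in fin:
            parts = line.rstrip("\n").split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Usage sketch: write a tiny sample file containing U+201D, whose UTF-8
# encoding ends in byte 0x9d, then load it back with the helper.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_vectors.txt")
    with open(sample_path, "w", encoding="utf-8") as fout:
        fout.write("\u201d 0.1 0.2\nthe 0.3 0.4\n")
    sample = load_glove_vectors(sample_path)
```

If the same idea applies here, the likely fix is to pass encoding="utf-8" wherever the embedding file is opened (the traceback points at load_pretrained_embedding in model/embedding.py). Alternatively, Python 3.7 and later can be started in UTF-8 mode with `python -X utf8 train.py conf/mul_RCNN_train.json`, which changes the default text encoding without touching the code.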