Franck-Dernoncourt / NeuroNER

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.
http://neuroner.com
MIT License

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 0: invalid continuation byte #164

Closed trinh-hoang-hiep closed 4 years ago

trinh-hoang-hiep commented 4 years ago

When I use NeuroNER with Vietnamese, I get the error below. My word embedding file is baomoi.vn.model.bin, and this is the command I run:

!neuroner --token_pretrained_embedding_filepath=./data/word_vectors/baomoi.vn.model.bin --maximum_number_of_epochs=20 --train_model=True --dataset_text_folder="/content/drive/My Drive/NeuroNER-master/neuroner/data/ner_for_email"

{'character_embedding_dimension': 25,
 'character_lstm_hidden_state_dimension': 25,
 'check_for_digits_replaced_with_zeros': 1,
 'check_for_lowercase': 1,
 'dataset_text_folder': '/content/drive/My '
                        'Drive/NeuroNER-master/neuroner/data/ner_for_email',
 'debug': 0,
 'dropout_rate': 0.5,
 'experiment_name': 'test',
 'fetch_data': '',
 'fetch_trained_model': '',
 'freeze_token_embeddings': 0,
 'gradient_clipping_value': 5.0,
 'learning_rate': 0.005,
 'load_all_pretrained_token_embeddings': 0,
 'load_only_pretrained_token_embeddings': 0,
 'main_evaluation_mode': 'conll',
 'maximum_number_of_epochs': 20,
 'number_of_cpu_threads': 8,
 'number_of_gpus': 0,
 'optimizer': 'sgd',
 'output_folder': './output',
 'output_scores': 0,
 'parameters_filepath': './parameters.ini',
 'patience': 10,
 'plot_format': 'pdf',
 'pretrained_model_folder': './trained_models/conll_2003_en',
 'reload_character_embeddings': 1,
 'reload_character_lstm': 1,
 'reload_crf': 1,
 'reload_feedforward': 1,
 'reload_token_embeddings': 1,
 'reload_token_lstm': 1,
 'remap_unknown_tokens_to_unk': 1,
 'spacylanguage': 'en',
 'tagging_format': 'bioes',
 'token_embedding_dimension': 100,
 'token_lstm_hidden_state_dimension': 100,
 'token_pretrained_embedding_filepath': './data/word_vectors/baomoi.vn.model.bin',
 'tokenizer': 'spacy',
 'train_model': 1,
 'use_character_lstm': 1,
 'use_crf': 1,
 'use_pretrained_model': 0,
 'verbose': 0}
Checking the validity of BRAT-formatted train set... Done.
Checking compatibility between CONLL and BRAT for train_compatible_with_brat set ... Done.
Checking validity of CONLL BIOES format... Done.
Checking the validity of BRAT-formatted valid set... Done.
Checking compatibility between CONLL and BRAT for valid_compatible_with_brat set ... Done.
Checking validity of CONLL BIOES format... Done.
Checking the validity of BRAT-formatted test set... Done.
Checking compatibility between CONLL and BRAT for test_compatible_with_brat set ... Done.
Checking validity of CONLL BIOES format... Done.
Load dataset...
Traceback (most recent call last):
  File "/usr/local/bin/neuroner", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/neuroner/main.py", line 109, in main
    nn = neuromodel.NeuroNER(**arguments)
  File "/usr/local/lib/python3.6/dist-packages/neuroner/neuromodel.py", line 466, in __init__
    token_to_vector = self.modeldata.load_dataset(self.dataset_filepaths, self.parameters)
  File "/usr/local/lib/python3.6/dist-packages/neuroner/dataset.py", line 157, in load_dataset
    token_to_vector = utils_nlp.load_pretrained_token_embeddings(parameters)
  File "/usr/local/lib/python3.6/dist-packages/neuroner/utils_nlp.py", line 33, in load_pretrained_token_embeddings
    for cur_line in file_input:
  File "/usr/lib/python3.6/codecs.py", line 713, in __next__
    return next(self.reader)
  File "/usr/lib/python3.6/codecs.py", line 644, in __next__
    line = self.readline()
  File "/usr/lib/python3.6/codecs.py", line 557, in readline
    data = self.read(readsize, firstline=True)
  File "/usr/lib/python3.6/codecs.py", line 503, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 0: invalid continuation byte
Exception ignored in: <bound method NeuroNER.__del__ of <neuroner.neuromodel.NeuroNER object at 0x7f286220b160>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/neuroner/neuromodel.py", line 824, in __del__
    self.sess.close()
AttributeError: 'NeuroNER' object has no attribute 'sess'
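
A possible reading of the traceback, not confirmed anywhere in this thread: load_pretrained_token_embeddings iterates over the embedding file line by line through a UTF-8 codecs reader, so it seems to expect a plain-text file. The .bin suffix suggests that baomoi.vn.model.bin is stored in the word2vec binary format, whose raw float bytes are not valid UTF-8. The sketch below converts such a file to text lines of the form "token v1 v2 ...", using only the standard library. The function name word2vec_bin_to_text is hypothetical, and both the binary layout and the text layout the loader wants are assumptions. The word2vec header line is deliberately not written to the output, on the assumption that the loader parses every line as a token followed by floats.

```python
import struct


def word2vec_bin_to_text(bin_path, txt_path):
    """Convert a word2vec binary file to UTF-8 'token v1 v2 ...' lines.

    Hypothetical helper. Assumes the usual word2vec binary layout:
    an ASCII header "<vocab_size> <vector_size>\n", then for each entry
    the token bytes, one space, and vector_size little-endian float32
    values, optionally followed by a newline.
    Returns the number of vectors written.
    """
    with open(bin_path, "rb") as src, open(txt_path, "w", encoding="utf-8") as dst:
        vocab_size, dim = (int(x) for x in src.readline().split())
        for _ in range(vocab_size):
            # Collect token bytes up to the separating space.
            token = bytearray()
            while True:
                ch = src.read(1)
                if ch == b"":
                    raise ValueError("unexpected end of file while reading a token")
                if ch == b" ":
                    break
                if ch != b"\n":  # skip the newline some writers put after a vector
                    token.extend(ch)
            raw = src.read(4 * dim)
            if len(raw) != 4 * dim:
                raise ValueError("truncated vector")
            values = struct.unpack("<%df" % dim, raw)
            dst.write(token.decode("utf-8", errors="replace"))
            dst.write(" ")
            dst.write(" ".join(repr(v) for v in values))
            dst.write("\n")
    return vocab_size
```

If this assumption holds, the converted text file could be passed to --token_pretrained_embedding_filepath instead of the .bin file, with token_embedding_dimension set to match the vector size in the file's header. If the file was instead written in some other format (for example a pickled model), this converter will not apply.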

Franck-Dernoncourt commented 4 years ago

What was the fix?

On Wed, Jun 10, 2020, 10:09 trinh-hoang-hiep notifications@github.com wrote:

Closed #164 https://github.com/Franck-Dernoncourt/NeuroNER/issues/164.
