Alibaba-NLP / ACE

[ACL-IJCNLP 2021] Automated Concatenation of Embeddings for Structured Prediction

AttributeError: 'NoneType' object has no attribute 'tokenize' #41

Open junwei-h opened 1 year ago

junwei-h commented 1 year ago

Hello, I hit an error while running python .\ --config .\config\doc_ner_best.yaml --batch_size 1 --parse --target_dir .\datasets\mytest --keep_order on Windows 10, Python 3.7, no GPU.

Here is the error message:

2022-07-28 14:35:50,789 Reading data from datasets\mytest
2022-07-28 14:35:50,789 Train: datasets\mytest\doc_train.txt
2022-07-28 14:35:50,789 Dev: None
2022-07-28 14:35:50,791 Test: None
Traceback (most recent call last):
  File ".\", line 345, in <module>
    train_eval_result, train_loss = student.evaluate(loader,out_path=Path('outputs/train.'+config.config['model_name']+'.'+tar_file_name+'.conllu'),embeddings_storage_mode="none",prediction_mode=True)
  File "C:\Users\ebb\ACE\flair\models\", line 2218, in evaluate
    features = self.forward(batch,prediction_mode=prediction_mode)
  File "C:\Users\ebb\ACE\flair\models\", line 818, in forward
    self.embeddings.embed(sentences,embedding_mask=self.selection)
  File "C:\Users\ebb\ACE\flair\", line 184, in embed
    embedding.embed(sentences)
  File "C:\Users\ebb\ACE\flair\", line 97, in embed
    self._add_embeddings_internal(sentences)
  File "C:\Users\ebb\ACE\flair\", line 2962, in _add_embeddings_internal
    self._add_embeddings_to_sentences(sentences)
  File "C:\Users\ebb\ACE\flair\", line 3051, in _add_embeddings_to_sentences
    subtokenized_sentence = self.tokenizer.tokenize(tokenized_string)
AttributeError: 'NoneType' object has no attribute 'tokenize'

The error is triggered by the last line in the traceback, subtokenized_sentence = self.tokenizer.tokenize(tokenized_string), because self.tokenizer is None.

Any suggestions on how to debug this issue? Thanks.

By the way, the content of doc_train.txt is the following gibberish:

-DOCSTART- O

Amazon O
predict O
Paypal O
and O
do O
7-11 O
for O
Canada O
and O
Hongkong O

wangxinyu0922 commented 1 year ago

Sometimes the embeddings cannot read the saved tokenizer correctly. I added some lines in to fix this issue.
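One generic way to guard against a tokenizer that comes back as None after loading is to re-create it on first use. The sketch below is not the change made in the repository: WordEmbedder, make_tokenizer and the whitespace tokenizer are hypothetical stand-ins, chosen so that the example runs without downloading a model.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in for a real subword tokenizer."""

    def tokenize(self, text):
        return text.split()


def make_tokenizer(model_name):
    # Hypothetical factory; a real one would load the tokenizer for model_name.
    return WhitespaceTokenizer()


class WordEmbedder:
    """Hypothetical embedding class whose tokenizer may be missing after loading."""

    def __init__(self, model_name):
        self.model_name = model_name
        self.tokenizer = make_tokenizer(model_name)

    def _ensure_tokenizer(self):
        # Re-create the tokenizer if it is absent or None.
        if getattr(self, "tokenizer", None) is None:
            self.tokenizer = make_tokenizer(self.model_name)
        return self.tokenizer

    def subtokenize(self, text):
        return self._ensure_tokenizer().tokenize(text)


embedder = WordEmbedder("roberta-large")
embedder.tokenizer = None  # simulate a checkpoint loaded without its tokenizer
tokens = embedder.subtokenize("Amazon predict Paypal")
```

The point of the pattern is that every call path goes through _ensure_tokenizer, so a missing tokenizer is repaired instead of raising AttributeError.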

junwei-h commented 1 year ago

I still get an error with the updated version. Here is the error message:

[2022-07-29 21:50:39,054 INFO] loading file from cache at C:\Users\ebb/.cache\torch\transformers\5b125ba222ff82664771f63cd8fac9696c24b403fc1ab720d537fe2ceaaf0576.8b10bd978b5d01c21303cc761fc9ecd464419b3bf921864a355ba807cfbfafa8
Traceback (most recent call last):
  File ".\", line 206, in <module>
  File "C:\Users\ebb\.conda\envs\ace_py37\lib\site-packages\torch\nn\modules\", line 585, in __getattr__
    type(self).__name__, name))
AttributeError: 'TransformerWordEmbeddings' object has no attribute 'add_special_tokens'

I had to add the line if '/' in name: name = name.split('/')[-1]; otherwise I get the error in issue
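For reference, the effect of that one-line workaround can be checked in isolation. This is a minimal sketch; the helper name short_model_name and the sample paths are made up for illustration.

```python
def short_model_name(name):
    """Keep only the last component of a '/'-separated model name."""
    if '/' in name:
        name = name.split('/')[-1]
    return name


examples = [
    "some/cache/dir/roberta-large",  # hypothetical path ending in a model name
    "roberta-large",                 # already short, returned unchanged
]
shortened = [short_model_name(n) for n in examples]
```

Only forward slashes are split, so a path written purely with backslashes would pass through unchanged.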

wangxinyu0922 commented 1 year ago

Oops, this line is not needed in the code. I have fixed it.

junwei-h commented 1 year ago

Another error:

2022-08-01 17:33:40,842 Reading data from datasets\mytest
2022-08-01 17:33:40,843 Train: datasets\mytest\doc_train.txt
2022-08-01 17:33:40,844 Dev: None
2022-08-01 17:33:40,844 Test: None
Traceback (most recent call last):
  File ".\", line 368, in <module>
    train_eval_result, train_loss = student.evaluate(loader,out_path=Path('outputs/train.'+config.config['model_name']+'.'+tar_file_name+'.conllu'),embeddings_storage_mode="none",prediction_mode=True)
  File "C:\Users\ebb\ACE\flair\models\", line 2212, in evaluate
    features = self.forward(batch,prediction_mode=prediction_mode)
  File "C:\Users\ebb\ACE\flair\models\", line 818, in forward
  File "C:\Users\ebb\ACE\flair\", line 184, in embed
  File "C:\Users\ebb\ACE\flair\", line 97, in embed
  File "C:\Users\ebb\ACE\flair\", line 2943, in _add_embeddings_internal
    self.add_document_embeddings_v2(sentences, max_sequence_length = model_max_length, batch_size = 32 if not hasattr(self,'doc_batch_size') else self.doc_batch_size)
  File "C:\Users\ebb\ACE\flair\", line 3570, in add_document_embeddings_v2
    for doc_pos, doc_sent in enumerate(sentence.doc):
AttributeError: 'Sentence' object has no attribute 'doc'

EDIT (debug info): '/home/yongjiang.jy/.flair/embeddings/en-xlmr-first-docv2_10epoch_1batch_4accumulate_0.000005lr_10000lrrate_eng_monolingual_nocrf_fast_norelearn_sentbatch_sentloss_finetune_nodev_saving_ner5/roberta-large' and sentence.batch_pos={}, so the code runs the else branch, and sentence = Sentence: "-DOCSTART-" - 1 Tokens, which has no doc attribute.
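The failure mode described here, a sentence that was never attached to a document, can be reproduced and guarded against with getattr. The sketch below uses a hypothetical Sentence stand-in rather than the class in the repository, and the fallback (treating the sentence as its own one-sentence document) is an assumption, not the fix applied upstream.

```python
class Sentence:
    """Hypothetical stand-in: holds text and, optionally, its document."""

    def __init__(self, text, doc=None):
        self.text = text
        if doc is not None:
            self.doc = doc  # only set when document context is available


def document_of(sentence):
    # Fall back to a one-sentence document when no 'doc' attribute exists.
    return getattr(sentence, "doc", [sentence])


lonely = Sentence("-DOCSTART-")
context = [doc_sent.text for doc_sent in document_of(lonely)]
```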

wangxinyu0922 commented 1 year ago

I didn't consider the document-level ACE scenario for prediction. I added some tricks in to fix it; please give it a try. Alternatively, you can replace the test file in the CoNLL-03 dataset with your own test file (doc_train.txt) and use the --test option for prediction. Note: --test works if you only want to predict thousands of sentences. If you want to predict millions of sentences, I still suggest using --parse, since it is more CPU-memory friendly.

junwei-h commented 1 year ago

Sorry, I still get the same error as before: AttributeError: 'Sentence' object has no attribute 'doc'. See the debug info in my previous post.

junwei-h commented 1 year ago

Using the latest code and running python .\ --config .\config\doc_ner_best.yaml --batch_size 1 --parse --target_dir .\datasets\mytest --keep_order on Windows 10, Python 3.7, no GPU, I get another error:

Setting embedding mask to the best action: tensor([1., 1., 0., 1., 0., 0., 1., 0., 0., 1., 1., 1.])
2022-08-19 10:56:10,298 Reading data from datasets\mytest
2022-08-19 10:56:10,298 Train: datasets\mytest\doc_train.txt
2022-08-19 10:56:10,298 Dev: None
2022-08-19 10:56:10,299 Test: None
Traceback (most recent call last):
  File ".\", line 379, in <module>
    corpus_data = trainer.assign_corpus(corpus = corpus.train, set_name= args.set_name, corpus_name = args.corpus_name, train_with_doc = True, pretrained_file_dict = config.config['ReinforcementTrainer']['pretrained_file_dict'])
  File "C:\Users\ACE\flair\trainers\", line 1402, in assign_corpus
AttributeError: 'Subset' object has no attribute 'reset_sentence_count'
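This last error is consistent with the training split being wrapped in a subset object that does not forward custom attributes of the dataset it wraps. A minimal sketch of that behaviour follows, assuming a wrapper that keeps the original dataset under a dataset attribute, as torch.utils.data.Subset does. IndexSubset and CountingDataset are hypothetical stand-ins so the example runs without PyTorch, and whether this matches the code path in the repository is not confirmed by the thread.

```python
class CountingDataset:
    """Hypothetical dataset with a custom helper method."""

    def __init__(self, items):
        self.items = list(items)
        self.sentence_count = len(self.items)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]

    def reset_sentence_count(self):
        self.sentence_count = 0


class IndexSubset:
    """Stand-in wrapper: exposes selected indices, keeps the original in .dataset."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = list(indices)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, index):
        return self.dataset[self.indices[index]]


full = CountingDataset(["s1", "s2", "s3"])
subset = IndexSubset(full, [0, 2])

# The custom method is not forwarded by the wrapper.
wrapper_has_helper = hasattr(subset, "reset_sentence_count")
# It is still reachable through the wrapped dataset.
subset.dataset.reset_sentence_count()
```

If the corpus in question is wrapped this way, checking hasattr on the wrapper before calling the helper, or calling it on the wrapped dataset, would avoid the AttributeError.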