khairunnisaor closed this issue 2 years ago
Hi Oryza,
Thanks for your interest. I think these changes will do.
Regards, Leyang
Hi Leyang,
I've tried the modification I mentioned above, but then I encountered this error:
```
INFO:seq2seq_model:{'eval_loss': 2.992473702499832, 'eval_acc': 0.7222459436379163}
INFO:seq2seq_model:Saving model into outputs/best_model
Epoch 1 of 20:   0%|          | 0/20 [00:47<?, ?it/s]
Traceback (most recent call last):
  File "train_id.py", line 47, in <module>
    model.train_model(train_df, eval_data=eval_df)
  File "/home/oryza/playground/templateNER/seq2seq_model.py", line 276, in train_model
    **kwargs,
  File "/home/oryza/playground/templateNER/seq2seq_model.py", line 634, in train
    self._save_model(args.best_model_dir, optimizer, scheduler, model=model, results=results)
  File "/home/oryza/playground/templateNER/seq2seq_model.py", line 1077, in _save_model
    self.encoder_tokenizer.save_pretrained(output_dir)
  File "/home/oryza/.pyenv/versions/templateNER/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1915, in save_pretrained
    filename_prefix=filename_prefix,
  File "/home/oryza/.pyenv/versions/templateNER/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1948, in _save_pretrained
    vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
  File "/home/oryza/.pyenv/versions/templateNER/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1968, in save_vocabulary
    raise NotImplementedError
NotImplementedError
```
However, it worked fine when I used the English BART. Do you have any thoughts about this? It stops working when saving the model after the first epoch. Or maybe there are other things that should be adjusted for the new language's pre-trained model, not only the tokenization?
Best, Oryza
I think the previous modification (indobart) already adjusted the pre-trained model. This error is due to the tokenizer.
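For example, one possible workaround (just a sketch, assuming the IndoNLG tokenizer from the indobenchmark-toolkit package keeps its SentencePiece model in a `vocab_file` attribute; the attribute and file names here are assumptions, not tested code) is to subclass the tokenizer and implement the missing `save_vocabulary` yourself:

```python
import os
import shutil

from indobenchmark import IndoNLGTokenizer  # indobenchmark-toolkit package

class SavableIndoNLGTokenizer(IndoNLGTokenizer):
    """IndoNLGTokenizer with the save_vocabulary hook that save_pretrained needs."""

    def save_vocabulary(self, save_directory, filename_prefix=None):
        # Copy the underlying SentencePiece model into save_directory,
        # mirroring what the BART/mBART tokenizers do.
        name = (filename_prefix + "-" if filename_prefix else "") + "sentencepiece.bpe.model"
        out_path = os.path.join(save_directory, name)
        if os.path.abspath(self.vocab_file) != os.path.abspath(out_path):
            shutil.copyfile(self.vocab_file, out_path)
        return (out_path,)
```

If that doesn't match your tokenizer's internals, a cruder stopgap is to wrap the `self.encoder_tokenizer.save_pretrained(output_dir)` call in `_save_model` in a try/except so training can finish.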
Thank you for your response.
It is true that IndoBART uses different tokenization than the English BART (IndoBART uses the IndoNLG tokenizer). Would you mind suggesting which parts of the tokenization code I should pay closer attention to when the model uses a different tokenizer?
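For context, these are the tokenizer behaviours I've been sanity-checking on my side (a sketch of my own checks, not code from the repo; the checkpoint name is the one from the IndoNLG docs and may differ from other setups):

```python
from indobenchmark import IndoNLGTokenizer

tok = IndoNLGTokenizer.from_pretrained("indobenchmark/indobart-v2")

# Encoding, as used when building the model inputs:
enc = tok("Joko Widodo adalah entitas tokoh .", return_tensors="pt")

# Special tokens that the seq2seq code adds and strips:
print(tok.bos_token, tok.eos_token, tok.pad_token)

# Decoding, as used when turning generated ids back into template strings;
# this should round-trip the template text almost exactly.
print(tok.decode(enc["input_ids"][0], skip_special_tokens=True))
```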
Hi,
I've made the Indonesian BART model work with your implementation, but during prediction in `train.py` it turns out to output something like `['adalah entitas tokoh.']` or `['adalah entitas tokoh.. adalah entitas produk. entitas produk.']`, while my template is `<tokens> adalah entitas <entity_type> .` (roughly "<tokens> is a <entity_type> entity"; the words 'tokoh' and 'produk' are the entity types, person and product).
It looks like it can't recover the tokens when predicting the entity types. Does this kind of error make sense to you? I've struggled to find the cause of this problem. It would be very helpful if you could share some thoughts on it.
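To narrow it down, I've also tried running generation directly with the underlying Hugging Face model, roughly like this (a sketch of my own test, not repo code; it loads the fine-tuned weights from outputs/best_model and the tokenizer from the base checkpoint, and assumes IndoBART pairs with the MBart classes):

```python
import torch
from transformers import MBartForConditionalGeneration
from indobenchmark import IndoNLGTokenizer

tok = IndoNLGTokenizer.from_pretrained("indobenchmark/indobart-v2")
model = MBartForConditionalGeneration.from_pretrained("outputs/best_model")
model.eval()

# Encode a source sentence the same way the training data is built.
inputs = tok("Joko Widodo adalah presiden Indonesia .", return_tensors="pt")

with torch.no_grad():
    out = model.generate(**inputs, num_beams=4, max_length=50)

# If the source tokens are already missing here, the model/tokenizer pairing
# is suspect; if they show up, the decode step inside predict() is the more
# likely culprit.
print(tok.decode(out[0], skip_special_tokens=True))
```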
Thank you in advance!
Hi,
Thank you for your great contribution to this interesting template NER topic. I wonder if it's possible to adapt this code to another language. I've added the model and tokenizer to `MODEL_CLASSES` (and other parts), since it has a different tokenizer compared to the English BART. Could you share some hints on which parts I should pay attention to when adding other pre-trained models/languages to the code?
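For reference, the entry I added looks roughly like this (a sketch from memory, not the exact code; the MBart classes and the "indobart" key are my own choices, and the existing "bart" entry follows what I see in seq2seq_model.py):

```python
from transformers import (
    BartConfig,
    BartForConditionalGeneration,
    BartTokenizer,
    MBartConfig,
    MBartForConditionalGeneration,
)
from indobenchmark import IndoNLGTokenizer  # indobenchmark-toolkit package

MODEL_CLASSES = {
    # Existing English BART entry:
    "bart": (BartConfig, BartForConditionalGeneration, BartTokenizer),
    # New entry: IndoBART is mBART-based, so it pairs the MBart model
    # classes with the IndoNLG tokenizer.
    "indobart": (MBartConfig, MBartForConditionalGeneration, IndoNLGTokenizer),
}
```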
Thank you so much for your help!
Best, Oryza