Nealcly / templateNER

Source code for template-based NER

Implementation for other languages #15

Closed: khairunnisaor closed this 2 years ago

khairunnisaor commented 2 years ago

Hi,

thank you for your great contribution to this interesting template-based NER topic. I wonder if it's possible to adapt this code to another language. I've included the model and tokenizer in MODEL_CLASSES (and other parts), since it has a different tokenizer from the English BART:

from transformers import (AutoConfig, AutoModel, AutoTokenizer, BartConfig,
                          BartForConditionalGeneration, BartTokenizer, BertConfig,
                          BertModel, BertTokenizer, MBartConfig,
                          MBartForConditionalGeneration, RobertaConfig,
                          RobertaModel, RobertaTokenizer)
from indobenchmark import IndoNLGTokenizer  # from the indobenchmark-toolkit package

MODEL_CLASSES = {
    "auto": (AutoConfig, AutoModel, AutoTokenizer),
    "bart": (BartConfig, BartForConditionalGeneration, BartTokenizer),
    "bert": (BertConfig, BertModel, BertTokenizer),
    "roberta": (RobertaConfig, RobertaModel, RobertaTokenizer),
    "indobart": (MBartConfig, MBartForConditionalGeneration, IndoNLGTokenizer),
}
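
For completeness, here is roughly how I construct the model in my training script (a simplified sketch; the constructor signature follows the repo's simpletransformers-style API, and the checkpoint name and data loading below are illustrative):

import pandas as pd
from seq2seq_model import Seq2SeqModel  # the repo's local module

# Illustrative training data; columns assumed to be input_text / target_text.
train_df = pd.read_csv("train.csv")
eval_df = pd.read_csv("eval.csv")

model = Seq2SeqModel(
    encoder_decoder_type="indobart",                   # the key added above
    encoder_decoder_name="indobenchmark/indobart-v2",  # illustrative HF checkpoint name
    use_cuda=True,
)
model.train_model(train_df, eval_data=eval_df)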

Could you share some hints on which parts I should pay attention to when adding other pre-trained models/languages to the code?

Thank you so much for your help!

Best, Oryza

Nealcly commented 2 years ago

Hi Oryza,

Thanks for your interest. I think these changes will do.

Regards, Leyang

khairunnisaor commented 2 years ago

Hi Leyang,

I've tried the modification I mentioned above, but then I encountered this error:

INFO:seq2seq_model:{'eval_loss': 2.992473702499832, 'eval_acc': 0.7222459436379163}
INFO:seq2seq_model:Saving model into outputs/best_model
Epoch 1 of 20:   0%| | 0/20 [00:47<?, ?it/s]
Traceback (most recent call last):
  File "train_id.py", line 47, in <module>
    model.train_model(train_df, eval_data=eval_df)
  File "/home/oryza/playground/templateNER/seq2seq_model.py", line 276, in train_model
    **kwargs,
  File "/home/oryza/playground/templateNER/seq2seq_model.py", line 634, in train
    self._save_model(args.best_model_dir, optimizer, scheduler, model=model, results=results)
  File "/home/oryza/playground/templateNER/seq2seq_model.py", line 1077, in _save_model
    self.encoder_tokenizer.save_pretrained(output_dir)
  File "/home/oryza/.pyenv/versions/templateNER/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1915, in save_pretrained
    filename_prefix=filename_prefix,
  File "/home/oryza/.pyenv/versions/templateNER/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1948, in _save_pretrained
    vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
  File "/home/oryza/.pyenv/versions/templateNER/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1968, in save_vocabulary
    raise NotImplementedError
NotImplementedError

However, it worked fine when I used the English BART. Do you have any thoughts about this? It stops working when saving the model after the first epoch. Or maybe there are other things, beyond the tokenization, that should be adjusted for the new language's pre-trained models?

Best, Oryza

Nealcly commented 2 years ago

I think the previous modification (indobart) already adjusts for the pre-trained model; this error is due to the tokenizer.
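
In transformers, save_vocabulary on the base tokenizer class raises NotImplementedError unless the tokenizer overrides it, which matches your traceback. One possible workaround is to subclass the tokenizer and implement it yourself. A minimal sketch, assuming IndoNLGTokenizer is SentencePiece-based and keeps its model file path in self.vocab_file (both are assumptions; please check the tokenizer's source):

import os
from shutil import copyfile

from indobenchmark import IndoNLGTokenizer  # assumed import path

class SavableIndoNLGTokenizer(IndoNLGTokenizer):
    """Adds the save_vocabulary() that the base class leaves unimplemented."""

    def save_vocabulary(self, save_directory, filename_prefix=None):
        # Copy the SentencePiece model file into the save directory, the same
        # pattern most SentencePiece tokenizers in transformers follow.
        out_file = os.path.join(
            save_directory,
            (filename_prefix + "-" if filename_prefix else "") + "sentencepiece.bpe.model",
        )
        if os.path.abspath(self.vocab_file) != os.path.abspath(out_file):
            copyfile(self.vocab_file, out_file)
        return (out_file,)

Then point the "indobart" entry in MODEL_CLASSES at this subclass instead of IndoNLGTokenizer.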

khairunnisaor commented 2 years ago

Thank you for your response.

It is true that IndoBART uses different tokenization from the English BART (IndoBART uses the IndoNLG tokenizer). Would you mind suggesting which parts of the tokenization code I need to pay more attention to when the model uses a different tokenizer?

khairunnisaor commented 2 years ago

Hi,

I've made the Indonesian BART model work with your implementation, but during prediction in train.py it outputs something like ['adalah entitas tokoh.'] or ['adalah entitas tokoh.. adalah entitas produk. entitas produk.'], while my template is <tokens> adalah entitas <entity_type> . (Indonesian for "<tokens> is a <entity_type> entity ."). The words 'tokoh' (figure/person) and 'produk' (product) are the entity types.

It looks like it can't recover the tokens when predicting the entity types. Does this kind of error make sense to you? I've actually struggled to find the cause of this problem. It would be very helpful if you could share some thoughts on it.
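
For reference, my understanding is that the paper's inference does not rely on free-form generation: it enumerates candidate spans, fills each one into the template, and scores the filled sequence with the fine-tuned model. This is the kind of scoring I expected (a rough sketch with made-up names, not the repo's actual inference code):

import torch

def template_score(model, tokenizer, sentence, target, device="cuda"):
    """Sum of log-probabilities of the filled template given the source sentence."""
    src = tokenizer(sentence, return_tensors="pt").input_ids.to(device)
    tgt = tokenizer(target, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = model(input_ids=src, decoder_input_ids=tgt[:, :-1]).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    # Gather the log-probability of each gold target token and sum them.
    return log_probs.gather(-1, tgt[:, 1:].unsqueeze(-1)).sum().item()

def classify_span(model, tokenizer, sentence, span, types=("tokoh", "produk")):
    # Fill each entity type into the Indonesian template and keep the best score.
    scores = {t: template_score(model, tokenizer, sentence,
                                f"{span} adalah entitas {t} .")
              for t in types}
    return max(scores, key=scores.get)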

Thank you in advance!