ThilinaRajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
https://simpletransformers.ai/
Apache License 2.0

Tokenizer documentation + Moving back to Transformers library #511

Closed · hbeychaner closed this issue 4 years ago

hbeychaner commented 4 years ago

Is your feature request related to a problem? Please describe. I'm having a lot of difficulty understanding from the documentation how to include a tokenizer in the model. Trying to load tokenizers from Hugging Face leads to errors (TypeError: forward() got an unexpected keyword argument 'token_type_ids', for example). The second, related issue: once a model has been trained, how do we then use that model from Transformers? Loading the model works, but the outputs don't have labels.

Describe the solution you'd like It would be great to have an example of training/fine-tuning within Simple Transformers and then moving back to Transformers to run inference: same model, same pipeline, etc. Ideally, I'd like to use the model in production, which is easier to do from within the Transformers library.

Describe alternatives you've considered I tried to fine-tune using Transformers and ran into endless bugs; Simple Transformers made it significantly easier to fine-tune my model. I've considered writing my own tokenizer, but my data and model are multilingual and tokenizers already exist for these cases; I'm just not sure from the documentation how to actually use them from within Simple Transformers.

Additional context I fine-tuned a multilingual DistilBERT model for NER in Chinese. It achieves great results, but I have to manually script a sentencizer and tokenizer for it to work properly. I'm not able to locate more information on this in the Simple Transformers documentation.

ThilinaRajapakse commented 4 years ago

Tokenizers are a little messy at the moment. The current implementation assumes the tokenizer is the Hugging Face tokenizer class associated with the model:

# All of these classes come from Hugging Face's transformers, except
# ElectraForLanguageModelingModel, which is Simple Transformers' own class.
MODEL_CLASSES = {
    "auto": (AutoConfig, AutoModelWithLMHead, AutoTokenizer),
    "bert": (BertConfig, BertForMaskedLM, BertTokenizer),
    "camembert": (CamembertConfig, CamembertForMaskedLM, CamembertTokenizer),
    "distilbert": (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),
    "electra": (ElectraConfig, ElectraForLanguageModelingModel, ElectraTokenizer),
    "gpt2": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
    "longformer": (LongformerConfig, LongformerForMaskedLM, LongformerTokenizer),
    "openai-gpt": (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
    "roberta": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
}

If you have a pre-trained Chinese tokenizer saved in the format that Hugging Face uses, you can load it by providing the path to the tokenizer files as tokenizer_name.
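
For example (a minimal sketch; the path is a placeholder, and this assumes the language modeling model that the class map above belongs to):

from simpletransformers.language_modeling import LanguageModelingModel

# Load a saved tokenizer by pointing tokenizer_name at its directory.
model = LanguageModelingModel(
    "distilbert",
    "distilbert-base-multilingual-cased",
    args={"tokenizer_name": "/path/to/chinese-tokenizer/"},
)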

Alternatively, you can also train a new tokenizer on your own data (you just have to provide the train_files argument) if you are training the model from scratch.
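
Roughly like this (a sketch; the corpus file and vocabulary size are placeholders): passing model_name=None together with train_files trains a fresh tokenizer before training the model from scratch.

from simpletransformers.language_modeling import LanguageModelingModel

# No pre-trained weights: a new tokenizer is trained on train_files first.
model = LanguageModelingModel(
    "bert",
    None,
    train_files="chinese_corpus.txt",
    args={"vocab_size": 30000},
)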

For the most part, Simple Transformers and Hugging Face Transformers are compatible with each other, but certain models and certain tasks may only be available in one of the libraries. DistilBERT is available in both, though, so the trained model files should be compatible with either library. However, I am not too familiar with how the Pipelines are implemented in Hugging Face, so I can't say for sure.
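
If the saved files are compatible, loading them directly in Transformers would look something like this (a sketch, assuming outputs/ is the directory Simple Transformers saved the trained NER model to):

from transformers import AutoModelForTokenClassification, AutoTokenizer

# The output directory holds both the model weights and the tokenizer files.
tokenizer = AutoTokenizer.from_pretrained("outputs/")
model = AutoModelForTokenClassification.from_pretrained("outputs/")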

Would you mind elaborating on why it's easier to use a model in production from Transformers?

hbeychaner commented 4 years ago

Thanks for the response! The reason I'm reaching out about moving back to Transformers is:

1) The model.predict() method, for whatever reason, just refuses to tokenize text. I load the Hugging Face multilingual DistilBERT model through Simple Transformers, train it on NER data in Chinese, and then try to run predict on a single Chinese sentence. The tokenizer just doesn't function and the whole string is treated as a single token; I'm not sure why that is. It similarly does not tokenize English, Norwegian, or other texts I tested it on.

2) I'm just more familiar with Transformers and have a few packages that act as wrappers for that library.

If there's a simple way to serve a model from Simple Transformers without breaking the tokenizer, I'm happy to switch :)

ThilinaRajapakse commented 4 years ago

I incorrectly assumed that you were doing Language Model training (despite you saying you are doing NER).

There shouldn't be any issues with the predict function for the NER models. The English models should work for sure. Can you share a minimal script where you load the model and call predict()?

hbeychaner commented 4 years ago

Sure! The model has been trained and seems to work fine without any additional tokenizer trouble. But once the model is loaded from disk, tokenization seems not to work properly (it only tokenizes on spaces, so if I manually separate words in other languages with spaces, it works alright...)

from simpletransformers.ner import NERModel

# Create a NERModel
model = NERModel('distilbert', '/home/acorn/distilbert-multi-chinese-ner/', use_cuda=False)
predictions, raw_outputs = model.predict(["Today, Apple Inc. announced their intent to go to Milwaukee and wear their coats on Tuesday."])
print(predictions)
[[{'Today,': 'O'}, {'Apple': 'B-ORG'}, {'Inc.': 'I-ORG'}, {'announced': 'O'}, {'their': 'O'}, {'intent': 'O'}, {'to': 'O'}, {'go': 'O'}, {'to': 'O'}, {'Milwaukee': 'B-LOC'}, {'and': 'O'}, {'wear': 'O'}, {'their': 'O'}, {'coats': 'O'}, {'on': 'O'}, {'Tuesday.': 'O'}]]
predictions, raw_outputs = model.predict(["トランプ大統領 は アメリカ 合衆国 の 大統領 です が、 ホワイト ハウス に 住む代わり に、 マルアラーゴ に ある 彼 の リゾート 地から アップル社 で 働く こと に 時間 を 費やす こと を 好みます。"])
print(predictions)
[[{'トランプ大統領': 'B-PER'}, {'は': 'O'}, {'アメリカ': 'B-LOC'}, {'合衆国': 'I-LOC'}, {'の': 'O'}, {'大統領': 'O'}, {'です': 'O'}, {'が、': 'O'}, {'ホワイト': 'B-LOC'}, {'ハウス': 'I-LOC'}, {'に': 'O'}, {'住む代わり': 'O'}, {'に、': 'O'}, {'マルアラーゴ': 'B-LOC'}, {'に': 'O'}, {'ある': 'O'}, {'彼': 'O'}, {'の': 'O'}, {'リゾート': 'O'}, {'地から': 'O'}, {'アップル社': 'B-ORG'}, {'で': 'O'}, {'働く': 'O'}, {'こと': 'O'}, {'に': 'O'}, {'時間': 'O'}, {'を': 'O'}, {'費やす': 'O'}, {'こと': 'O'}, {'を': 'O'}, {'好みます。': 'O'}]]
predictions, raw_outputs = model.predict(["トランプ大統領 は アメリカ 合衆国 の 大統領 です が、 ホワイト ハウス に 住む代わり に、 マルアラーゴ に ある 彼 の リゾート 地から アップル社 で 働く こと に 時間 を 費やす こと を 好みます。".replace(" ","")])
print(predictions)
[[{'トランプ大統領はアメリカ合衆国の大統領ですが、ホワイトハウスに住む代わりに、マルアラーゴにある彼のリゾート地からアップル社で働くことに時間を費やすことを好みます。': 'B-PER'}]]
predictions, raw_outputs = model.predict(["我 在 江 西 农 业 大 学 学 习 , 但 我 的 朋 友 小 王 想 去 西 安"])
print(predictions)
[[{'我': 'O'}, {'在': 'O'}, {'江': 'B-ORG'}, {'西': 'I-ORG'}, {'农': 'I-ORG'}, {'业': 'I-ORG'}, {'大': 'I-ORG'}, {'学': 'I-ORG'}, {'学': 'O'}, {'习': 'O'}, {',': 'O'}, {'但': 'O'}, {'我': 'O'}, {'的': 'O'}, {'朋': 'O'}, {'友': 'O'}, {'小': 'O'}, {'王': 'B-PER'}, {'想': 'O'}, {'去': 'O'}, {'西': 'B-LOC'}, {'安': 'I-LOC'}]]
predictions, raw_outputs = model.predict(["我 在 江 西 农 业 大 学 学 习 , 但 我 的 朋 友 小 王 想 去 西 安".replace(" ","")])
print(predictions)
[[{'我在江西农业大学学习,但我的朋友小王想去西安': 'O'}]]

ThilinaRajapakse commented 4 years ago

Oh, I get it now! I was thinking of the model's tokenization.

If you don't want to split on spaces (as with Chinese), you can set split_on_space=False when calling the predict() method. In that case, you should provide a list of lists as to_predict, where each sequence is already split the way you want it to be split. This is mentioned in the documentation, but maybe it should be more prominent.
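
Something like this (a sketch of the suggestion above; model is the NERModel loaded in the earlier script):

# Pre-split the sequence yourself, pass a list of lists,
# and disable splitting on spaces.
to_predict = [["我", "在", "江", "西", "农", "业", "大", "学", "学", "习"]]
predictions, raw_outputs = model.predict(to_predict, split_on_space=False)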

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.