ThilinaRajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
https://simpletransformers.ai/
Apache License 2.0

Adding BERTweet to the available models #607

Closed · manueltonneau closed this issue 3 years ago

manueltonneau commented 4 years ago

A BERT-based model called BERTweet, pre-trained on English tweets with the RoBERTa pre-training procedure, was recently made available on Hugging Face.

I would love to be able to use it in my simpletransformers pipeline. Note that its tokenizer takes a normalization argument, which differs from standard tokenization.
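For reference, enabling that normalization looks roughly like this (a sketch; `normalization` is a BertweetTokenizer-specific flag, and I'm assuming AutoTokenizer forwards it, using the hub id that appears later in this thread):

```python
from transformers import AutoTokenizer

# BertweetTokenizer accepts a `normalization` flag that normalizes user
# handles, URLs, and emotion icons before BPE; plain BERT/RoBERTa
# tokenizers have no such argument.
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)
```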

If you point me to the relevant .py files I need to modify, I'm happy to add it myself and open a PR :)

ThilinaRajapakse commented 4 years ago

This hasn't been released on HF yet, right?

Once it is released, it probably won't require too many changes.

For example, with ClassificationModel:

The model, config, and tokenizer should be added here.

We can probably just check the tokenizer type and set the normalization if it's a BertweetTokenizer here.
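Concretely, that could look something like the sketch below. It follows the existing MODEL_CLASSES pattern in simpletransformers; the exact classes for BERTweet and the `build_tokenizer` helper are assumptions pending the transformers release:

```python
from transformers import (
    RobertaConfig,
    RobertaForSequenceClassification,
    BertweetTokenizer,  # assumption: the tokenizer class once BERTweet lands in transformers
)

MODEL_CLASSES = {
    # ... existing entries ("bert", "roberta", ...) ...
    "bertweet": (RobertaConfig, RobertaForSequenceClassification, BertweetTokenizer),
}

# Hypothetical helper: when instantiating the tokenizer, enable tweet
# normalization for BertweetTokenizer only; other tokenizers don't accept
# the kwarg.
def build_tokenizer(tokenizer_class, model_name, **kwargs):
    if tokenizer_class is BertweetTokenizer:
        return tokenizer_class.from_pretrained(model_name, normalization=True, **kwargs)
    return tokenizer_class.from_pretrained(model_name, **kwargs)
```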

manueltonneau commented 4 years ago

I think it has been released.

I tried using it with ClassificationModel, but it gave me issues when using model_type = bert, probably because it needs its own BertweetModel and BertweetTokenizer.
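A minimal sketch of the failing setup (assuming the standard ClassificationModel signature; the model path is illustrative):

```python
from simpletransformers.classification import ClassificationModel

# Treating BERTweet as a vanilla BERT checkpoint; this is what surfaces the
# tokenizer mismatch described above.
model = ClassificationModel("bert", "vinai/bertweet-base", use_cuda=False)
```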

Awesome, thanks for the links, will look into adding this and open a PR!

ThilinaRajapakse commented 4 years ago

[screenshot: BertweetModel not available in transformers]

Looks like BertweetModel is not added yet.

Sure, that'll be great!

manueltonneau commented 4 years ago

Right, seems like it's WIP. Will look into it when the PR on transformers is merged. Thanks for your swift reply!

AnilB87 commented 4 years ago

Hi @ThilinaRajapakse, is the BERTweet model accessible now via the simpletransformers library? It has been added to the list of pre-trained models on Hugging Face: https://huggingface.co/vinai/bertweet-base.

Please let me know.

ThilinaRajapakse commented 4 years ago

I gave it a quick test but it seems to be using a different tokenizer than the default BERT one, so that's causing some issues.

The issue is that the tokenizer is creating None values in the input features, if anyone wants to investigate.
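A quick diagnostic sketch for anyone investigating (assumes access to the `features` list built in `load_and_cache_examples`; the snippet itself is not simpletransformers code):

```python
# Count feature rows whose input_ids contain None; a non-empty result
# reproduces the TypeError raised when torch.tensor() builds the input tensors.
bad = [f for f in features if any(tok_id is None for tok_id in f.input_ids)]
print(f"{len(bad)} of {len(features)} features contain None input_ids")
```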

manueltonneau commented 3 years ago

Hi all! So I looked into it and got the same issue as the one you mentioned, @ThilinaRajapakse, when trying to fine-tune the model on binary classification (using bert as model_type):


```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in
----> 1 model.train_model(df)

~/Documents/envs/test_bertweet/lib/python3.8/site-packages/simpletransformers/classification/classification_model.py in train_model(self, train_df, multi_label, output_dir, show_running_loss, args, eval_df, verbose, **kwargs)
    342             for i, (text, label) in enumerate(zip(train_df.iloc[:, 0], train_df.iloc[:, 1]))
    343         ]
--> 344         train_dataset = self.load_and_cache_examples(train_examples, verbose=verbose)
    345         train_sampler = RandomSampler(train_dataset)
    346         train_dataloader = DataLoader(

~/Documents/envs/test_bertweet/lib/python3.8/site-packages/simpletransformers/classification/classification_model.py in load_and_cache_examples(self, examples, evaluate, no_cache, multi_label, verbose, silent)
   1030         features = [feature for feature_set in features for feature in feature_set]
   1031
-> 1032         all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
   1033         all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
   1034         all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)

TypeError: an integer is required (got type NoneType)
---------------------------------------------------------------------------
```

Yet, when using the command detailed [here](https://github.com/VinAIResearch/BERTweet#-example-usage), the tokenizer output contains no None values:

![image](https://user-images.githubusercontent.com/29440170/94361846-998b7200-00b7-11eb-8daa-f5f4e2c974d7.png)

Looking into the (now merged) [PR](https://github.com/huggingface/transformers/pull/6129), it seems that BERTweet has a specific tokenizer (BertweetTokenizer). It also seems to use the RoBERTa configuration (see the modifications to `src/transformers/tokenization_auto.py` in the PR). I will add changes following your instructions above and create a PR.

PS: they mention in the [README](https://github.com/VinAIResearch/BERTweet) that the model has the same configuration as BERT-base and only shares its pre-training procedure with RoBERTa; will look into it.
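For comparison, standalone usage along the lines of the BERTweet README (outside simpletransformers) produces well-formed ids. Roughly (the `normalization` kwarg being forwarded through AutoTokenizer is my assumption):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load BERTweet directly from the Hub with tweet normalization enabled.
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)

# Example tweet from the BERTweet README.
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
input_ids = torch.tensor([tokenizer.encode(line)])  # no None values here

with torch.no_grad():
    features = bertweet(input_ids)
```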
manueltonneau commented 3 years ago

Just created a PR :)