TurkuNLP / FinBERT

BERT model trained from scratch on Finnish

Ideas #3

Closed R4ZZ3 closed 4 years ago

R4ZZ3 commented 4 years ago

Hi,

First of all, thanks for the great embeddings. I was able to combine these with the ktrain library and train a 5-class classifier of Suomi24 topics (around 10 000 samples), getting 97% accuracy after only one epoch.
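
For anyone who wants to reproduce this, the ktrain flow looks roughly like the sketch below; the model name, class names and hyperparameters there are illustrative assumptions rather than my exact setup.

import ktrain
from ktrain import text

# Tiny illustrative data; the real run used ~10 000 Suomi24 samples.
x_train = ["Esimerkkilause urheilusta.", "Esimerkkilause politiikasta."]
y_train = [0, 1]  # integer labels indexing into class_names

# Model name, class names and hyperparameters are assumptions, not the exact setup.
t = text.Transformer("TurkuNLP/bert-base-finnish-cased-v1", maxlen=128,
                     class_names=["urheilu", "politiikka", "talous", "viihde", "tiede"])
trn = t.preprocess_train(x_train, y_train)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, batch_size=16)
learner.fit_onecycle(2e-5, 1)  # one epoch, as reported above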

What are your upcoming ideas for taking this further?

Could you please share these embeddings in a form that works with Flair? https://github.com/zalandoresearch/flair/blob/master/resources/docs/embeddings/TRANSFORMER_EMBEDDINGS.md

I would also appreciate it if you could train other models such as RoBERTa to see whether you get any improvements over BERT.

haamis commented 4 years ago

Hi,

Glad to hear you got it working.

I looked at the Flair examples you linked and combined one of them with our instructions for using the model with transformers:

from flair.data import Sentence
from flair.embeddings import BertEmbeddings
import transformers

# Register the FinBERT files in the transformers archive maps so that the
# model name resolves to the TurkuNLP download URLs.
transformers.BERT_PRETRAINED_MODEL_ARCHIVE_MAP["bert-base-finnish-cased-v1"]="http://dl.turkunlp.org/finbert/torch-transformers/bert-base-finnish-cased-v1/pytorch_model.bin"
transformers.BERT_PRETRAINED_CONFIG_ARCHIVE_MAP["bert-base-finnish-cased-v1"]="http://dl.turkunlp.org/finbert/torch-transformers/bert-base-finnish-cased-v1/config.json"
transformers.tokenization_bert.PRETRAINED_VOCAB_FILES_MAP["vocab_file"]["bert-base-finnish-cased-v1"]="http://dl.turkunlp.org/finbert/torch-transformers/bert-base-finnish-cased-v1/vocab.txt"
transformers.tokenization_bert.PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES["bert-base-finnish-cased-v1"]=512
transformers.tokenization_bert.PRETRAINED_INIT_CONFIGURATION["bert-base-finnish-cased-v1"]={'do_lower_case': False}

# init embedding
embedding = BertEmbeddings("bert-base-finnish-cased-v1")

# create a sentence
sentence = Sentence('Ruoho on vihreää .')

# embed words in sentence
embedding.embed(sentence)
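
To sanity-check the output you can look at the per-token vectors; this is just an illustration on top of the example above (token.embedding is a torch tensor in Flair):

# inspect the per-token FinBERT vectors
for token in sentence:
    print(token.text, token.embedding.shape)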

This seems to work, although I didn't test anything beyond this example. We expect to make a pull request to transformers fairly soon so that the model is included out of the box.

As for training RoBERTa: while I think it is reasonable to expect improvements in the model's capability, training one is computationally very expensive. Cost-wise, RoBERTa is more or less a BERT-Large that has been trained for approximately 10 times longer, whereas our model is the smaller BERT-Base variant. The RoBERTa paper mentions using 1024 V100 GPUs; we used 8 of those GPUs for training FinBERT. In other words, we don't really have the resources to train a model like RoBERTa.
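
As a rough back-of-envelope (my own estimate, not a figure from the paper): BERT-Large has about 340M parameters versus about 110M for BERT-Base, so roughly a 3x larger model trained for roughly 10x as long works out to something on the order of 30x our training compute, and 1024 GPUs versus 8 is a 128x difference in hardware alone.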

R4ZZ3 commented 4 years ago

Hi,

Thanks for the input. I was able to use these embeddings with Flair by using DocumentPoolEmbeddings. Now I can combine them with FlairEmbeddings, since Flair already provides FlairEmbeddings for Finnish. I will report results once I have them ready. Case closed
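
For the record, the combination looks roughly like the sketch below; the 'fi-forward'/'fi-backward' identifiers are my assumption for the Finnish FlairEmbeddings, so check the Flair docs for the exact names.

from flair.data import Sentence
from flair.embeddings import BertEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings

# BertEmbeddings for FinBERT, registered via the archive maps as in the comment above
bert = BertEmbeddings("bert-base-finnish-cased-v1")

# 'fi-forward'/'fi-backward' are assumed identifiers for the Finnish FlairEmbeddings
flair_fwd = FlairEmbeddings("fi-forward")
flair_bwd = FlairEmbeddings("fi-backward")

# pool all embeddings into one fixed-size document vector (mean pooling by default)
doc_embeddings = DocumentPoolEmbeddings([bert, flair_fwd, flair_bwd])

sentence = Sentence('Ruoho on vihreää .')
doc_embeddings.embed(sentence)
print(sentence.get_embedding().shape)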