Hello @dongfang91,
yes, that is possible. You can do this by loading a saved language model and passing it to the language model trainer, e.g.:
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus
# load the saved language model
model = LanguageModel.load_language_model('your/saved/model.pt')
# make sure to use the same dictionary as the saved model
dictionary = model.dictionary
# load your new corpus (a folder with a train/ directory holding the training splits, plus valid.txt and test.txt)
corpus = TextCorpus('path/to/your/corpus', dictionary, model.is_forward_lm, character_level=True)
# pass corpus and pre-trained language model to the trainer
trainer = LanguageModelTrainer(model, corpus)
# train with your favorite parameters
trainer.train('resources/taggers/language_model', learning_rate=5)
You may need to experiment with different learning rates. I think a corpus switch will confuse the learning, so the first epochs might be very unstable. You could try a learning rate of 5 or even lower.
We actually never tried switching corpora, so please let us know how well this works!
Yes, sure! Thanks a lot!
@dongfang91 I'm about to do this as well, continuing training on the LMs associated with the forward/backward Flair embeddings with another corpus of about 800M words. Did you find anything of note? I'm especially interested in the learning rate and other tuning parameters.
Thanks!
@aronszanto sounds interesting! Will you share your results / experience? This could help others that want to do a similar thing.
@alanakbik am I correct in assuming that I can only use the method you described above if there are no previously unseen words in the specific domain corpus? If that is correct, is there anything I can do if there are some words in my new corpus that don't show up in the original corpus the model was trained on?
Yeah, that is generally correct, but we train our models at the character level. So the only way you would not be able to handle unseen words is if they contained previously unseen characters, for instance if you continued training on Arabic text with a language model that was trained with a dictionary of Latin characters. New words made up of the same characters are fine.
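For example, here is a minimal sketch (the corpus file path is just a placeholder) that checks whether your new corpus contains any characters the saved model's dictionary has never seen:
from flair.models import LanguageModel
# load the saved model and collect the characters its dictionary knows
model = LanguageModel.load_language_model('your/saved/model.pt')
known_chars = set(model.dictionary.get_items())
# scan one of your new corpus files for characters missing from that dictionary
with open('path/to/your/corpus/train/train_split_1.txt', encoding='utf-8') as f:
    unseen_chars = {c for c in f.read() if c not in known_chars}
print('characters missing from the LM dictionary:', sorted(unseen_chars))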
@alanakbik Is there a way I can initialize a LanguageModel from an existing embedding like WordEmbeddings('en-crawl')? It's not immediately clear to me where the 'your/saved/model.pt' file is coming from.
@MarcioPorto language models are trained at the character level in our case, so you cannot initialize them with word embeddings. You can either train your own language model from scratch by following these instructions, which will produce the model file to load, or you can use an existing language model that ships with Flair by accessing the model inside the FlairEmbeddings, like this:
model: LanguageModel = FlairEmbeddings('news-forward').lm
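If you then want to continue training that model on your own data, a minimal sketch (the corpus path is a placeholder; the calls mirror the snippet earlier in this thread) could look like this:
from flair.embeddings import FlairEmbeddings
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus
# take the language model that backs the shipped embeddings as the starting point
model = FlairEmbeddings('news-forward').lm
# reuse its character dictionary and direction for the new corpus
corpus = TextCorpus('path/to/your/corpus', model.dictionary, model.is_forward_lm, character_level=True)
# fine-tune the pre-trained model on the new corpus
trainer = LanguageModelTrainer(model, corpus)
trainer.train('resources/taggers/language_model', learning_rate=5)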
@alanakbik Does flair currently support a way to fine-tune BERT embeddings natively, or would I have to follow the procedure described in the huggingface/pytorch-transformers documentation?
@MarcioPorto we don't currently. We will add a native method for fine-tuning FlairEmbeddings soon. Maybe with the new pytorch-transformers library, we can also add such options for other embeddings in the future.
Hi,
The language model is trained on the 1-billion-word corpus. I want to continue training it on my specific domain corpus; can I do that in Flair?
Thanks!