flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Is there any way I can continue training the language model on a specific domain #121

dongfang91 closed this issue 6 years ago

dongfang91 commented 6 years ago

Hi,

The language model is trained on the 1-billion-word corpus. I want to continue training it on my domain-specific corpus; can I do that in Flair?

Thanks!

alanakbik commented 6 years ago

Hello @dongfang91,

yes that is possible. You can do this by loading a saved language model and passing this model to the language model trainer, e.g.:

from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# load the saved language model
model = LanguageModel.load_language_model('your/saved/model.pt')

# make sure to reuse the dictionary and direction of the saved model
dictionary = model.dictionary
is_forward_lm = model.is_forward_lm

# load your new corpus
corpus = TextCorpus('path/to/your/corpus', dictionary, is_forward_lm, character_level=True)

# pass corpus and pre-trained language model to trainer
trainer = LanguageModelTrainer(model, corpus)

# train with your favorite parameters
trainer.train('resources/taggers/language_model', sequence_length=250, learning_rate=5)

You may need to experiment with different learning rates. A corpus switch will likely confuse the learning, so the first epochs might be very unstable. You could try a learning rate of 5 or even lower.
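For reference, here is a sketch of a more conservative training call for the corpus switch, reusing the trainer from the snippet above; the hyperparameter values are illustrative, not recommendations:

# start low and let the scheduler anneal the learning rate
# when the validation loss plateaus
trainer.train('resources/taggers/language_model',
              sequence_length=250,   # characters per training sequence
              mini_batch_size=100,
              learning_rate=2,       # much lower than the from-scratch default of 20
              anneal_factor=0.25,    # factor to shrink the learning rate by on a plateau
              patience=10)           # how long to tolerate a plateau before annealing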

We actually never tried switching corpora, so please let us know how well this works!

dongfang91 commented 6 years ago

Yes, sure! Thanks a lot!

aronszanto commented 5 years ago

@dongfang91 I'm about to do this as well, continuing training on the LMs associated with the Forward/Backward Flair Embeddings with another corpus of about 800M words. Did you find anything of note? I'm especially interested in the learning rate and other tuning params.

Thanks!

alanakbik commented 5 years ago

@aronszanto sounds interesting! Will you share your results / experience? This could help others that want to do a similar thing.

MarcioPorto commented 5 years ago

@alanakbik am I correct in assuming that I can only use the method you described above if there are no previously unseen words in the domain-specific corpus? If that is correct, is there anything I can do if some words in my new corpus don't appear in the original corpus the model was trained on?

alanakbik commented 5 years ago

That would generally be correct for a word-level model, but we train our models at the character level. So the only unseen words you could not handle are those consisting of previously unseen characters, for instance, if you applied a language model trained with a dictionary of Latin characters to Arabic text. New words made of the same characters are fine.
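If you want to sanity-check character coverage before fine-tuning, here is a minimal sketch (the corpus file path is a placeholder, and model is the loaded LanguageModel from the snippet above):

# collect every character the saved model's dictionary knows about
known_chars = set(model.dictionary.get_items())

# scan a corpus file for characters the dictionary has never seen
with open('path/to/your/corpus/train.txt', encoding='utf-8') as f:
    unseen = {char for line in f for char in line if char not in known_chars}

print('characters not in the LM dictionary:', sorted(unseen))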

MarcioPorto commented 5 years ago

@alanakbik Is there a way I can initialize a LanguageModel from an existing embedding like WordEmbeddings('en-crawl')? It's not immediately clear to me where the 'your/saved/model.pt' file is coming from.

alanakbik commented 5 years ago

@MarcioPorto language models in our case are trained at the character level, so you cannot initialize one with word embeddings. You can either train your own language model from scratch by following these instructions, which will produce a model file you can load.
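For reference, a minimal from-scratch sketch along the lines of those instructions (the sizes and paths are illustrative):

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# load the default character dictionary shipped with flair
dictionary: Dictionary = Dictionary.load('chars')

# the corpus folder is expected to contain test.txt, valid.txt and a train/ folder of splits
corpus = TextCorpus('path/to/your/corpus', dictionary, forward=True, character_level=True)

# instantiate an untrained character-level language model (small sizes for illustration)
language_model = LanguageModel(dictionary, is_forward_lm=True, hidden_size=128, nlayers=1)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/language_model', sequence_length=250, mini_batch_size=100, max_epochs=10)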

Or you can use an existing language model that is shipped with Flair, by accessing the model in the FlairEmbeddings, like this:

model: LanguageModel = FlairEmbeddings('news-forward').lm
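Putting the two snippets together, a minimal sketch for continuing training on a shipped model (the corpus path and output directory are placeholders):

from flair.embeddings import FlairEmbeddings
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# take the character LM behind the shipped forward news embeddings
model: LanguageModel = FlairEmbeddings('news-forward').lm

# reuse its dictionary and direction for the new corpus
corpus = TextCorpus('path/to/your/corpus', model.dictionary, model.is_forward_lm, character_level=True)

trainer = LanguageModelTrainer(model, corpus)
trainer.train('resources/finetuned_language_model', sequence_length=250, learning_rate=5)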

MarcioPorto commented 5 years ago

@alanakbik Does flair currently support a way to fine-tune BERT embeddings natively, or would I have to follow the procedure described in the huggingface/pytorch-transformers documentation?

alanakbik commented 5 years ago

@MarcioPorto we don't support that currently. We will add a native method for fine-tuning FlairEmbeddings soon. With the new pytorch-transformers library, we may also be able to add such options for other embeddings in the future.
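Until then, here is a minimal masked-LM fine-tuning sketch with pytorch-transformers along the lines of the huggingface documentation (the model name, text, and hyperparameters are illustrative):

import torch
from pytorch_transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.train()

optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)

text = 'a domain-specific sentence to adapt the model to'
input_ids = torch.tensor([tokenizer.encode(text)])

# one illustrative step: use the inputs themselves as labels;
# real MLM training would randomly mask ~15% of the tokens first
outputs = model(input_ids, masked_lm_labels=input_ids)
loss = outputs[0]
loss.backward()
optimizer.step()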