IndicoDataSolutions / finetune

Scikit-learn style model finetuning for NLP
https://finetune.indico.io
Mozilla Public License 2.0
701 stars 81 forks

Load a model trained to predict next character or subword instead of next word #125

Closed mathetes87 closed 5 years ago

mathetes87 commented 6 years ago

Hi,

Character-based language modeling has its advantages over word-level prediction, and I'm wondering whether I'll be able to use this wrapper or not.

My plan is to train a model using Google's T2T as documented here. The model can be trained using subword encoding (the default), character-level, or word-level encodings. If I were to use any of these options, would the saved model work out of the box with finetune? Is there anything I should watch out for when training the model?

The repo looks very well made; I hope this will be seamless. Does anyone know?

madisonmay commented 6 years ago

Hi there, thanks for the interest!

The main advantage of working with character-level generative models is that the discrete space you're working with is much smaller -- there are roughly 97 English characters in common usage if we include all punctuation marks, whereas a word-level vocabulary contains many thousands of entries. Just storing all of those word embeddings requires a lot of memory, and including them in the model adds many, many parameters, so the computational cost on this front is much higher than for a character-level model.

This is true, but it comes at the cost of longer sequences and greater difficulty modeling long-term dependencies, so character-level language modeling isn't a universal win. Byte-pair encoding, which this repo uses (described below), is generally regarded as a solid middle ground.
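As a rough back-of-the-envelope illustration (plain Python, nothing to do with the library itself): the same sentence is several times longer as a character sequence than as a word sequence, which is the sequence-length cost described above.

sentence = "The quick brown fox jumps over the lazy dog"

char_tokens = list(sentence)     # 43 tokens drawn from an alphabet of ~97 characters
word_tokens = sentence.split()   # 9 tokens drawn from a vocabulary of many thousands of words

print(len(char_tokens), len(word_tokens))   # 43 9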

Misspelled words or other words not appearing in the vocabulary are usually treated as a special "unknown" token. This suffices for some practical contexts, but it also means the model is not terribly flexible when it comes to generating new text.

This model uses byte-pair encoding, which falls somewhere between character-based language modeling and traditional word / token based language modeling. When a word does not appear in the vocabulary, it is represented by the longest subwords that do appear in the vocabulary. So even if the model had never seen the word "earthquake" before, it could use the meanings of "earth" and "quake" to infer that "earthquake" should mean something about shaking ground. This also means your vocabulary can be a bit smaller, and that you can rely on this fallback rather than keeping an explicit token for 100k+ words. You can read "Neural Machine Translation of Rare Words with Subword Units" for more information there.
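To make the "earthquake" example concrete, here's a toy greedy longest-prefix segmenter. This is only an illustration of the subword fallback idea, not finetune's actual byte-pair encoder (real BPE is learned by iteratively merging frequent symbol pairs), and the vocabulary below is made up:

def greedy_subwords(word, vocab):
    # Repeatedly take the longest prefix of `word` that appears in `vocab`.
    pieces = []
    while word:
        for end in range(len(word), 0, -1):
            if word[:end] in vocab:
                pieces.append(word[:end])
                word = word[end:]
                break
        else:
            # No prefix matched at all: fall back to a single character.
            pieces.append(word[0])
            word = word[1:]
    return pieces

vocab = {"earth", "quake", "the", "un"}
print(greedy_subwords("earthquake", vocab))   # ['earth', 'quake']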

My plan is to train a model using Google's T2T as documented here. The model can be trained using subword encoding (the default), character-level, or word-level encodings. If I were to use any of these options, would the saved model work out of the box with finetune? Is there anything I should watch out for when training the model?

This will unfortunately not work. This repository assumes that you're starting from the pre-trained model provided by OpenAI for the paper "Improving Language Understanding by Generative Pre-Training", which ships with this package. The tensor2tensor library seems to be designed mostly for training models from scratch, but their default recommendations (namely, the use of a transformer model) align with what's already used by default in finetune. If your goal is to solve a supervised classification / regression / sequence labeling task, I would see if the pre-trained finetune model works for you; otherwise you could try to finetune on an unsupervised objective on a large corpus and then swap over to a supervised loss to solve your end task (see below).

model.fit(unlabeledX)          # unsupervised finetuning on raw text (language-model objective)
model.fit(labeledX, labeledY)  # then supervised finetuning on the labeled end task
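Spelled out a bit more fully, the flow looks something like the sketch below (the toy data and variable names are placeholders invented for illustration, not part of the original example):

from finetune import Classifier

unlabeled_texts = ["raw in-domain text ...", "more raw text ..."]
train_texts = ["great product", "terrible service"]
train_labels = ["positive", "negative"]

model = Classifier()                   # starts from the shipped pre-trained transformer weights
model.fit(unlabeled_texts)             # optional unsupervised step: finetune the language model on raw text
model.fit(train_texts, train_labels)   # supervised finetuning on the end task
predictions = model.predict(["awful experience"])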
mathetes87 commented 6 years ago

Thanks a lot for the extensive response!

This model uses byte-pair encoding, which falls somewhere between character-based language modeling and traditional word / token based language modeling.

Gotcha

The tensor2tensor library seems to be designed mostly for training models from scratch

Sorry for being vague before; what I actually want to do is train a transformer model in Spanish and then do some classification tasks.

their default recommendations (namely, the use of a transformer model) align with what's already used by default in finetune

That statement left me a little puzzled. Do you mean their model is similar to what finetune expects but not exactly the same?

you could try to finetune on an unsupervised objective on a large corpus and then swap over to a supervised loss to solve your end task

Given that it is a totally different language, can I do as you say? Or were you assuming that I needed a model for the English language?

Again, thanks a lot for your help

madisonmay commented 6 years ago

Thanks for providing more specifics, this is super helpful!

There may be a way to use a model trained in tensor2tensor with finetune, but it would be a very high-effort endeavor, and I would be worried that a small difference in implementation could end up causing major problems that might be hard to diagnose.

I was definitely assuming you needed an English language model (sorry!). Given that you're looking for a Spanish language model, you will probably need to train a new base model from scratch, and finetune is currently missing some of the components you would need to make this work (a method for re-fitting the byte-pair encoder and proper weight initialization).

Given all of this, your best bet might be to go with something like ULMFiT from FastAI. It's a bit different (based on LSTM models instead of transformer models), but the base idea is the same (finetuning language models for classification tasks), and it has better support for training from scratch. There has already been quite a lot of work to make ULMFiT work well for languages other than English, including Spanish in particular.

We intend to add this functionality to finetune in the future, but unfortunately there's a fair amount of work to put in before that's possible.

benleetownsend commented 5 years ago

@mathetes87 I'm going to close this for now; if you require further assistance, feel free to reopen.