explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Adding Universal Language Model Fine-tuning (ULMFiT) pre-trained LM to spaCy and allowing a simple way to train new models #2342

Closed jmizgajski closed 5 years ago

jmizgajski commented 6 years ago

Feature description

Universal Language Model Fine-tuning for Text Classification presented a novel method for fine-tuning a pre-trained, general-purpose language model to a particular classification task, achieving beyond-state-of-the-art results (an 18-24% reduction in error rate) on multiple benchmark text classification tasks. The fine-tuning requires very few labeled examples (as few as 100) to achieve very good results.

Here is an excerpt from the abstract, which provides a good TL;DR of the paper (duh):

Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100× more data. We open-source our pretrained models and code.

I propose that spaCy adds their pre-trained models and a simple way to fine-tune them to a new task as a core feature of the library.
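To make the proposed workflow concrete, here is a rough sketch of the ULMFiT fine-tuning procedure as it later appeared in fastai v1's text API (this is fastai's interface, not anything in spaCy; the paths, dataset, and hyperparameters are placeholders):

```python
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)

# Stage 1: fine-tune the pre-trained language model on the target-domain corpus
data_lm = TextLMDataBunch.from_csv("data/", "texts.csv")  # placeholder dataset
lm_learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
lm_learn.fit_one_cycle(1, 1e-2)
lm_learn.save_encoder("ft_encoder")

# Stage 2: train a classifier head on top of the fine-tuned encoder
data_clas = TextClasDataBunch.from_csv("data/", "texts.csv", vocab=data_lm.vocab)
clas_learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clas_learn.load_encoder("ft_encoder")
clas_learn.fit_one_cycle(1, 1e-2)  # the paper also uses gradual unfreezing here
```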

Could the feature be a custom component or spaCy plugin?

If so, we will tag it as project idea so other users can take it on.

This seems like a core feature for spaCy, greatly increasing its industrial potential. I would argue for making it a first-class citizen, if the authors and the licensing of this work permit that.

wojciak commented 6 years ago

This! 👍

ngoel17 commented 6 years ago

Like It.

sebastianruder commented 6 years ago

Author here. I'd love to see this happen, and I'm sure @jph00 would also be on board. Fast.ai is working on pre-trained models for other languages, and we'll be working to simplify the code and make it more robust.

jph00 commented 6 years ago

For sure - I discussed the basic idea of LM fine-tuning with @honnibal recently. I'd be happy to improve the integration between fastai's language modeling, our forthcoming model zoo, and spaCy. Our model zoo should work fine with anything based on PyTorch - working with Thinc would require porting the architecture and weights, of course.

(Note that this would also require porting the various regularization approaches in AWD-LSTM to Thinc, since they're critical to this approach.)
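For context, one of those regularizers is variational ("locked") dropout, which samples a single dropout mask per sequence and reuses it at every timestep. A minimal PyTorch sketch of the idea (just an illustration of what would need porting, not the fastai implementation):

```python
import torch.nn as nn

class LockedDropout(nn.Module):
    """Variational ("locked") dropout: sample one mask per sequence and
    reuse it at every timestep, rather than resampling per step."""
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):
        # x: (batch, seq_len, features)
        if not self.training or self.p == 0.0:
            return x
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - self.p)
        return x * mask / (1 - self.p)
```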

DuyguA commented 6 years ago

Would love to do the pre-trained Turkish model!

honnibal commented 6 years ago

Super keen on this! @jph00, the vision for plugging in other libraries is to have Thinc as a thin wrapper on top. I've just merged a PR on this and have fixed up an example of wrapping a BiLSTM model and inserting it into a Thinc model: https://github.com/explosion/thinc/blob/master/examples/pytorch_lstm_tagger.py#L122

You can find the wrapper here: https://github.com/explosion/thinc/blob/master/thinc/extra/wrappers.py#L13

This wrapping approach is the long-standing plan for plugging "foreign" models into spaCy and Prodigy. We want to have similar wrappers for TensorFlow, DyNet, MXNet, etc. The Thinc API is pretty minimal, so it's easy to wrap this way.
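For anyone following along, the wrapping pattern from those links looks roughly like this (a sketch against the thinc.extra.wrappers API of that era; the toy model and shapes are placeholders):

```python
import numpy
import torch.nn as nn
from thinc.extra.wrappers import PyTorchWrapper  # the wrapper linked above

# Any array-in/array-out PyTorch module can be wrapped so that it exposes
# Thinc's model API and can be dropped into a Thinc/spaCy pipeline.
torch_model = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 2))
model = PyTorchWrapper(torch_model)

X = numpy.zeros((8, 300), dtype="f")
Y, backprop = model.begin_update(X)  # forward pass, returns a backprop callback
dX = backprop(Y)                     # gradients flow back through the PyTorch graph
```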

Btw, as well as a plugin, I'm very interested in finding the right solution for pre-training the "embed" and "encode" steps in spaCy's NER, parser, etc. The catch is that our performance target is 10k words per second per CPU core, which I think means we can't use BiLSTM. The CNN architecture I've got is actually pretty good, and we're currently only a little off the target (7.5k words per second in my latest tests).

slavakurilyak commented 6 years ago

Going from initializing the first layer of our models to pretraining the entire model with hierarchical representations is a must! For additional inspiration, check out "NLP's ImageNet moment has arrived" by The Gradient.

ines commented 5 years ago

See the latest nightly releases: https://github.com/explosion/spaCy/releases/tag/v2.1.0a3

superzadeh commented 5 years ago

Amazing work @ines @honnibal and all the other contributors, can't wait to give this a shot!

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.