explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Training with transfer learning: how-to and which features are affected? #5408

Closed BramVanroy closed 3 years ago

BramVanroy commented 4 years ago

I am contemplating whether to train a spaCy model, which I have wanted to do for a while now. I was reading through the documentation, and the section on transfer learning is very brief, so I was hoping to get more information about how this process actually works.

In particular, I am curious whether pretraining (or using something like spacy-transformers) has any impact on the non-vector parts of the model, by which I mean everything other than the vectors themselves. In other words, would using a pretrained language model improve, for instance, the parser?

The documentation reads:

Instead of initializing spaCy’s convolutional neural network layers with random weights, the spacy pretrain command trains a language model to predict each word’s word vector based on the surrounding words.

I've been trying to find a paper or an explanation of how this works.
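If I read that paragraph correctly, the objective would be roughly something like this (my own sketch of the idea, not spaCy's actual code; the names are made up):

import numpy as np

def pretraining_loss(predicted: np.ndarray, target_vectors: np.ndarray) -> float:
    # The CNN predicts each token's static word vector from the surrounding
    # context; the loss is then the distance between the prediction and the
    # pretrained vector. Mean squared error here, but a cosine objective
    # would be the other obvious choice.
    return float(((predicted - target_vectors) ** 2).mean())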

Looking forward to learning more about this!

honnibal commented 4 years ago

Sorry the docs are so vague on this, it's an experimental feature we haven't published on. I was working on a paper draft around December last year, but I ended up turning my attention to Thinc because I realised there wasn't really a satisfying way to do what I wanted without the models being more exposed.

The pretraining shares the weights in the tok2vec sublayer. This is the CNN and embedding tables in the current architectures. In Thinc v8 parlance, the tok2vec subnetwork has signature Model[List[Doc], List[Floats2d]], i.e. its input is a batch of Doc objects and its output is a batch of arrays, where len(docs) == len(arrays) and all(len(docs[i]) == arrays[i].shape[0] for i in range(len(docs))).
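Concretely, something like the following should hold (a sketch assuming a v3-style pipeline that has a tok2vec component, e.g. en_core_web_sm on the nightly):

import spacy

nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(["I like cookies.", "Transfer learning is neat."]))

# The tok2vec sublayer maps a batch of Docs to one array per Doc,
# with one row per token.
arrays = nlp.get_pipe("tok2vec").model.predict(docs)

assert len(docs) == len(arrays)
assert all(len(docs[i]) == arrays[i].shape[0] for i in range(len(docs)))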

Pretraining the embedding and the CNN will pretrain almost all the weights in the tagger (only the output layer won't be pretrained). For the parser and NER, the unpretrained parts will be the hidden layer that constructs the state vector, and the output layer that maps the state vector to action scores.
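You can see where that split falls by walking a component's model (again a v3-style sketch; layer names will vary by architecture):

import spacy

nlp = spacy.load("en_core_web_sm")
for name in ("tagger", "parser", "ner"):
    # Everything under the tok2vec (or listener) node is shared and can be
    # pretrained; the remaining layers start from random weights.
    print(name, [layer.name for layer in nlp.get_pipe(name).model.walk()])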

As you point out, in the v2 architectures the models all use their own copy of the CNN (even if the architecture is the same). You can load in the pretrained weights for multiple components, so if you're training, say, a parser and a tagger together, they'll each initialize with weights from the pretraining, separately.

This will all be clearer and better in v3.

BramVanroy commented 4 years ago

Perhaps it's not much use asking many more questions now since you are planning to make this clearer in v3, though I cannot help but be curious.

I understand the part about pretraining on spaCy and then transferring to another task/dataset. I might have misunderstood, but how do you do this with pretrained (non-spaCy) transformer models? As an example, which weights in the BERT model do you transfer to where in the spaCy model? Or did I misunderstand the concept, and is it not actually possible to transfer a pretrained BERT model to the spaCy model architecture?

Now that I re-read the documentation and spacy-transformers, I guess that I misread it at first. In the documentation you talk about pretraining a spaCy model but never about transferring a non-spaCy model (so I got that wrong), and in terms of spacy-transformers it seems that the custom attributes provide access to inference through those transformers and are not related to training/fine-tuning. Is that correct?

BramVanroy commented 3 years ago

After the v3 nightly release I cannot help but be excited. So many new things to discover and learn. That also means a lack of understanding on my part of the integration of transfer learning in spaCy. After reading through the documentation, my main questions can be distilled to:

What kind of models are those components themselves?

How can a full spaCy pipeline be trained (tokenizer, parser, tagger) when starting from a pretrained Transformer?

These are not urgent questions, so feel free to jump in when you find the time. Thanks again for all the hard and awesome work!

svlandeg commented 3 years ago

Not sure I understand all your questions, but let me try shedding some light (or confuse you for good):

What kind of models are those components themselves?

It depends on the component and the task it's trying to do, right? The Transformer can be swapped in for any Tok2Vec layer that you have, and should have type Model[List[Doc], List[Floats2d]]. The tagger just adds a softmax output layer on top of that, but other components like the parser can have much more complex layers on top.
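In type terms, the relationship is roughly this (a sketch; the real tagger architecture lives in spacy.ml.models and is registered as spacy.Tagger.v1):

from typing import List
from spacy.tokens import Doc
from thinc.api import Model, Softmax, chain, with_array
from thinc.types import Floats2d

def build_tagger(tok2vec: Model[List[Doc], List[Floats2d]], n_tags: int) -> Model:
    # Any Model[List[Doc], List[Floats2d]] fits here, whether it's the CNN
    # Tok2Vec or a Transformer behind a listener; that shared signature is
    # what makes them swappable. The tagger head is just a per-token softmax.
    return chain(tok2vec, with_array(Softmax(nO=n_tags)))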

How can a full spaCy pipeline be trained (tokenizer, parser, tagger) when starting from a pretrained Transformer?

There's a bit more info in the docs here: https://nightly.spacy.io/usage/embeddings-transformers#training-custom-model

You can have multiple components all listening to the same transformer model, and all passing gradients back to it. By default, all of the gradients will be equally weighted. You can control this with the grad_factor setting, which lets you reweight the gradients from the different listeners. For instance, setting grad_factor = 0 would disable gradients from one of the listeners, while grad_factor = 2.0 would multiply them by 2. This is similar to having a custom learning rate for each component. Instead of a constant, you can also provide a schedule, allowing you to freeze the shared parameters at the start of training.
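For reference, grad_factor is set on the listener in each component's part of the config, along these lines (an excerpt in the spirit of the linked docs, not a complete config):

[components.tagger.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.tagger.model.tok2vec.pooling]
@layers = "reduce_mean.v1"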

BramVanroy commented 3 years ago

Hi Sofie, thanks for taking the time!

Aha, that model directory was precisely what I was looking for. I like to be able to see how the models actually work - what their architecture is, what goes in and what goes out - to better understand what actually happens behind the scenes when I write nlp("I like cookies."). I assume one could hack away and build custom components (e.g. by modifying the tagger.py file), but that those trained models would then not work with other users' spaCy versions? In other words, customizing the architecture of the default components is not straightforward if you want the trained models to stay compatible with the default spaCy library?

Thanks for linking to the documentation. I was looking in the wrong place, namely here it seems. Having never trained a spaCy model before, the config system and making sure that all parts fit together seem challenging, but I look forward to getting my hands dirty when I find the time!

svlandeg commented 3 years ago

Cool. Yes, that model directory is new in v3. Before, the definitions and parameters would be kind of scattered around the code base, and much more difficult to change because of hidden & overridden defaults etc. All that is different now with the config system. The models are also described in detail here: https://nightly.spacy.io/api/architectures

I assume that one could hack away and build custom components (e.g. by modifying the tagger.py file), but that those trained models would then not work on other users' spaCy version? In other words, customizing the architecture of the default components is not straightforward if you want to make the trained models compatible with the default spaCy library?

You shouldn't be hacking at the tagger.py file. Instead, write a similar function, adjust it, and register it with @registry.architectures.register("my_tagger.v1"). Then you'll be able to use it in the config:

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "my_tagger.v1"
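The registered function itself could look something like this (a minimal sketch with a made-up extra hidden layer, just to show that the architecture is yours to change):

from typing import List, Optional
from spacy.tokens import Doc
from spacy.util import registry
from thinc.api import Maxout, Model, Softmax, chain, with_array
from thinc.types import Floats2d

@registry.architectures.register("my_tagger.v1")
def build_my_tagger(
    tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None
) -> Model:
    # Like the stock tagger, but with an extra Maxout hidden layer between
    # the tok2vec output and the softmax, purely as an illustration.
    return chain(tok2vec, with_array(chain(Maxout(nO=128), Softmax(nO=nO))))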

That is the beauty of v3: you can very easily customize the models now! See also: https://nightly.spacy.io/usage/layers-architectures

I'd advise you to go through all the new docs in a bit more detail; I know it's a lot, but I promise you it's worth it ;-)

github-actions[bot] commented 3 years ago

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.