explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

RFE Doc spaCy2: Append a custom NER trained model to an existing model #1182

Closed asyd closed 6 years ago

asyd commented 7 years ago

I'm a new user of spaCy, and given the status of the develop branch, I chose to use it. My goal is to use the en_core_web_sm model together with my own trained model.

I followed this documentation to learn some new entities, and everything works well when I use the following code to test:

    nlp = spacy.lang.en.English(pipeline=['tensorizer', 'ner'])
    nlp.from_disk('path_to/model')

The model is able to detect my new entities. However, I lose the other capabilities of the en_core_web_sm model. So I'm wondering how to load that model and then "merge" it with the ner pipeline of my trained model.

Note that I tried a few things, but whatever I tried, it failed with either a "can't read file or directory" error or ValueError: could not broadcast input array from shape.

So my general request is: can you provide a full example of loading a default model, training it on a specific pipeline component (like ner), and then using that model without losing the capabilities of the original model?

Thanks!

mikeatm commented 7 years ago

Are you attempting to take the pretrained model and extend it with custom NER? It seems like a known issue right now; it's not yet possible in v2: https://github.com/explosion/spaCy/issues/1130#issuecomment-308743015

asyd commented 7 years ago

Thank you.

honnibal commented 7 years ago

This is definitely a problem at the moment.

The multi-task learning component is currently making this tricky. I need to add a mode where the models can own their own tensorizer component. Currently if you train only one model in the pipeline, it changes the CNN representation, and the other models don't have a chance to adjust. So for instance, if you just update the NER model, it will ruin the accuracy of the parser and tagger.

tcphw5 commented 7 years ago

I was having this same issue. Thanks for the response. If you could mention this issue in the work-in-progress or bug-fix notes, that would be very helpful.

honnibal commented 7 years ago

I think I have a good solution.

To recap, the current setup is roughly like this:

    def update_pipeline(self, doc, gold): # NB: This is pseudocode, not the literal API
        # Reminder about backpropagation:
        # We always have two functions: forward and backward
        # Y  = forward(X)
        # dX = backward(dY)
        # X and dX must match in shape,
        # Y and dY must match in shape as well.
        # 
        # Compute word embeddings, get callback for backprop
        vectors, bp_vectors = self.embeddings.begin_update(doc)
        # Feed the embeddings forward into the convolutional layer.
        tensor, bp_tensor = self.cnn.begin_update(vectors)
        d_tensor = 0.  # accumulate the gradient w.r.t. the shared tensor
        if self.tagger.has_gold(gold):
            # If the example has gold tags, update the tagger.
            # The tagger takes the tensor as input in the forward pass,
            # so it returns d_tensor from the backward pass.
            tags, bp_tags = self.tagger.forward(tensor)
            d_tags = self.tagger.get_loss(tags, gold)
            d_tensor += bp_tags(d_tags)
        if self.dep_parser.has_gold(doc, gold):
            deps, bp_deps = self.dep_parser.forward(tensor)
            d_deps = self.dep_parser.get_loss(deps, gold)
            d_tensor += bp_deps(d_deps)
        if self.entity_recognizer.has_gold(gold):
            ents, bp_ents = self.entity_recognizer.forward(tensor)
            d_ents = self.entity_recognizer.get_loss(ents, gold)
            d_tensor += bp_ents(d_ents)
        bp_tensor(d_tensor)

The same tensor is fed forward into the tagger, parser and entity recognizer, so the same d_tensor gradient is incremented for all three pipeline components. If you make updates with one model but not the others, the CNN and embeddings will learn to produce a different representation. The other models will be confused by this, leading to low accuracy.

Okay. So, that's the problem. We obviously should support updates when some of the gold standard is missing. You shouldn't be blocked from making NER updates just because you don't have gold POS tags. The only way to achieve that is to make the multi-task embedding and CNN layers immutable after training.

Instead of updating the shared weights, we could equivalently add another input representation for each of the pipeline components. The second input would be pipeline-specific, instead of being shared. We could then compute:

    embed1, bp_embed1 = shared_vectors(doc)
    embed2, bp_embed2 = model_vectors(doc)
    embed = embed1 + embed2
    # Later
    bp_embed2(d_embed)
    # Don't pass gradient back to bp_embed1 -- shared_vectors is static

In other words, instead of learning to adjust an input X, we make the input X+X', and then learn X'. This is equivalent, but we don't modify X, so we don't break the other models that use X as an input.
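
In code, the idea is roughly this (a minimal numpy sketch, not the actual spaCy internals; all names and sizes are made up):

    import numpy as np

    n_rows, width = 1000, 64
    E_shared = np.random.randn(n_rows, width) * 0.1   # trained jointly, then frozen
    E_private = np.zeros((n_rows, width))             # starts at zero, so output is unchanged at first

    def forward(ids):
        # The input is X + X': a frozen shared part plus a trainable private part.
        return E_shared[ids] + E_private[ids]

    def backward(ids, d_embed, lr=0.001):
        # Only the private table receives the gradient; the shared table stays
        # fixed, so sibling components that read it are unaffected.
        np.add.at(E_private, ids, -lr * d_embed)

    ids = np.array([3, 17, 3])
    embed = forward(ids)                   # shape (3, 64), same as a normal lookup
    backward(ids, np.ones_like(embed))

Because the private table starts at zero, the combined representation is identical to the shared one until the user starts fine-tuning.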

honnibal commented 7 years ago

I've added an update_tensors flag to Language.update(). This is messy and will probably be removed before the 2.0 release, but for now it can help us figure out how important it is to update the tensors when updating an existing model.

I've implemented a first draft of the model described above for the tagger. As expected, the extra residual connection doesn't interfere during normal training. However, if updates to the CNN cease early, the model starts learning very slowly. I suspect the other channel isn't actually being updated.

honnibal commented 7 years ago

@sebastianruder I'd appreciate your input on this if you have a second :)

To simplify a little, let's say I have an embedding table E_common shared between the tagger and the NER. I want to have:


E_tags = mix_tags[0] * E_common(doc) + mix_tags[1] * E_tags_private(doc)
E_ner  = mix_ner[0]  * E_common(doc) + mix_ner[1]  * E_ner_private(doc)

Should I be calling this a sluice network? And do you think I should make the mixture weights scalars, or will I benefit from making them vectors, so that I can use componentwise weights? Apologies for being a bit lazy here -- I should just spend some more time reading your paper :)
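
To make the question concrete, the two options look roughly like this (a toy numpy sketch; mix_tags and the sizes are just made-up names, not spaCy's API):

    import numpy as np

    n_words, width = 5, 8
    E_common_doc = np.random.randn(n_words, width)        # shared representation of a doc
    E_tags_private_doc = np.random.randn(n_words, width)  # tagger-private representation

    # Option 1: scalar mixture weights -- one weight per input stream.
    mix_tags = np.array([0.7, 0.3])
    E_tags = mix_tags[0] * E_common_doc + mix_tags[1] * E_tags_private_doc

    # Option 2: componentwise mixture weights -- one weight per dimension per stream.
    mix_tags_vec = np.random.rand(2, width)
    E_tags_cw = mix_tags_vec[0] * E_common_doc + mix_tags_vec[1] * E_tags_private_doc

    print(E_tags.shape, E_tags_cw.shape)   # both (5, 8)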

asyd commented 7 years ago

Thank you! I'll test that in a few days!

sebastianruder commented 7 years ago

@honnibal, just to clarify: Is E_common the word embeddings table or the projection matrix for the output labels? Are E_tags_private and E_ner_private the private word embedding tables of the POS and NER models? Assuming you're talking about private word embeddings, I think just having shared embeddings is better in order not to blow up the # of parameters. I would then, though, have private LSTM layers with skip-connections for each model. Using scalars as mixture weights worked well for us. You can call this a sluice network if you want. :)

honnibal commented 7 years ago

@sebastianruder

E_common: shared embedding table
E_tags_private: proposed embedding table that's private to the tagger
E_ner_private: proposed embedding table that's private to the NER

The requirement is that we want shared models during the main training process, on the full corpus. However, users should be able to fine-tune only one component at a time, without the need for supervision on the other components. One way to achieve that is to self-train the other components while making the updates. This does work, especially if you make a copy of the model and supervise based on the original model state. However, the shared/private system does seem attractive.

The embedding tables in spaCy are actually very small. I should really have published the recipe last year, but I haven't had time. The main trick is to use what are now being called "Bloom Embeddings". Basically, you allocate a fairly small number of rows, hash the IDs, and mod them into the table. To reduce the impact of collisions, each ID is hashed four times, and four vectors are summed. This means that almost every word receives a distinct representation, even when the table has very few rows.
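
Roughly, the hashing trick looks like this (a minimal numpy sketch; the table size and the hash are just placeholders for illustration):

    import numpy as np

    n_rows, width, n_hashes = 1000, 64, 4
    table = np.random.randn(n_rows, width) * 0.1

    def bloom_embed(word_id):
        # Hash the ID several times with different seeds, mod each hash into
        # the small table, and sum the rows.
        vec = np.zeros(width)
        for seed in range(n_hashes):
            row = hash((word_id, seed)) % n_rows   # stand-in for a proper hash function
            vec += table[row]
        return vec

    # Even with far more IDs than rows, it's rare for two words to collide in
    # all four slots, so almost every word gets a distinct summed vector.
    print(np.allclose(bloom_embed(123456), bloom_embed(123457)))   # almost certainly False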

The next trick is to construct the vector for each word by embedding multiple lexical features, concatenating the representations, and mixing them with a hidden layer. I've found embedding the lower-case, prefix, suffix and word shape to work well. This complements the hashing trick by further reducing the impact of collisions. If some stupid number collides with an important word, the network can still give distinct representations. The embedding strategy is more intricate than most, but it can still be computed efficiently, because you can cache the representation by lexical item. I do this on a per-batch basis for simplicity.
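
Sketching that feature-mixing step (again illustrative numpy only; the feature extractors, table sizes and layer width are made up):

    import numpy as np

    n_rows, width = 1000, 32
    tables = {name: np.random.randn(n_rows, width) * 0.1
              for name in ("lower", "prefix", "suffix", "shape")}
    W_mix = np.random.randn(4 * width, 128) * 0.1   # hidden layer that mixes the features

    def word_shape(word):
        return "".join("X" if c.isupper() else "x" if c.islower()
                       else "d" if c.isdigit() else c for c in word)

    def embed(word):
        feats = {"lower": word.lower(), "prefix": word[:3],
                 "suffix": word[-3:], "shape": word_shape(word)}
        # One hashed lookup per feature; in practice each of these would be a
        # multi-hash (Bloom) lookup as in the previous sketch.
        parts = [tables[name][hash(feats[name]) % n_rows] for name in tables]
        concat = np.concatenate(parts)           # (4 * width,)
        return np.maximum(0, concat @ W_mix)     # hidden layer with ReLU -> (128,)

    # If the lower-case form collides with another word's, the prefix, suffix
    # and shape features still pull the two representations apart.
    print(embed("Spacy").shape)   # (128,)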

Finally, four convolutional layers are used to draw in surrounding context. This further mitigates the hash collisions, because it's unlikely that we'll get a sequence of words that all have problematic collisions.

I've been using this scheme instead of an LSTM in all my models. It's hard to use an LSTM for spaCy because spaCy is required to process documents of arbitrary length. In any case, for tagging, parsing and NER, it's not clear that we want the word representations to be conditioned on arbitrary context. A 4-word window either side of the target is actually quite enormous. I'm not sure I want to condition on a context wider than that. The training data has around 20k documents, but over 2m words, so we don't want to learn too much that's document-specific.

There's a couple of implementations of this across my examples, but the one in spaCy is here: https://github.com/explosion/spaCy/blob/develop/spacy/_ml.py#L211 . Another implementation is in the tagger example in Thinc: https://github.com/explosion/thinc/blob/master/examples/cnn_tagger.py#L123 . That one's notable because Spanish POS tagging is quite easy, so we can get away with amazingly small embedding tables --- only 200 rows still gets 98%.

So, that's why I'd like to try giving each model a private copy of the whole shared CNN-embed subnetwork. If size is a problem, I'd just scale down the sizes of the private layers. I just want to make sure models have a way to adjust their input representations without affecting their sibling models.

sebastianruder commented 7 years ago

@honnibal Oh, wow! That's pretty cool. I wasn't aware of the intricate details that are part of the spaCy implementation. :)

Are the Bloom Embeddings you mention similar to the Bloom Mapped Word Clusters described here? Do you have another reference for this?

In that case, I think it makes total sense to have a private copy with the same architecture as the shared CNN-embed model. You can have task-specific private ones and a shared one as done here. Alternatively, you can use two private models with mixture weights as we do. Also definitely use skip-connections to allow for supervision at different levels.

honnibal commented 7 years ago

This paper describes Bloom Embeddings: https://arxiv.org/pdf/1706.03993.pdf . I must admit I was a little disappointed someone else published it before I wrote it up :p. It's not that surprising though -- I'm sure a bunch of other people have also done this, since it's fairly obvious to mod the OOV IDs into the table. The extension to use multiple hashes is then straightforward.

If I understand correctly, the Bloom-mapped clusters are the same thing, yes. You can do this for any int-to-vector conversion. Incidentally, I've tried using clusters in the same way in spaCy --- all you have to do is add CLUSTER to the feature extractor and then add another table embedding that value. This didn't improve results when I tried it, and I don't want spaCy 2 to rely on word clusters. spaCy 1 made heavy use of Brown clusters, as they're one of the best ways to do transfer learning for linear models. The problem is that Brown clusters take ages to train, and the only implementation doesn't scale well.

honnibal commented 7 years ago

How do the skip connections work? I don't think I understood that part in your paper.

sebastianruder commented 7 years ago

Thanks for the reference! I'll look more into Bloom Embeddings. :)

Re skip-connections: in our paper, we just feed the outputs of every intermediate layer as input to the softmax layer. We use a scalar to weight each contribution, but you can also just use an unweighted sum. This is very similar to residual connections. If some more parameters are not an issue, you could also concatenate the outputs of all intermediate layers and feed the concatenated representation to the final layer. I've written a brief overview of the difference between these connections here.
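
Roughly, the scalar-weighted version looks like this (a minimal numpy sketch, not our actual implementation; shapes and names are illustrative):

    import numpy as np

    width, n_layers, n_classes = 64, 3, 17
    x = np.random.randn(width)
    Ws = [np.random.randn(width, width) * 0.1 for _ in range(n_layers)]
    W_out = np.random.randn(width, n_classes) * 0.1
    alphas = np.ones(n_layers)         # one learned scalar weight per layer

    hiddens = []
    h = x
    for W in Ws:
        h = np.maximum(0, h @ W)       # intermediate layer output
        hiddens.append(h)

    # Every intermediate output feeds the final softmax layer, weighted by its
    # scalar -- not just the last layer's output.
    mixed = sum(a * h_i for a, h_i in zip(alphas, hiddens))
    logits = mixed @ W_out
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    print(probs.shape)   # (17,)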

honnibal commented 7 years ago

Oh right. That sounds related to the DenseNet idea. It's been interesting to see this develop. One of the early lessons of doing NLP with linear models was "just make everything a feature of the last model". The instinct was always to design some complicated flow, but it was always better to just let the final maxent model weight everything. I'm not really surprised to see the same principle coming through.

I already have residual connections through the convolutional layers, and the skip connections will complicate my implementation a lot. I'll take a guess and say the residual connections are probably enough :)

sebastianruder commented 7 years ago

If you already have those, then that should be fine. :) Yeah, I've occasionally seen as well that it helps to feed the input features, in addition to the processed features, into the final model again.

mikeatm commented 7 years ago

@honnibal I tried the develop branch (https://github.com/explosion/spaCy/commit/b40bc20b121118468022dfbcc24ba10dc68fbaec, the latest at the time of this comment), but I could not train with update_tensors. This is my error:

....
  File "/home/data/experim/spc/spaCy-develop/spacy/language.py", line 523, in <lambda>
    deserializers[proc.name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
  File "spacy/pipeline.pyx", line 125, in spacy.pipeline.BaseThincComponent.from_disk (spacy/pipeline.cpp:10139)
  File "/home/data/experim/spc/spaCy-develop/spacy/util.py", line 485, in from_disk
    reader(path / key)
  File "spacy/pipeline.pyx", line 121, in spacy.pipeline.BaseThincComponent.from_disk.lambda7 (spacy/pipeline.cpp:9661)
  File "/home/data/experim/spaenv/lib/python3.5/site-packages/thinc/neural/_classes/model.py", line 352, in from_bytes
    copy_array(dest, param[b'value'])
  File "/home/data/experim/spaenv/lib/python3.5/site-packages/thinc/neural/util.py", line 48, in copy_array
    dst[:] = src
ValueError: could not broadcast input array from shape (128,2,384) into shape (128,384)

I'm using the example code to train:

    import random

    import en_core_web_sm
    from spacy.gold import GoldParse

    # reformat_train_data and train_data come from the original example script
    nlp = en_core_web_sm.load()
    get_data = lambda: reformat_train_data(nlp.tokenizer, train_data)
    optimizer = nlp.begin_training(get_data)
    for itn in range(100):
        random.shuffle(train_data)
        losses = {}
        for raw_text, entity_offsets in train_data:
            doc = nlp.make_doc(raw_text)
            gold = GoldParse(doc, entities=entity_offsets)
            nlp.update([doc], [gold], drop=0.5, sgd=optimizer, losses=losses,
                       update_tensors=False)

Both nlp = en_core_web_sm.load() and nlp = English(pipeline=['tensorizer', 'ner']) give a similar error. Do you have an example of how you used it?

ines commented 6 years ago

Copying over my comment from #1159 – the problems described here should all be fixed and documented on develop, and will be included in the next release.

Sorry about the messy training examples and docs! I spent the past few days going over all the examples, cleaning them up and adding more documentation.

Here's the new training examples directory: https://github.com/explosion/spaCy/tree/develop/examples/training

The current state only works with the spaCy version on develop – which will be released as soon as the new models are done training. The new docs are already in the website directory on develop, but not live yet, since we want to push the new version first.

(Unless there are serious bugs or problems, the upcoming alpha version will probably also be the version we'll promote to the release candidate 🎉 )
