UKPLab / emnlp2017-bilstm-cnn-crf

BiLSTM-CNN-CRF architecture for sequence tagging
Apache License 2.0

Adding new embeddings to a trained model. #58

Open S4ltedF1sh opened 4 years ago

S4ltedF1sh commented 4 years ago

Hi, I'm currently using this model for sentiment analysis of poems. I trained the model on a set of poems, where each line of a poem is used as a token and has its own embedding in the embedding file. The problem is that after training, I want to use the model on other, unseen poems (their embeddings are not in the embedding file). When I added their embeddings to the embedding file and ran the model, it returned this error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,4] = 4837 is not in [0, 4827) [[Node: word_embeddings/embedding_lookup = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](word_embeddings/embeddings/read, _arg_words_input_0_0, word_embeddings/embedding_lookup/axis)]]

I assume this means that after training, the size of the model's embedding matrix is fixed and no further embeddings can be added. So I want to ask: how can I add new embeddings to the model, or how else can I use the model to predict on unseen poems?
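The error message can be read as a plain bounds check: the trained embedding layer holds 4827 rows, so valid token indices run from 0 to 4826, while the enlarged embedding file produced index 4837. A minimal pure-Python sketch of that check follows. `embedding_lookup` is a hypothetical stand-in written for illustration, not TensorFlow's implementation.

```python
def embedding_lookup(table, indices):
    """Return the rows of `table` selected by `indices`.

    Hypothetical stand-in for an embedding layer: the number of rows is
    fixed by the table, and any index outside [0, len(table)) is rejected.
    """
    n_rows = len(table)
    for idx in indices:
        if not 0 <= idx < n_rows:
            raise IndexError(f"index {idx} is not in [0, {n_rows})")
    return [table[idx] for idx in indices]


# A table with 4827 rows, matching the size reported in the error message.
trained_table = [[0.0, 0.0] for _ in range(4827)]

embedding_lookup(trained_table, [0, 4826])  # both indices are in range

try:
    embedding_lookup(trained_table, [4837])  # index from the error message
except IndexError as err:
    message = str(err)  # "index 4837 is not in [0, 4827)"
```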

S4ltedF1sh commented 4 years ago

This is an image of the full error: https://imgur.com/a/4v8VoSJ

nreimers commented 4 years ago

Hi @S4ltedF1sh, this is not quite straightforward.

The model is loaded here: https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/blob/b709f580f11c33c0f9951a0afdc3e71a252c93fd/neuralnets/BiLSTM.py#L611

You need to call this function on your BiLSTM model:

bilstm.setMappings(new_mappings, new_embeddings)

It is important to call it before the method buildModel is invoked.
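As a sketch of how the two arguments might be prepared: the helper below appends vectors for unseen tokens to an existing token-to-index mapping and embedding table, so that every new index has a matching row. `extend_embeddings` and all variable names are hypothetical, and the table is a plain list of lists for simplicity. The exact structures that `setMappings` expects (for example a NumPy matrix and a mappings dictionary that contains the token mapping) are assumptions to check against BiLSTM.py.

```python
def extend_embeddings(word2idx, embeddings, new_vectors):
    """Return copies of the mapping and table extended with unseen tokens.

    word2idx    -- dict mapping token -> row index in `embeddings`
    embeddings  -- list of equal-length vectors (one row per token)
    new_vectors -- dict mapping token -> vector for the tokens to add
    """
    dim = len(embeddings[0])
    new_word2idx = dict(word2idx)
    new_embeddings = [list(row) for row in embeddings]
    for token, vector in new_vectors.items():
        if token in new_word2idx:
            continue  # keep the row the model was trained with
        if len(vector) != dim:
            raise ValueError(f"{token!r}: expected {dim} values, got {len(vector)}")
        new_word2idx[token] = len(new_embeddings)  # next free row index
        new_embeddings.append(list(vector))
    return new_word2idx, new_embeddings


# Toy data: two known tokens, one unseen token, one already-known token.
word2idx = {"PAD": 0, "line_a": 1}
embeddings = [[0.0, 0.0], [0.1, 0.2]]
new_word2idx, new_embeddings = extend_embeddings(
    word2idx, embeddings, {"line_b": [0.3, 0.4], "line_a": [9.0, 9.0]}
)
```

The helper leaves the original mapping and table untouched and skips tokens that already have a row, so indices seen during training keep pointing at the same vectors.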

Best Nils Reimers

S4ltedF1sh commented 4 years ago

Hi @nreimers, many thanks for the quick answer. However, I'm not really sure what you mean by:

It is important to call it before the method buildModel is invoked.

If I understand correctly, the buildModel method is only called once before training starts, and isn't invoked when loading a trained model. So where should I call the setMappings method when I load my trained model? Or is it only possible to add more embeddings before training? I checked the code, and there is a cap on the maximum number of features, which I assume corresponds to the index of the token in the embedding file (line 105, BiLSTM.py, buildModel function):

tokens = Embedding(input_dim=self.embeddings.shape[0], output_dim=self.embeddings.shape[1], weights=[self.embeddings], trainable=False, name='word_embeddings')(tokens_input)

So because of input_dim=self.embeddings.shape[0], I think it's capped at the current size of the embedding file and you can't add any more embeddings after training. Is that right?

Many thanks in advance, Minh Vu Pham

nreimers commented 4 years ago

Hi @S4ltedF1sh, the quoted line creates a Keras embedding layer with the size of your NumPy self.embeddings matrix. If you add new embeddings to self.embeddings, Keras will also use them in the embedding layer.

However, it is important that you add these new embeddings before tokens = Embedding(...) is invoked.

The buildModel method is invoked when training or inference is started.
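The ordering constraint can be illustrated with a small stand-in class. `TinyTagger` is hypothetical and only mimics the relevant behaviour: the embedding table is copied when the model is built, so rows supplied afterwards are not visible to the lookup. It is a sketch of the idea, not the repository's BiLSTM class.

```python
class TinyTagger:
    """Hypothetical stand-in: the table size is frozen when the model is built."""

    def __init__(self):
        self.mappings = None
        self.embeddings = None
        self._layer = None  # plays the role of the Keras embedding layer

    def setMappings(self, mappings, embeddings):
        self.mappings = mappings
        self.embeddings = embeddings

    def buildModel(self):
        # Copy the table as it is right now; later changes are not seen.
        self._layer = [list(row) for row in self.embeddings]

    def embed(self, token):
        idx = self.mappings[token]
        if idx >= len(self._layer):
            raise IndexError(f"index {idx} is not in [0, {len(self._layer)})")
        return self._layer[idx]


old_map, old_emb = {"a": 0}, [[0.1, 0.2]]
new_map, new_emb = {"a": 0, "b": 1}, [[0.1, 0.2], [0.3, 0.4]]

# Correct order: the extended table is set before the model is built.
good = TinyTagger()
good.setMappings(new_map, new_emb)
good.buildModel()
vector_b = good.embed("b")  # [0.3, 0.4]

# Wrong order: the layer was already built from the old table.
late = TinyTagger()
late.setMappings(old_map, old_emb)
late.buildModel()
late.setMappings(new_map, new_emb)
try:
    late.embed("b")
    failed = False
except IndexError:
    failed = True  # index 1 is outside the one-row layer
```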

Best Nils Reimers