allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

How to use GloVe in Python code #2694

Closed Xiraaa closed 5 years ago

Xiraaa commented 5 years ago
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM,
                            pretrained_file='glove.840B.300d.txt')

I am new to this... I tried the code above, but the result seemed to be the same as with the version without the 'pretrained_file' param.

kernelmachine commented 5 years ago

Can you ask a more specific question, with some more context around the issue you are seeing?

Xiraaa commented 5 years ago

I just modified a single line in the tutorial code from https://allennlp.org/tutorials by adding the param pretrained_file='glove.840B.300d.txt'. I have already downloaded the file, but it doesn't seem to have any effect. I'd appreciate it a lot if you could give an example of how to use it. That would be helpful.

kernelmachine commented 5 years ago

Can you provide a stacktrace which details the error that you're receiving?

Xiraaa commented 5 years ago
EMBEDDING_DIM = 300
HIDDEN_DIM = 6
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM,
                            pretrained_file='glove.840B.300d.txt')
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
model = LstmTagger(word_embeddings, lstm, vocab)

No error was raised after my modification, but I found no difference in the embeddings with pretrained_file. I print those embeddings in the forward function. I also don't know whether I have to change the indexer, so I just used the original one from the tutorial code.

    def forward(self,
                sentence: Dict[str, torch.Tensor],
                labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:
        mask = get_text_field_mask(sentence)
        embeddings = self.word_embeddings(sentence)
        print(embeddings)  # inspect the embeddings to see whether GloVe was loaded
        encoder_out = self.encoder(embeddings, mask)
        tag_logits = self.hidden2tag(encoder_out)
        output = {"tag_logits": tag_logits}
        if labels is not None:
            self.accuracy(tag_logits, labels, mask)
            output["loss"] = sequence_cross_entropy_with_logits(tag_logits, labels, mask)

        return output

matt-gardner commented 5 years ago

The issue is that we don't support loading a pretrained file from the constructor. It appears the constructor parameter named pretrained_file is undocumented (cc @bryant1410; it looks like your script misses the fact that we put __init__ parameters in the class docstring) - it is only used to keep track of metadata for loading more embeddings at test time. If you want to actually load a pretrained embedding file, you currently need to do that by calling Embedding.from_params() (or Embedding._read_pretrained_embeddings_file() to get the weight, which you then pass to the constructor). We should probably make this easier, and document the constructor parameter.
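
For reference, a minimal sketch of that second route, assuming the allennlp 0.8-era API, where _read_pretrained_embeddings_file is a module-level function in allennlp.modules.token_embedders.embedding rather than a method on Embedding (which is why the Embedding._read_pretrained_embeddings_file() spelling raises an AttributeError, as noted below):

from allennlp.modules.token_embedders.embedding import (
    Embedding,
    _read_pretrained_embeddings_file,
)

# Read the GloVe vectors into a weight matrix aligned with the vocabulary,
# then hand that matrix to the Embedding constructor.
weight = _read_pretrained_embeddings_file('glove.840B.300d.txt',
                                          embedding_dim=EMBEDDING_DIM,
                                          vocab=vocab,
                                          namespace='tokens')
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM,
                            weight=weight)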

Xiraaa commented 5 years ago
EMBEDDING_DIM = 300
HIDDEN_DIM = 6
token_embedding = Embedding.from_params(
    vocab=vocab,
    params=Params({'pretrained_file': 'glove.840B.300d.txt',
                   'embedding_dim': EMBEDDING_DIM})
)

Thanks for your help. It works! Btw, Embedding._read_pretrained_embeddings_file() raises an AttributeError.

NOTE: I (@matt-gardner) modified the code block in here to fix an error, in case future users stumble across this issue.

bryant1410 commented 5 years ago

(cc @bryant1410; it looks like your script misses the fact that we put __init__ parameters in the class docstring)

Yeah, thanks for the heads-up. Somehow PyCharm's feature for showing the docs of a function or class handles that well, but the lint check fails on it.

arianhosseini commented 4 years ago

Hi, I'm running into the following error with this from_params call. I'm using allennlp 1.0.1, and I installed allennlp_models using pip.

embeddings = Embedding.from_params(vocabulary, Params({"embedding_dim": embeddings_dimension,
                                                       "pretrained_file": file_path,
                                                       "vocab_namespace": "tokens",
                                                       "trainable": False}))

Error:

File "coref.py", line 40, in read_embeddings "trainable": False})) File "/data/home/test/cproject/allennlp/allennlp/common/from_params.py", line 533, in from_params "from_params was passed aparamsobject that was not aParams. This probably " allennlp.common.checks.ConfigurationError: from_params was passed aparamsobject that was not aParams. This probably indicates malformed parameters in a configuration file, where something that should have been a dictionary was actually a list, or something else. This happened when constructing an object of type <class 'allennlp.modules.token_embedders.embedding.Embedding'>.

Could you please take a look at this?
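
A likely cause, reading the error message against the allennlp 1.x API, where FromParams.from_params takes the Params object as its first positional argument and everything else as keyword extras: the call above passes vocabulary in the params position. A minimal sketch of the reordered call, under that assumption (untested against 1.0.1):

# Sketch, assuming the allennlp 1.x from_params(params, **extras) signature:
# the Params object goes first, and the vocabulary is passed as a keyword extra.
embeddings = Embedding.from_params(Params({"embedding_dim": embeddings_dimension,
                                           "pretrained_file": file_path,
                                           "vocab_namespace": "tokens",
                                           "trainable": False}),
                                   vocab=vocabulary)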