allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Multiple indexers for one text_field_embedder #1088

Closed ajfisch closed 6 years ago

ajfisch commented 6 years ago

Hi,

Thanks for the very cool work! It looks like a text_field_embedder can only take one input: the batched tensor from the corresponding text field indexer.

If I want to write a text field embedder that takes multiple inputs, say token ids and character ids, can I do that?

matt-gardner commented 6 years ago

This is in fact exactly what the TokenIndexers and TextFieldEmbedders were designed to do. You can see how to configure this here:

https://github.com/allenai/allennlp/blob/1655f229e8ce5ab2b705549435ab06e1541394f8/training_config/bidaf.json#L4-L17

and here:

https://github.com/allenai/allennlp/blob/1655f229e8ce5ab2b705549435ab06e1541394f8/training_config/bidaf.json#L23-L44
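In case you don't want to follow the links, the relevant pieces look roughly like this (rendered here as a trimmed Python dict; the numbers are illustrative and the exact values are in the linked bidaf.json). The key point is that each entry under token_indexers has a matching entry under text_field_embedder:

config_sketch = {
    "token_indexers": {
        "tokens": {"type": "single_id", "lowercase_tokens": True},
        "token_characters": {"type": "characters"},
    },
    "text_field_embedder": {
        "tokens": {"type": "embedding", "embedding_dim": 100, "trainable": False},
        "token_characters": {
            "type": "character_encoding",
            "embedding": {"embedding_dim": 16},
            "encoder": {"type": "cnn", "embedding_dim": 16, "num_filters": 100},
        },
    },
}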

ajfisch commented 6 years ago

Thanks for the pointer!

Correct me if I'm wrong, but from the code it seems like the embedding sees just one set of inputs (tokens), as does the character_encoding (token_characters). They independently compute 100d token and character embeddings, which are concatenated and fed to the 200d phrase LSTM in this case.

Using the concatenated embedding and character_encoding downstream is super clear from the tutorial -- I was wondering if you could write a generic text_field_embedder type that takes in both character and token ids to produce embeddings. For example, a generic embedder architecture like ELMo that takes in more than just character information.

It seems like the answer is no, at least when using the BasicTextFieldEmbedder forward, which only takes in one tensor (and it doesn't seem like that tensor can be a tuple).

matt-gardner commented 6 years ago

I don't understand what you're asking. The current code for BiDAF uses a single TextFieldEmbedder to compute word representations that are a concatenation of word embeddings and character-level encodings. The code that does this is just:

https://github.com/allenai/allennlp/blob/1655f229e8ce5ab2b705549435ab06e1541394f8/allennlp/models/reading_comprehension/bidaf.py#L174-L175

This does as you say; it "takes in both character and token ids to produce embeddings". The input to TextFieldEmbedder is a dictionary, not a tensor, and it embeds all of the tensors in the dictionary. If you can be a little more specific on exactly what you're trying to do, maybe I could help you better.
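To make that concrete, here is a toy sketch of building a BasicTextFieldEmbedder by hand and calling it with one tensor per indexer (the sizes are made up, and exact constructor signatures may differ a bit between versions):

import torch
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding, TokenCharactersEncoder
from allennlp.modules.seq2vec_encoders import CnnEncoder

# One TokenEmbedder per TokenIndexer: word ids get a plain embedding,
# character ids get an embedding followed by a CNN over the characters.
word_embedding = Embedding(num_embeddings=1000, embedding_dim=100)
char_embedding = Embedding(num_embeddings=50, embedding_dim=16)
char_encoder = TokenCharactersEncoder(
    char_embedding, CnnEncoder(embedding_dim=16, num_filters=100, ngram_filter_sizes=(3,)))

embedder = BasicTextFieldEmbedder({"tokens": word_embedding,
                                   "token_characters": char_encoder})

# The TextField hands the model a dictionary with one tensor per indexer.
text_field_input = {
    "tokens": torch.randint(0, 1000, (2, 7)),             # (batch, num_tokens)
    "token_characters": torch.randint(0, 50, (2, 7, 10)),  # (batch, num_tokens, num_chars)
}

embedded = embedder(text_field_input)  # (2, 7, 200): word dim + char dim, concatenated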

ajfisch commented 6 years ago

Sorry for not being clear. For example, I would like to move the highway layer computation into the _text_field_embedder, and configure it so that I could use it as a text_field_embedder in other models without modifying code.

I think that all I would have to do is slightly extend the BasicTextFieldEmbedder to do some more work before returning (here). I'm just making sure I'm not missing something that's already there.

Thanks for all the help!

matt-gardner commented 6 years ago

What do you want the highway layer to apply to? The CNN? You can use a different encoder inside of the TokenCharactersEmbedder, which might get you what you want. Otherwise I'm not sure what you mean by "move the highway layer into the text field embedder". I'm pretty sure what we have already does what you want. If you give me specific equations, or desired code, or just something more precise, I can tell you how to accomplish what you're looking for.
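For what it's worth, swapping in a different character-level encoder is a config-only change; only the "encoder" block under "token_characters" needs to change, e.g. to a GRU (sizes here are hypothetical):

token_characters_config = {
    "type": "character_encoding",
    "embedding": {"embedding_dim": 16},
    # any registered Seq2VecEncoder works here, e.g. a GRU instead of the CNN
    "encoder": {"type": "gru", "input_size": 16, "hidden_size": 100},
}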

ajfisch commented 6 years ago

For example (forgetting about the highway layer for simplicity), I would like to apply a linear layer on top of the concatenated representation.

Specifically, I want to embed a sequence of words {x_1, ..., x_N} as {e_1, ..., e_N}, where e_i = ReLU(W * h_i + b) with h_i = [GloVe(x_i); CharCNN(x_i)].

In this case the output {e_1, ..., e_N} of the embedder should be a function of both the full token ids and the sub-word character ids, rather than just a concatenation like [GloVe(x_i); CharCNN(x_i)], which is what the BiDAF embedder returns.

Code-wise, I'm wondering if this is possible to do without writing:

# embed and concatenate the per-indexer inputs, e.g. [GloVe; CharCNN]
concatenated_representation = self._text_field_embedder(my_text_input)
# extra transformation on top, e.g. ReLU(W * h + b)
combined = self._my_module(concatenated_representation)

for every downstream model that I would like to use this specific embedder for.

ajfisch commented 6 years ago

I think I just got confused between the specific implementation of the BasicTextFieldEmbedder vs the general abstract TextFieldEmbedder. The latter certainly seems flexible enough for me to use.

matt-gardner commented 6 years ago

Ok, thanks for the detail. The most straightforward thing to do is to just have two lines of code in your model, as you suggest (and as we do with BiDAF). But, yeah, if you really want to remove the additional line of code (or have the specific transformation be more configurable from a JSON file), you could write your own TextFieldEmbedder that does what BasicTextFieldEmbedder does, then adds whatever transformation you want on top of the concatenated representations. You then register that TextFieldEmbedder, and you can use it with a single line of code in your model. I'm closing this issue now; if you have more questions, feel free to re-open it.
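For concreteness, such an embedder might look roughly like this. This is only a sketch against the current API: the name "projecting" and the ReLU-plus-linear transformation are placeholders for whatever you want on top, and depending on the version you may also need a from_params classmethod so it can be built from a JSON config:

from typing import Dict

import torch
from allennlp.modules.text_field_embedders import TextFieldEmbedder, BasicTextFieldEmbedder
from allennlp.modules.token_embedders import TokenEmbedder


@TextFieldEmbedder.register("projecting")
class ProjectingTextFieldEmbedder(TextFieldEmbedder):
    """Concatenates the per-indexer embeddings, then applies ReLU(W * h + b) on top."""

    def __init__(self, token_embedders: Dict[str, TokenEmbedder], output_dim: int) -> None:
        super().__init__()
        # Reuse BasicTextFieldEmbedder for the embed-and-concatenate step.
        self._basic_embedder = BasicTextFieldEmbedder(token_embedders)
        self._projection = torch.nn.Linear(self._basic_embedder.get_output_dim(), output_dim)
        self._output_dim = output_dim

    def get_output_dim(self) -> int:
        return self._output_dim

    def forward(self, text_field_input: Dict[str, torch.Tensor]) -> torch.Tensor:
        # (batch, num_tokens, word_dim + char_dim) -> (batch, num_tokens, output_dim)
        concatenated = self._basic_embedder(text_field_input)
        return torch.nn.functional.relu(self._projection(concatenated))

With something like that registered, the model goes back to a single line: embedded = self._text_field_embedder(my_text_input).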

ajfisch commented 6 years ago

Thanks for the help @matt-gardner! Appreciate it. Good stuff!