allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Embedding POS/NER/arbitrary features #428

Closed EgorLakomkin closed 6 years ago

EgorLakomkin commented 7 years ago

First of all, I want to apologize if this question is not suitable for an issue tracker (but I could not find any other place).

I am trying to incorporate POS-tag and NER-type features (one-hot encoded vectors mainly) to the word&character embeddings and I wanted to ask what is the best way to do this using this library?

Thank you in advance!

matt-gardner commented 7 years ago

Using POS tags is currently easy, but NER tags aren't currently available for this. They were easy to add, so I just submitted #430, which adds them. That PR also modifies one of the test configuration files to show how to use POS tag and NER embeddings as additional components of the word representations. Just look at the modifications to tests/fixtures/encoder_decoder/simple_seq2seq/experiment.json in that PR, and make similar modifications to your configuration file. Let me know if you have any questions.

Specifically, what that configuration does is embed the POS tags, dependency labels, and NER tags for each token (using separate embedding matrices for each one, with configurable dimension), then concatenate those with the word embeddings (no character embeddings in that one, but other configurations show you how to do that). This is almost certainly what you should be doing to get POS tag information into your model. If you have another way you want to do it, for some reason, I can try to help you figure out how to do it.
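As a rough illustration of the shape such a configuration takes (the exact keys and indexer type names below are a sketch from memory, not copied from the fixture file in #430 — check that file for the real thing):

```json
{
  "dataset_reader": {
    "token_indexers": {
      "tokens": {"type": "single_id"},
      "pos_tags": {"type": "pos_tag"},
      "ner_tags": {"type": "ner_tag"}
    }
  },
  "model": {
    "text_field_embedder": {
      "tokens": {"type": "embedding", "embedding_dim": 100},
      "pos_tags": {"type": "embedding", "embedding_dim": 10},
      "ner_tags": {"type": "embedding", "embedding_dim": 10}
    }
  }
}
```

Each entry under `token_indexers` produces a separate index sequence, each entry under `text_field_embedder` embeds one of those sequences, and the resulting vectors are concatenated per token.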

EgorLakomkin commented 7 years ago

Thank you very much, I will take a look at your PR.

I am still curious how it is possible to incorporate arbitrary features (e.g. a binary flag for whether a token is capitalized, token frequency, etc.). As I understand it, TextFieldEmbedders operate on lists of indices, but it would be nice to map Token -> feature vector directly. Is there a way to do that?

matt-gardner commented 7 years ago

For this, there are two options.

The quickest option is to do something like what the SRL model does to incorporate the verb indicator. The DatasetReader adds a SequenceLabelField with binary values indicating whether each token is the verb to find arguments for: https://github.com/allenai/allennlp/blob/ed37843d528b0902d73e4d5dde931812fd65835d/allennlp/data/dataset_readers/semantic_role_labeling.py#L277

The model then takes that field, embeds it, and concatenates it manually to the word embedding: https://github.com/allenai/allennlp/blob/ed37843d528b0902d73e4d5dde931812fd65835d/allennlp/models/semantic_role_labeler.py#L119-L122

You could do this for each binary feature you want; since the feature is already a 0/1 value, you can skip the embedding step and concatenate the raw indicator directly.
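The concatenation step itself is simple. As a framework-agnostic sketch (plain Python lists standing in for tensors, with a hypothetical binary verb indicator like the SRL model's):

```python
def concat_indicator(word_embeddings, indicator):
    """Append a binary indicator feature to each token's embedding vector.

    word_embeddings: list of per-token embedding vectors (lists of floats)
    indicator: list of 0/1 flags, one per token
    """
    return [vec + [float(flag)] for vec, flag in zip(word_embeddings, indicator)]

# Two tokens with 3-dim embeddings; the second token is the verb.
embedded = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
verb_indicator = [0, 1]
print(concat_indicator(embedded, verb_indicator))
# -> [[0.1, 0.2, 0.3, 0.0], [0.4, 0.5, 0.6, 1.0]]
```

In the real model this is a `torch.cat` along the feature dimension, but the idea is the same.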

If you have a lot of binary features you want to add, though, this could get annoying. So the second option is to write your own TokenIndexer that converts word tokens into feature vectors using whatever feature extractors you want to use, and a TokenEmbedder that just passes through the given feature vector. If you want to go this route, I'm happy to look at a work-in-progress PR to let you know if you're doing it the right way.
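Independent of the exact TokenIndexer/TokenEmbedder interfaces, the division of labor might look like this (a plain-Python sketch; the names and feature choices are illustrative, not the library's API):

```python
def token_features(token):
    """Hypothetical feature extractor: map a token string to a fixed-size
    feature vector instead of a vocabulary index."""
    return [
        1.0 if token[:1].isupper() else 0.0,  # capitalized?
        1.0 if token.isdigit() else 0.0,      # all digits?
        float(len(token)),                    # token length
    ]

class PassThroughEmbedder:
    """Stands in for a TokenEmbedder that returns the precomputed feature
    vectors unchanged, so they can be concatenated with word embeddings."""
    def __call__(self, feature_vectors):
        return feature_vectors

# The "indexer" runs at data-processing time, the "embedder" at model time.
indexer_output = [token_features(t) for t in ["Berlin", "is", "123"]]
embedder = PassThroughEmbedder()
print(embedder(indexer_output))
# -> [[1.0, 0.0, 6.0], [0.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
```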

EgorLakomkin commented 7 years ago

Great, thank you for the very detailed answer. I will try to make a PR for the second option.

EgorLakomkin commented 7 years ago

I managed to create a combination of a token indexer that does arbitrary feature extraction and a pass-through embedder. I will open a PR a bit later, after I polish the code.

One more thing I am interested in doing is extracting features for a particular token given its context as well - let's say the whole sequence (in the simplest case, for instance, the part-of-speech tag of the next token). I am not sure whether this is possible given the current design of the data-processing pipeline, but maybe there is a way?

matt-gardner commented 7 years ago

It'd be a bit of a hack, but you could just save a reference to the whole sentence inside each Token object, then use that info in your TokenIndexer. It wouldn't be that bad to add a context field to Token, and let callers populate it however they want. We'll have to think about whether it's worth adding that to the official API, but the nice thing about Python is you can do it yourself with duck typing, even if we don't add it. If this isn't clear enough, I can give an example of what I mean.
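A minimal illustration of the duck-typing idea (the `Token` class and the `context` attribute here are hypothetical stand-ins, not AllenNLP's own API):

```python
class Token:
    """Simplified stand-in for a token with a surface form and a POS tag."""
    def __init__(self, text, pos):
        self.text = text
        self.pos = pos

def attach_context(tokens):
    # Duck typing: set new attributes on each instance so that a
    # downstream feature extractor can see the whole sentence.
    for i, token in enumerate(tokens):
        token.context = tokens
        token.index = i
    return tokens

def next_pos_feature(token):
    # A context-dependent feature: the POS tag of the following token,
    # or "<END>" for the last token in the sentence.
    nxt = token.index + 1
    return token.context[nxt].pos if nxt < len(token.context) else "<END>"

sentence = attach_context([Token("dogs", "NOUN"), Token("bark", "VERB")])
print([next_pos_feature(t) for t in sentence])
# -> ['VERB', '<END>']
```

A TokenIndexer could then call something like `next_pos_feature` when building each token's feature vector.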

EgorLakomkin commented 7 years ago

@matt-gardner Thanks for the advice - adding the reference worked very well :)

schmmd commented 6 years ago

I'm closing this as I don't see any clear follow-on. Feel free to re-open and specify what follow-on you are looking for, if needed!

matt-gardner commented 6 years ago

We addressed the POS/NER feature embeddings, and there's an open PR to address the arbitrary features. It needs a little work, but I'm planning on fixing it up soon, because we'll probably need it for re-implementing the WikiTables parser. We're close to having everything else done, so we should finally finish that PR soon. No strong opinion on keeping this issue open or closed.