First of all, I want to apologize if this question is not suitable for the issue tracker (I could not find any other place to ask). I am trying to incorporate POS-tag and NER-type features (mainly one-hot encoded vectors) into the word and character embeddings, and I wanted to ask: what is the best way to do this with this library? Thank you in advance!
Using POS tags is already easy, but NER tags weren't available for this yet. They were easy to add, so I just submitted #430, which adds them. That PR also modifies one of the test configuration files to show how to use POS tag and NER embeddings as additional components of the word representations. Look at the modifications to tests/fixtures/encoder_decoder/simple_seq2seq/experiment.json in that PR, and make similar modifications to your configuration file. Let me know if you have any questions.
Specifically, what that configuration does is embed the POS tags, dependency labels, and NER tags for each token (using a separate embedding matrix for each, with configurable dimension), then concatenate those with the word embeddings (there are no character embeddings in that one, but other configurations show you how to add them). This is almost certainly what you should be doing to get POS tag information into your model. If for some reason you want to do it another way, I can try to help you figure out how.
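To give a rough idea of the shape, here is an illustrative sketch from memory, not a copy of the fixture file: the embedding dimensions are arbitrary, and each embedding's `vocab_namespace` has to match whatever namespace its indexer actually writes to, so double-check those against the PR.

```json
{
  "dataset_reader": {
    "type": "seq2seq",
    "source_tokenizer": {
      "type": "word",
      "word_splitter": {"type": "spacy", "pos_tags": true, "ner": true}
    },
    "source_token_indexers": {
      "tokens": {"type": "single_id"},
      "pos_tags": {"type": "pos_tag"},
      "ner_tags": {"type": "ner_tag"}
    }
  },
  "model": {
    "type": "simple_seq2seq",
    "source_embedder": {
      "tokens": {"type": "embedding", "embedding_dim": 25},
      "pos_tags": {"type": "embedding", "embedding_dim": 5, "vocab_namespace": "pos_tokens"},
      "ner_tags": {"type": "embedding", "embedding_dim": 5, "vocab_namespace": "ner_tokens"}
    }
  }
}
```

The keys under `source_token_indexers` and `source_embedder` have to line up, because that's how each indexed representation finds its embedder; dependency labels follow the same pattern with a `dep_label` indexer.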
Thank you very much, I will take a look at your PR.
I am still curious how it is possible to incorporate arbitrary features (e.g. a binary flag for whether a token is capitalized, token frequency, etc.). As I understand it, TextFieldEmbedders operate on lists of indices, but it would be nice to map a Token to a feature vector directly. Is there a way to do that?
For this, there are two options.
The quickest option is to do something like what the SRL model does to incorporate the verb indicator. The DatasetReader adds a SequenceLabelField with binary values indicating whether each token is the verb to find arguments for: https://github.com/allenai/allennlp/blob/ed37843d528b0902d73e4d5dde931812fd65835d/allennlp/data/dataset_readers/semantic_role_labeling.py#L277
The model then takes that field, embeds it, and manually concatenates it with the word embeddings: https://github.com/allenai/allennlp/blob/ed37843d528b0902d73e4d5dde931812fd65835d/allennlp/models/semantic_role_labeler.py#L119-L122
You could do this for each binary feature you want, just skipping the embedding step, since the feature is already a number.
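To make that concrete, here's a rough sketch of both pieces for a single made-up binary feature (the `capitalization` field name and the method bodies here are illustrative, not taken from the SRL code):

```python
from typing import Dict, List

import torch

from allennlp.data import Instance
from allennlp.data.fields import SequenceLabelField, TextField
from allennlp.data.tokenizers import Token


# In your DatasetReader: add a 0/1 sequence aligned with the tokens.
def text_to_instance(self, tokens: List[Token]) -> Instance:
    text_field = TextField(tokens, self._token_indexers)
    # Hypothetical binary feature: is the token capitalized?
    flags = [1 if token.text[0].isupper() else 0 for token in tokens]
    return Instance({"tokens": text_field,
                     "capitalization": SequenceLabelField(flags, text_field)})


# In your Model: concatenate the raw flags onto the word embeddings.
# For a binary feature you can skip the Embedding the SRL model uses
# and append the 0/1 value directly.
def forward(self,
            tokens: Dict[str, torch.LongTensor],
            capitalization: torch.LongTensor) -> Dict[str, torch.Tensor]:
    embedded_text = self.text_field_embedder(tokens)
    cap_feature = capitalization.float().unsqueeze(-1)
    encoder_input = torch.cat([embedded_text, cap_feature], dim=-1)
    ...
```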
If you have a lot of binary features you want to add, though, this could get annoying. So the second option is to write your own TokenIndexer that converts word tokens into feature vectors using whatever feature extractors you want, and a TokenEmbedder that just passes the given feature vector through; there's a sketch of what I mean below. If you want to go this route, I'm happy to look at a work-in-progress PR to let you know if you're doing it the right way.
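Roughly like this. The class and registration names are made up, the feature extractors are just examples, and I've left out the padding-related TokenIndexer methods, which you'd also need; the exact indexer method signatures depend on your AllenNLP version, so treat this as a sketch of the idea rather than working code:

```python
from typing import Dict, List

import torch

from allennlp.data.token_indexers import TokenIndexer
from allennlp.data.tokenizers import Token
from allennlp.data.vocabulary import Vocabulary
from allennlp.modules.token_embedders import TokenEmbedder


@TokenIndexer.register("token_features")
class TokenFeatureIndexer(TokenIndexer):
    """Hypothetical indexer: maps each token straight to a hand-crafted
    feature vector instead of a vocabulary index."""

    def count_vocab_items(self, token: Token, counter) -> None:
        pass  # hand-crafted features need no vocabulary

    def tokens_to_indices(self,
                          tokens: List[Token],
                          vocabulary: Vocabulary,
                          index_name: str) -> Dict[str, List[List[float]]]:
        # Example extractors: capitalization flag and token length.
        return {index_name: [[float(token.text[0].isupper()),
                              float(len(token.text))]
                             for token in tokens]}


@TokenEmbedder.register("pass_through")
class PassThroughEmbedder(TokenEmbedder):
    """Hypothetical embedder: hands the feature vectors through unchanged,
    so the TextFieldEmbedder can concatenate them with word embeddings."""

    def __init__(self, feature_dim: int) -> None:
        super().__init__()
        self._feature_dim = feature_dim

    def get_output_dim(self) -> int:
        return self._feature_dim

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return features.float()
```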
Great, thank you for the very detailed answer. I will try to make a PR for the second option.
I managed to create a combination of a token indexer that does arbitrary feature extraction and a pass-through embedder; I will open a PR a bit later, after I polish the code.
One more thing I am interested in doing is extracting features for a particular token given its context as well, say the whole sequence (in the simplest case, for instance, the part-of-speech tag of the next token). I am not sure whether this is possible given the current design of the data processing pipeline, but maybe there is a way?
It'd be a bit of a hack, but you could just save a reference to the whole sentence inside each Token object, then use that info in your TokenIndexer. It wouldn't be that bad to add a context field to Token and let callers populate it however they want. We'll have to think about whether it's worth adding that to the official API, but the nice thing about Python is that you can do it yourself with duck typing, even if we don't add it. In case that's not clear, here's a quick example of what I mean.
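This is an untested sketch: `context` is just an attribute we set ourselves, not part of the official Token API (so it assumes your version of Token allows setting new attributes), and `tag_` is only populated if you tokenize with spaCy POS tagging enabled:

```python
from allennlp.data.tokenizers import Token, WordTokenizer
from allennlp.data.tokenizers.word_splitter import SpacyWordSplitter

tokenizer = WordTokenizer(word_splitter=SpacyWordSplitter(pos_tags=True))
tokens = tokenizer.tokenize("The cat sat on the mat.")

# Duck typing: attach the whole sequence to each Token.  `context` is
# not an official Token field; we're just setting an attribute.
for token in tokens:
    token.context = tokens

# Then, e.g. inside a custom TokenIndexer, features can look at neighbors:
def next_pos_tag(token: Token, index: int) -> str:
    sentence = token.context
    if index + 1 < len(sentence):
        return sentence[index + 1].tag_  # POS tag of the following token
    return "@@END@@"
```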
@matt-gardner Thanks for the advice - adding the reference worked very well :)
I'm closing this as I don't see any clear follow-on. Feel free to re-open and specify what follow-on you are looking for, if needed!
We addressed the POS/NER feature embeddings, and there's an open PR to address the arbitrary features. It needs a little bit of work, but I'm planning on fixing it up soon, because we'll probably need it for re-implementing the WikiTables parser. We're pretty close to having everything else done, so it should be pretty soon that we finally finish that PR. No strong opinion on keeping this issue open or closed.