allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

How do I set the constituency parser tokenizer for predict #1333

Closed MQ-MJ closed 6 years ago

MQ-MJ commented 6 years ago

Hi,

The constituency parser trains on tree input (which is implicitly tokenised), but when called with predict it runs its own tokeniser. I'd like to be able to control the run-time tokenisation; for example, I'd like to use a run-time tokeniser that splits on white space. Any chance of an example of how to do this?

I can see that the tokenisers are defined in allennlp/allennlp/data/tokenizers/word_splitter.py, and I'd like to use the allennlp.data.tokenizers.word_splitter.JustSpacesWordSplitter tokeniser.

I can also see that the tutorial allennlp/tutorials/getting_started/using_as_a_library_pt2.md basically describes how to call a custom tokeniser.

I'm also confused that the SpacyWordSplitter, which is set as the tokeniser in ConstituencyParserPredictor (in allennlp/allennlp/predictors/constituency_parser.py), seems to return POS tags as well as words.

Any advice?

Thanks,

Mark

matt-gardner commented 6 years ago

The tokenizer used by the constituency parser predictor is defined here: https://github.com/allenai/allennlp/blob/4326b59b18c5f3f1eb29b1b3dfe330660d292d39/allennlp/predictors/constituency_parser.py#L67

To switch that tokenizer, you have a couple of options:

  1. If you're running the predictor in code (not from the commandline), you can just overwrite the tokenizer that's there. Something like predictor._tokenizer = JustSpacesWordSplitter() after you create the predictor.

  2. If you're running from the commandline, you need to write your own predictor, switching the tokenizer. This is pretty simple - just copy the code for the existing predictor, change the name that's registered, and change the tokenizer (and be sure to pass the --include-package flag to the allennlp command). You also don't need all of the stuff dealing with hierplane and styles, as that's getting the output ready to display in the demo.
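For option 1, the key idea is just replacing the tokenizer attribute after construction. A minimal self-contained sketch of that pattern (using stand-in classes rather than the real allennlp ones; DemoPredictor, SpacyLikeSplitter, and JustSpacesSplitter are illustrative names only):

```python
# Illustrative stand-ins for the real allennlp classes; only the
# overwrite-after-construction pattern matters here.
class SpacyLikeSplitter:
    def split_words(self, sentence):
        # Pretend this does full spacy-style tokenization,
        # e.g. splitting punctuation off words.
        return sentence.replace(",", " ,").split()

class JustSpacesSplitter:
    def split_words(self, sentence):
        # Whitespace-only tokenization, like JustSpacesWordSplitter.
        return sentence.split()

class DemoPredictor:
    def __init__(self):
        # The predictor hard-codes its default tokenizer in __init__.
        self._tokenizer = SpacyLikeSplitter()

    def tokenize(self, sentence):
        return self._tokenizer.split_words(sentence)

predictor = DemoPredictor()
predictor._tokenizer = JustSpacesSplitter()  # overwrite after creating it
print(predictor.tokenize("Hello, world"))
```

With the real predictor the shape is the same: construct it, then assign `predictor._tokenizer = JustSpacesWordSplitter()` before calling predict.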

A slightly more satisfying solution would be to make our predictor configurable, either by taking the tokenizer as a constructor parameter, so you can configure it in code without overwriting anything, or by adding some kind of from_params method for predictors. We don't have any mechanism for the latter at this point; it might be helpful, but might also be more work (and more complex) than it's worth.

matt-gardner commented 6 years ago

Oh, and the tokenizer returns a list of Tokens, which can have additional metadata associated with them. We use spacy by default, which has character offsets, and potentially other NLP annotations, like POS tags, lemmatization, and NER output. These are predicted by spacy.

joelgrus commented 6 years ago

I didn't weigh in, because I don't know a lot about this model, but it looks like our trained model requires POS tags, so if you use a different tokenizer you'd still have to get them from somewhere, is that right?

matt-gardner commented 6 years ago

Yep, @joelgrus, you're right. I missed that. https://github.com/allenai/allennlp/blob/b6b8955de74b02fbdd67ebe1444cdf86acaf30fd/allennlp/models/constituency_parser.py#L186-L191

The trained model uses POS tag embeddings, so if you don't pass them, the code will crash. You can still do this with a JustSpacesWordSplitter, you just need to read in the pos tags separately and pass them in to the model. So, it looks like you're going to have to write your own Predictor for this. Or you could write your own WordSplitter that reads sentences like "Cats/NNS eat/VB mice/NNS ./." and returns Tokens that have the tag_ field set, then you can overwrite the tokenizer on the existing predictor, like I said above. If you do write your own word splitter like this, creating a PR to add it would be great.
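A sketch of that word-splitter idea, parsing "word/TAG" pairs. Token here is a stand-in NamedTuple with only the fields needed for the example (the real allennlp Token carries more metadata; the tag_ name mirrors its field):

```python
from typing import List, NamedTuple

# Minimal stand-in for allennlp's Token; only text and tag_ are modeled.
class Token(NamedTuple):
    text: str
    tag_: str

def split_tagged_words(sentence: str) -> List[Token]:
    """Split 'Cats/NNS eat/VB mice/NNS ./.' into Tokens with tag_ set."""
    tokens = []
    for chunk in sentence.split():
        # rpartition splits on the LAST slash, so a token like './.'
        # correctly yields word '.' and tag '.'.
        word, _, tag = chunk.rpartition("/")
        tokens.append(Token(text=word, tag_=tag))
    return tokens

print(split_tagged_words("Cats/NNS eat/VB mice/NNS ./."))
```

A real WordSplitter subclass would wrap this logic in a split_words method and register itself so it can be named in a config file.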

DeNeutoy commented 6 years ago

Hi, yes, the constituency parser in the demo does require POS tags. It turns out that you don't actually need them (because of ELMo), but I haven't yet retrained a model that doesn't use them.

MQ-MJ commented 6 years ago

I figured out how to do this from the tutorials last night, and it's pleasantly easy.

I'd be pleased to try to do this, but I'm not sure how to make the code fit the AllenNLP requirements (I can probably figure out the Python typing, but I'll need help on where/how to add this to the AllenNLP codebase).

Thanks,

Mark


matt-gardner commented 6 years ago

Which one did you do? If you added a WordSplitter, that seems general enough to be useful for others, and we'd be happy to accept a PR for it. If you overwrote the predictor in your script, great, I'm glad you got it working, I'm just not sure that it's generally useful, and it's probably better for the code to just live in your own repository.

If you have something you think would be useful, feel free to open a PR in whatever state you have your code, and one of us can give pointers for anything that needs fixing.

joelgrus commented 6 years ago

closing, since this seems to be resolved