Closed BeckySharp closed 7 years ago
I'm going to make the changes I recommended here, so that you can work on your instance type PR, then the actual model.
Ok, this is done. @bsharpataz, can you look over the changes that I made?
I like the changes, and I agree with you about splitting the stemming apart from the filtering. (By the way, the stop word list was shown to me by Tushar -- it exists in a code base somewhere, I just don't know how to load it -- apparently it's a Peter Clark list.) The thing that worries me is backwards compatibility: if an old experiment file doesn't specify "processor" but did specify, say, the NLTK WordSplitter, wouldn't this change break its expected result, since the default would be {} for the processor and then "simple" for the splitter?
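To make the concern concrete, here's a minimal sketch of the failure mode (names and structure are illustrative, not the actual deep_qa API): an old experiment file sets the splitter at the top level, but the new code only looks for it inside a nested "processor" block, so it silently falls back to "simple".

```python
def build_tokenizer(params: dict) -> dict:
    """Resolve tokenizer settings the way the new code might:
    {} as the default processor block, "simple" as the default splitter."""
    processor_params = params.pop("processor", {})
    word_splitter = processor_params.pop("word_splitter", "simple")
    return {"word_splitter": word_splitter, "processor": processor_params}

# An old experiment file that set the splitter at the top level (pre-change).
old_params = {"word_splitter": "nltk"}
resolved = build_tokenizer(old_params)
# The nested "processor" key is absent, so the NLTK setting is ignored and
# the splitter falls back to "simple" -- the non-backwards-compatible case.
print(resolved)  # {'word_splitter': 'simple', 'processor': {}}
```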
You're right that this is a non-backwards-compatible API change, but I think that's ok. There isn't anything that uses the NLTK word splitter. Going forward, we'll probably need to think about API changes and backwards compatibility, but I don't think we're to that point yet.
ok, great, thanks
Added an intermediate step in the splitting-tokenization pipeline (WordProcessor). This required adding a tiny bit of extra code to word_tokenizer and word_character_tokenizer, but it should be backwards compatible -- please check, though. When this PR is ready, I can update the docs (I didn't want to do that until the structure was finalized).
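A rough sketch of what that intermediate step could look like (all class and method names here are hypothetical, not the actual PR code): a WordProcessor that composes a word splitter, a stop-word filter, and a stemmer as separate, swappable stages, which is what lets the stemming be split apart from the filtering.

```python
class SimpleWordSplitter:
    """Illustrative default splitter: lowercase and split on whitespace."""
    def split_words(self, sentence: str) -> list:
        return sentence.lower().split()

class StopWordFilter:
    """Illustrative filter stage; the stop word list is a toy placeholder."""
    def __init__(self, stop_words=None):
        self.stop_words = stop_words or {"the", "a", "an"}
    def filter_words(self, words: list) -> list:
        return [w for w in words if w not in self.stop_words]

class PassThroughStemmer:
    """No-op stemmer, standing in for a real one (e.g. NLTK's)."""
    def stem_word(self, word: str) -> str:
        return word

class WordProcessor:
    """Runs split -> filter -> stem as one pipeline stage, so the
    tokenizers only need to call get_tokens()."""
    def __init__(self, splitter=None, word_filter=None, stemmer=None):
        self.splitter = splitter or SimpleWordSplitter()
        self.word_filter = word_filter or StopWordFilter()
        self.stemmer = stemmer or PassThroughStemmer()
    def get_tokens(self, sentence: str) -> list:
        words = self.splitter.split_words(sentence)
        words = self.word_filter.filter_words(words)
        return [self.stemmer.stem_word(word) for word in words]

processor = WordProcessor()
print(processor.get_tokens("The cat sat on a mat"))  # ['cat', 'sat', 'on', 'mat']
```

With defaults for every stage, a tokenizer that doesn't pass any processor config still gets sensible behavior, which is the backwards-compatibility point above.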