Closed BeckySharp closed 7 years ago
I'm going to make the changes I recommended here, so that you can work on your instance type PR, then the actual model.
Ok, this is done. @bsharpataz, can you look over the changes that I made?
I like the changes, and I agree with you about splitting the stemming apart from the filtering. (By the way, the stop word list was shown to me by Tushar -- it exists in a code base somewhere, I just don't know how to load it -- apparently it's a Peter Clark list.) The thing that worries me is backwards compatibility: if an old experiment file doesn't specify "processor" but did specify, say, the NLTK WordSplitter, wouldn't this change break its expected result, since the default would be {} for the processor and then "simple" for the splitter?
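To make the concern concrete, here's a minimal sketch of the failure mode (names and structure are illustrative, not the actual deep_qa API): an old experiment file sets the splitter at the top level, but the new code only looks for it inside a nested "processor" block, so it silently falls back to "simple".

```python
def build_tokenizer(params: dict) -> dict:
    """Resolve tokenizer settings the way the new code might:
    {} as the default processor block, "simple" as the default splitter."""
    processor_params = params.pop("processor", {})
    word_splitter = processor_params.pop("word_splitter", "simple")
    return {"word_splitter": word_splitter, "processor": processor_params}

# An old experiment file that set the splitter at the top level (pre-change).
old_params = {"word_splitter": "nltk"}
resolved = build_tokenizer(old_params)
# The nested "processor" key is absent, so the NLTK setting is ignored and
# the splitter falls back to "simple" -- the non-backwards-compatible case.
print(resolved)  # {'word_splitter': 'simple', 'processor': {}}
```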
You're right that this is a non-backwards-compatible API change, but I think that's ok. There isn't anything that uses the NLTK word splitter. Going forward, we'll probably need to think about API changes and backwards compatibility, but I don't think we're to that point yet.
ok, great, thanks
Added an intermediate step in the splitting-tokenization pipeline (WordProcessor). This required adding a tiny bit of extra code to word_tokenizer and word_character_tokenizer, but it should be backwards compatible -- please check, though. When this PR is ready, I can update the docs (I didn't want to do that until the structure was finalized).
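A rough sketch of what that intermediate step could look like (all class and method names here are hypothetical, not the actual PR code): a WordProcessor that composes a word splitter, a stop-word filter, and a stemmer as separate, swappable stages, which is what lets the stemming be split apart from the filtering.

```python
class SimpleWordSplitter:
    """Illustrative default splitter: lowercase and split on whitespace."""
    def split_words(self, sentence: str) -> list:
        return sentence.lower().split()

class StopWordFilter:
    """Illustrative filter stage; the stop word list is a toy placeholder."""
    def __init__(self, stop_words=None):
        self.stop_words = stop_words or {"the", "a", "an"}
    def filter_words(self, words: list) -> list:
        return [w for w in words if w not in self.stop_words]

class PassThroughStemmer:
    """No-op stemmer, standing in for a real one (e.g. NLTK's)."""
    def stem_word(self, word: str) -> str:
        return word

class WordProcessor:
    """Runs split -> filter -> stem as one pipeline stage, so the
    tokenizers only need to call get_tokens()."""
    def __init__(self, splitter=None, word_filter=None, stemmer=None):
        self.splitter = splitter or SimpleWordSplitter()
        self.word_filter = word_filter or StopWordFilter()
        self.stemmer = stemmer or PassThroughStemmer()
    def get_tokens(self, sentence: str) -> list:
        words = self.splitter.split_words(sentence)
        words = self.word_filter.filter_words(words)
        return [self.stemmer.stem_word(word) for word in words]

processor = WordProcessor()
print(processor.get_tokens("The cat sat on a mat"))  # ['cat', 'sat', 'on', 'mat']
```

With defaults for every stage, a tokenizer that doesn't pass any processor config still gets sensible behavior, which is the backwards-compatibility point above.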