why?
I'm about to submit an expensive job to an AWS EMR cluster, so the matcher has to be as flexible as possible
in order to extract as much data as I can.
what?
[x] make articles (a, an, the) optional.
[x] handle alternative forms (e.g. shoot 'em up, shoot-em-up).
[x] handle edge cases caused by special characters, e.g. sex, drugs, rock 'n' roll (the conjunction could be and, &, n, etc.).
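A minimal sketch of the three items above, using plain regular expressions from the standard library rather than the project's actual matcher (which these notes do not show). The function name `idiom_to_regex`, the separator set, and the list of conjunction spellings are all assumptions made for illustration.

```python
import re

# Assumed, non-exhaustive sets; extend them as new edge cases turn up.
ARTICLES = {"a", "an", "the"}
CONJUNCTIONS = {"and", "&", "n", "'n", "n'", "'n'"}
SEP = r"[\s,\-]+"  # words may be joined by spaces, commas or hyphens


def idiom_to_regex(idiom: str) -> re.Pattern:
    """Build a case-insensitive regex that tolerates common surface variation."""
    # Split on spaces and hyphens so "shoot-em-up" and "shoot 'em up" tokenize alike.
    words = [w.strip(",.;:") for w in re.split(r"[\s\-]+", idiom.strip())]
    words = [w for w in words if w]
    pieces = []
    for i, word in enumerate(words):
        lower = word.lower()
        is_last = i == len(words) - 1
        if lower in CONJUNCTIONS:
            # Any spelling of the conjunction is accepted.
            token = r"(?:and|&|'?n'?)"
        else:
            # Apostrophes at the edges of a word ('em vs em) are optional.
            token = r"'?" + re.escape(word.strip("'")) + r"'?"
        if is_last:
            pieces.append(token)
        elif lower in ARTICLES:
            # The article and its trailing separator may be dropped together.
            pieces.append(f"(?:{token}{SEP})?")
        else:
            pieces.append(token + SEP)
    return re.compile(r"(?<!\w)" + "".join(pieces) + r"(?!\w)", re.IGNORECASE)
```

For example, `idiom_to_regex("kick the bucket")` accepts both "kick the bucket" and "kick bucket", and `idiom_to_regex("shoot 'em up")` accepts "shoot-em-up". An article in final position stays mandatory in this sketch.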
how?
As always, test-first development: write failing tests in test_idiom_matcher, then fix the matcher until they pass.
Use wiktionaryparser to obtain alternative forms, and build additional patterns from them under the same lemma.
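The two steps above could look roughly like this. It is only a sketch: `ALTERNATIVE_FORMS` is hand-written stand-in data rather than real wiktionaryparser output, and `build_lemma_index` and the test names are hypothetical, not the contents of the real test_idiom_matcher.

```python
import unittest

# Hypothetical stand-in for alternative forms looked up from Wiktionary.
ALTERNATIVE_FORMS = {
    "shoot 'em up": ["shoot-em-up", "shoot em up"],
}


def build_lemma_index(alternative_forms):
    """Map every surface form (lowercased) back to its lemma."""
    index = {}
    for lemma, forms in alternative_forms.items():
        for form in [lemma, *forms]:
            index[form.lower()] = lemma
    return index


class TestIdiomMatcher(unittest.TestCase):
    """Written first; each test fails until the index supports the case."""

    def setUp(self):
        self.index = build_lemma_index(ALTERNATIVE_FORMS)

    def test_lemma_maps_to_itself(self):
        self.assertEqual(self.index["shoot 'em up"], "shoot 'em up")

    def test_alternative_form_maps_to_lemma(self):
        self.assertEqual(self.index["shoot-em-up"], "shoot 'em up")
```

Keeping every alternative form under one lemma means matches for "shoot-em-up" and "shoot 'em up" are counted as the same idiom downstream.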