Open mjy opened 6 years ago
We will need to restructure this regex-based cleansing routine and make it amenable to contributions to simple white, grey, and black lists. But we actually need more than that, or we need to rename these lists to something more explicit about the expected outcome. For example, we might have the string "Matt Yoder, University of Illinois at Urbana-Champaign" vs. "University of Illinois at Urbana-Champaign". If we had "University of Illinois at Urbana-Champaign" in a black list, we'd run the risk of not finding and then parsing the name in the former. But a string such as "[NO DATA]" is clearly an entry for a black list. There is perhaps also a need to recognize "University of Illinois at Urbana-Champaign" as its own agent, but that is currently out of scope. So...
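To make the risk concrete, here is a minimal Python sketch contrasting two ways a black list could be applied: matching the whole string versus matching anywhere inside it. The list contents and the function names (`REJECT_WHOLE_STRING`, `is_rejected_whole`, `is_rejected_substring`) are hypothetical and are not part of the existing routine.

```python
# Hypothetical black list; entries are stored lower-cased with single spaces.
REJECT_WHOLE_STRING = {
    "[no data]",
    "university of illinois at urbana-champaign",
}


def normalize(raw: str) -> str:
    """Collapse runs of whitespace and fold case so lookups are forgiving."""
    return " ".join(raw.split()).casefold()


def is_rejected_whole(raw: str) -> bool:
    """Reject only when the ENTIRE string is a black-list entry."""
    return normalize(raw) in REJECT_WHOLE_STRING


def is_rejected_substring(raw: str) -> bool:
    """Reject when any black-list entry occurs anywhere in the string."""
    text = normalize(raw)
    return any(entry in text for entry in REJECT_WHOLE_STRING)


mixed = "Matt Yoder, University of Illinois at Urbana-Champaign"
print(is_rejected_whole(mixed))      # the name survives for later parsing
print(is_rejected_substring(mixed))  # the name is lost along with the institution
```

Under these assumptions, whole-string matching keeps "[NO DATA]" and the bare institution out while leaving the mixed string available to the name parser. Substring matching is the behaviour that would throw the name away.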
Better titles for these 4 (yikes!) lists might be:
When I know a word is "never" a person's name, then it should go on a list, and this list can be used to narrow possibilities pre-parse.
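As a rough illustration of how a "never a person's name" word list could narrow possibilities before parsing, consider the Python sketch below. The word list `NEVER_A_NAME`, the function `candidate_segments`, and the choice to split on commas and semicolons are all assumptions made for the example. A real routine would need to handle inverted names such as "Yoder, Matt" differently.

```python
import re

# Hypothetical list of words assumed never to be part of a person's name.
NEVER_A_NAME = {"university", "museum", "department", "herbarium"}


def candidate_segments(raw: str) -> list[str]:
    """Return the comma/semicolon-separated segments that contain no
    word from NEVER_A_NAME, i.e. the parts still worth sending to a
    name parser."""
    segments = [s.strip() for s in re.split(r"[,;]", raw) if s.strip()]
    kept = []
    for segment in segments:
        # Extract alphabetic words only, lower-cased for comparison.
        words = {w.casefold() for w in re.findall(r"[^\W\d_]+", segment)}
        if words.isdisjoint(NEVER_A_NAME):
            kept.append(segment)
    return kept


print(candidate_segments("Matt Yoder, University of Illinois at Urbana-Champaign"))
```

Because the filter works on words within segments rather than on whole strings, the institution segment is dropped while "Matt Yoder" remains as a candidate for parsing.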