bionomia / dwc_agent

Ruby gem to cleanse Darwin Core terms containing people names prior to passing to its dependent parser. Comes with a command-line utility.
MIT License

Consider using black, grey, white lists #1

Open mjy opened 6 years ago

mjy commented 6 years ago

When I know a word is "never" a person's name, then it should go on a list, and this list can be used to narrow possibilities pre-parse.

dshorthouse commented 6 years ago

Will need to restructure this regex-based cleansing routine and make it amenable to contributions to simple white, grey, and black lists. But we actually need more than that, or we need to rename these lists to something more explicit about the expected outcome. For example, we might have a string, "Matt Yoder, University of Illinois at Urbana-Champaign" vs. "University of Illinois at Urbana-Champaign". If we had "University of Illinois at Urbana-Champaign" in a black list, we'd run the risk of not finding and then parsing the name in the former. But a string such as "[NO DATA]" is clearly an entry in a black list. There's perhaps also a need to recognize "University of Illinois at Urbana-Champaign" as its own agent, but that is currently out of scope. So...

Better titles for these 4 (yikes!) lists might be:

  1. Separators
  2. Character Substitutions
  3. Removals
  4. Black List
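
To make the distinction between the four lists concrete, here is a rough Ruby sketch of how they might feed a pre-parse cleansing step. Everything in it is hypothetical: the constant names, the `cleanse` method, the sample entries, and the order in which the lists are applied are assumptions for illustration, not the gem's current API or behaviour.

```ruby
# Hypothetical sketch only: none of these names exist in dwc_agent.

# 4. Black List: whole strings that are never a person's name.
BLACK_LIST = ["[no data]", "unknown"].freeze

# 3. Removals: fragments stripped out wherever they occur,
#    leaving the rest of the string available for parsing.
REMOVALS = [/University of Illinois at Urbana-Champaign/i].freeze

# 2. Character Substitutions: swap one character sequence for another.
CHARACTER_SUBSTITUTIONS = { "&" => ";", " and " => ";" }.freeze

# 1. Separators: patterns that split one string into candidate names.
SEPARATORS = /\s*[;|]\s*/

# Returns an array of cleansed candidate strings, ready for a name parser.
def cleanse(raw)
  text = raw.to_s.strip
  # Black List entries reject the entire string (assumed case-insensitive).
  return [] if text.empty? || BLACK_LIST.include?(text.downcase)

  REMOVALS.each { |pattern| text = text.gsub(pattern, "") }
  CHARACTER_SUBSTITUTIONS.each { |from, to| text = text.gsub(from, to) }

  text.split(SEPARATORS)
      .map { |part| part.gsub(/\A[\s,]+|[\s,]+\z/, "") } # trim stray commas and spaces
      .reject(&:empty?)
end
```

With lists like these, "Matt Yoder, University of Illinois at Urbana-Champaign" would keep "Matt Yoder" because the institution is a removal rather than a black list entry, while "[NO DATA]" is rejected outright.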