bionomia / dwc_agent

Ruby gem to cleanse Darwin Core terms containing people names prior to passing to its dependent parser. Comes with a command-line utility.
MIT License

Deal with square brackets #4

Closed dshorthouse closed 5 years ago

dshorthouse commented 5 years ago

This is a consequence of the logic in the stand-alone DwcAgent gem used to parse, https://github.com/dshorthouse/dwc_agent. There's a ton of content in recordedBy or identifiedBy with cruft in square brackets, either wrapped around the whole string (legible or not) or around only part of it. Most of the examples are the latter, e.g. "A. Gray [May, 1800]". So, I chose to say (perhaps questionably), "all stuff in brackets is a note-to-self, not a person's name". Even the brackets in your example must mean something, right?
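To make the rule concrete, here is a minimal sketch of "drop everything in square brackets" in plain Ruby. The helper name strip_bracketed is hypothetical and this is not the gem's actual code, just an illustration of the behaviour described above and of why a fully wrapped string ends up empty.

```ruby
# Hypothetical helper, NOT part of the DwcAgent API: a sketch of the
# "all stuff in brackets is a note-to-self" rule.
def strip_bracketed(text)
  # Replace each [...] run with a space, then collapse and trim whitespace.
  text.gsub(/\[[^\]]*\]/, " ").squeeze(" ").strip
end

# Partially bracketed: the note is dropped and the name survives.
partial = strip_bracketed("A. Gray [May, 1800]")

# Fully wrapped: nothing is left to hand to the name parser.
wrapped = strip_bracketed("[A. Gray (scripsit) W. T. Kittredge, 2014]")

puts partial.inspect
puts wrapped.inspect
```

The second call is the case from the original report: the whole string sits inside the brackets, so the rule throws all of it away.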

That said, I'll see if I can adjust the logic in the gem and let content through when fully wrapped by square brackets. Interestingly though, when I manually remove the brackets in your example, the parsed result is:

DwcAgent.parse "A. Gray (scripsit) W. T. Kittredge, 2014"
=> [#<Name family="Kittredge" given="A. Gray W. T.">]

😢

Originally posted by @dshorthouse in https://github.com/dshorthouse/bloodhound/issues/82#issuecomment-515774905

dshorthouse commented 5 years ago

@kcopas Almost got this one to work in the 0.3.1 version of this gem, but the remaining problem is the delimiter between "Gray" and "W. T". There's not enough here to safely assert there are two names because there are many, many examples like "Alexander (Alex) B. T. Smith" where there is only one name. So, what will work is "[A. Gray (scripsit); W. T. Kittredge (2014)]", though I'm betting there aren't a lot of examples like that.
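A rough sketch of why the semicolon form is safe and the original is not, again in plain Ruby. The helpers unwrap_brackets and split_agents are hypothetical names for illustration, not what the gem does internally; the only assumption carried over from the comment above is that ";" can be trusted as a separator while a parenthetical alone cannot.

```ruby
# Hypothetical helpers, NOT part of the DwcAgent API.

# Let content through only when the whole string is wrapped in one pair
# of square brackets; otherwise return the (trimmed) string unchanged.
def unwrap_brackets(text)
  stripped = text.strip
  match = stripped.match(/\A\[([^\[\]]*)\]\z/)
  match ? match[1].strip : stripped
end

# Split into candidate names only on an explicit semicolon.
def split_agents(text)
  unwrap_brackets(text).split(";").map(&:strip).reject(&:empty?)
end

# Explicit delimiter: two candidates.
puts split_agents("[A. Gray (scripsit); W. T. Kittredge (2014)]").inspect

# No delimiter: one candidate, same shape as "Alexander (Alex) B. T. Smith".
puts split_agents("[A. Gray (scripsit) W. T. Kittredge, 2014]").inspect
```

Without the semicolon there is nothing to tell "A. Gray (scripsit) W. T. Kittredge" apart from a single person with a parenthetical nickname, so the sketch keeps it as one candidate rather than guessing.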