bionomia / dwc_agent

Ruby gem to cleanse Darwin Core terms containing people names prior to passing to its dependent parser. Comes with a command-line utility.
MIT License

Treat a string with many commas separating single words as all family names #7

Closed: dshorthouse closed 3 years ago

dshorthouse commented 4 years ago

Chaboo, Bennett, Shin

Those are all family names but the parser says:

[#<Name family="Chaboo" given="Bennett">, #<Name given="Shin">]

dshorthouse commented 4 years ago

Another possibility: a.gsub(/^(\S{1,}\.?){1,}\s*(?i:and|&)\s*(\S{1,}\.?){1,}\s*(.*)$/, '\1 \3|\2 \3')
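For anyone reading along, here is a minimal sketch of what that substitution appears to do. The input "J. and M. Smith" is made up, and `SHARED_SURNAME` and `split_shared_surname` are hypothetical names used only for illustration. The pattern seems aimed at two given names that share one family name; a plain comma-separated list with no "and" or "&" passes through untouched.

```ruby
# Sketch only: the regex is copied verbatim from the comment above;
# the constant and method names are hypothetical, not part of dwc_agent.
SHARED_SURNAME = /^(\S{1,}\.?){1,}\s*(?i:and|&)\s*(\S{1,}\.?){1,}\s*(.*)$/

# Rewrites "<given1> and <given2> <family>" as
# "<given1> <family>|<given2> <family>".
def split_shared_surname(str)
  str.gsub(SHARED_SURNAME, '\1 \3|\2 \3')
end

puts split_shared_surname("J. and M. Smith")       # J. Smith|M. Smith
puts split_shared_surname("Chaboo, Bennett, Shin") # unchanged, no "and" or "&"
```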

LocoDelAssembly commented 3 years ago

Seems fixed in the sense that it identifies three people, but it assumes they're all given names rather than family names.

2.7.2 :001 > DwcAgent.parse "Chaboo, Bennett, Shin"
 => [#<Name given="Chaboo">, #<Name given="Bennett">, #<Name given="Shin">] 
2.7.2 :002 > DwcAgent.parse "Hardy, Andrews & Giuliani"
 => [#<Name given="Hardy">, #<Name given="Andrews">, #<Name given="Giuliani">] 
2.7.2 :003 > DwcAgent::Version.version
 => "1.5.1.6"

dshorthouse commented 3 years ago

Thanks for having a look at this. Indeed, what's expected here is that each parsed name then needs to be cleaned:

names = DwcAgent.parse "Chaboo, Bennett, Shin"
DwcAgent.clean names[0]
=> {:title=>nil, :appellation=>nil, :given=>nil, :particle=>nil, :family=>"Chaboo", :suffix=>nil}

LocoDelAssembly commented 3 years ago

Many thanks for pointing out DwcAgent.clean! Was not aware of it. Should it be included in README.md?
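For reference, the two-step flow described above (parse, then clean each result) could be sketched roughly as below. `family_names` is a hypothetical helper, not part of the gem, and it assumes `DwcAgent.clean` returns something indexable by `:family`, as in the output shown earlier. The `agent` parameter exists only so the helper can be exercised without the gem installed.

```ruby
# Sketch only: assumes the dwc_agent gem provides DwcAgent.parse and
# DwcAgent.clean as used earlier in this thread. `family_names` is a
# hypothetical helper name.
def family_names(raw, agent = DwcAgent)
  # Parse the raw string into names, clean each one, and keep the family part.
  agent.parse(raw).map { |name| agent.clean(name)[:family] }
end

# Usage (after `require "dwc_agent"`):
#   family_names("Chaboo, Bennett, Shin")
```

Only the first cleaned name is shown in this thread, so the result for the other two names is not assumed here.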

dshorthouse commented 3 years ago

Good point. I'll add that.