bionomia / dwc_agent

Ruby gem to cleanse Darwin Core terms containing people names prior to passing to its dependent parser. Comes with a command-line utility.
MIT License

Ambiguous name cases ~ unexpected name parsing(?) #16

Closed infinite-dao closed 1 year ago

infinite-dao commented 1 year ago

Hej-hej,

I use dwc_agent release 3.0.5.0 (https://libraries.io/rubygems/dwc_agent) and found some unexpected parsing, although these might be ambiguous name cases where even a human would not know what to do ;-) … perhaps similar to issue #15 (where the input has no name separation at all…)

# command line interface
dwcagent "ABR"
# returns []
# command line interface
dwcagent "A. Cano,E." # returns 
# {
#   "family": "Cano",
#   "given": "A.",
#   "suffix": null,
#   "particle": null,
#   "dropping_particle": null,
#   "nick": null,
#   "appellation": null,
#   "title": null
# }

dwcagent "A. Cano,E." | jq '.[] | with_entries(select(.value |.!=null))' 
# just filter out the non null values with tool jq
# {
#  "family": "Cano",
#  "given": "A."
# }

This case is also ambiguous:


Or does the tool need an additional field, ambiguous_input or similar, to signal that the program deliberately rejected the input for a reason, rather than silently failing to parse it?

Greetings Andreas

dshorthouse commented 1 year ago

Thanks @infinite-dao for the observations. First, some background on the command-line behaviour: it both parses the input and "cleans" each successfully recognized entity. See https://github.com/bionomia/dwc_agent/blob/master/bin/dwcagent. Perhaps these two steps could be separated out for more granularity. The "clean" method is primarily a suite of logic statements that attempt to correct occasional mishaps in the upstream parse method. As you've noted, the resulting output can be an empty JSON array even when you might expect at least something to be returned. Here's one example of the sort of "clean" logic at play: https://github.com/bionomia/dwc_agent/blob/master/lib/dwc_agent/cleaner.rb#L46.

dwcagent "ABR" is a tough one.

The dependent Namae gem produces [#<Name given="ABR">], whereas some of the additional regex in the present dwc_agent gem removes it entirely. The rationale is that numerous collection codes wind up in dwc:recordedBy or dwc:identifiedBy.
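To make the rationale concrete, here is a minimal sketch of the kind of filter that would discard "ABR". It is an illustration only: the constant `COLLECTION_CODE_PATTERN`, the method names, and the pattern itself are assumptions, not the regex that dwc_agent actually uses.

```ruby
# Hypothetical sketch, NOT dwc_agent's actual logic.
# Assumption: a bare run of 2 to 6 capital letters (no periods, spaces or
# lowercase letters) is more likely a collection code than a person name.
COLLECTION_CODE_PATTERN = /\A[A-Z]{2,6}\z/

# True when the whole (trimmed) string looks like a collection code.
def looks_like_collection_code?(text)
  COLLECTION_CODE_PATTERN.match?(text.strip)
end

# Removes every value that looks like a collection code.
def drop_collection_codes(values)
  values.reject { |value| looks_like_collection_code?(value) }
end

p drop_collection_codes(["ABR", "A. Cano", "CAN"])
```

The trade-off is the one raised in this issue: a collector recorded only by the initials "ABR" would be discarded as well, and the caller gets an empty result with no indication why.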

dwcagent "A. Cano,E." is poor behaviour, because the Namae gem produces [#<Name family="A. Cano" given="E.">], which is likely what is expected here: a compound family name. So, I'll try to tidy this one and write a test for it.

The Namae parser itself is based on a compiled LALR parser, so I can't imagine there's much opportunity to flag ambiguity in the input. I'm guessing that would mean presenting the output with options such as "could be this, or could be that", each with a score of certainty/uncertainty.
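For discussion, an output with "could be this, or could be that" options might look roughly like the sketch below. Everything in it is hypothetical: `Reading`, `candidate_readings`, `interpret` and the `ambiguous` flag are invented for illustration and exist in neither dwc_agent nor Namae. It only handles the single-comma case from this issue, and it leaves out certainty scores because there is no obvious basis for computing them.

```ruby
# Hypothetical sketch: not part of dwc_agent or Namae.
# One possible interpretation of an input string.
Reading = Struct.new(:label, :names)

# Lists the plausible readings of an input. Only the single-comma case
# ("X,Y") is treated as ambiguous; everything else gets one reading.
def candidate_readings(input)
  parts = input.split(",").map(&:strip).reject(&:empty?)
  unless parts.size == 2
    return [Reading.new("single name", [{ raw: input.strip }])]
  end

  [
    # "Cano, E." style: family name first, then given name or initials.
    Reading.new("family, given", [{ family: parts[0], given: parts[1] }]),
    # The comma separates two different people.
    Reading.new("two separate names", parts.map { |part| { raw: part } })
  ]
end

# Wraps the readings with a flag so callers can tell a deliberate
# judgement apart from a silent parse failure.
def interpret(input)
  readings = candidate_readings(input)
  { input: input, ambiguous: readings.size > 1, readings: readings.map(&:to_h) }
end

p interpret("A. Cano,E.")
```

A real implementation would need the candidate list to come out of the parser itself, which is exactly what a compiled LALR grammar makes difficult.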