Open matdillen opened 1 year ago
These will be tricky ones. The `clean` method in the gem tries to accommodate the local practices often seen in the Paris MNHN and in the Meise Botanic Garden, where collector names are often presented in reverse order with no punctuation, e.g. `Groom Q.`, where the surname is `Groom` and the given name is `Q.`. In this and similar instances, the `parse` method produces:
[#<struct Namae::Name
family="Q.",
given="Groom",
suffix=nil,
particle=nil,
dropping_particle=nil,
nick=nil,
appellation=nil,
title=nil>]
...as does the underlying dependent gem, `Namae`, with its `parse` method. You can see above that `family` and `given` ought to be swapped. Here's the regex/logic in the `clean` method where this swap happens: https://github.com/bionomia/dwc_agent/blob/master/lib/dwc_agent/cleaner.rb#L67.
And so, with a namestring like `A. Sagástegui A.`, the `parse` method produces:
[#<struct Namae::Name
family="A.",
given="A. Sagástegui",
suffix=nil,
particle=nil,
dropping_particle=nil,
nick=nil,
appellation=nil,
title=nil>]
...(as does the underlying `Namae` gem), which matches the same pattern produced when parsing `Groom Q.`. The regex on that same L67 above merely looks for a single abbreviated family name, and it'd probably be an easy fix to look for more than one initial here to help satisfy `Aznavour G. V.`. Interestingly, if the space is removed between the "G." and "V.", then the `clean` method would do its thing to swap `given` and `family`. This'll be the easiest one to tackle.
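A minimal sketch of what "look for more than one initial" could mean, in plain Ruby with no gems. It works on the raw namestring purely for illustration; the pattern `REVERSED_NAME` and the helper `reorder_reversed` are hypothetical names and are not the regex at L67 of `cleaner.rb`.

```ruby
# Hypothetical pattern (not the gem's own): one capitalised word,
# optionally hyphenated, followed by one or more initials that may or
# may not be separated by spaces ("Q.", "G. V.", "G.V.").
REVERSED_NAME = /\A(?<family>\p{Lu}\p{Ll}+(?:-\p{Lu}\p{Ll}+)*)\s+(?<given>(?:\p{Lu}\.\s*)+)\z/

# Returns a family/given hash when the string looks like "Surname I. I.",
# or nil when it does not (e.g. when it starts with an initial).
def reorder_reversed(namestring)
  m = REVERSED_NAME.match(namestring.strip)
  return nil unless m

  { family: m[:family], given: m[:given].strip }
end

p reorder_reversed("Groom Q.")
p reorder_reversed("Aznavour G. V.")
p reorder_reversed("A. Sagástegui A.") # starts with an initial, so no match
```

Because the pattern is anchored at the start on a full word, `A. Sagástegui A.` falls through untouched, which keeps that case separate from the reversed-name case.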
> Is there a complication with treating all initials after a word as given names?
Well, I guess in the example `A. Sagástegui A.` the second "A." is not a given name, but in the example `Groom Q.` it is. To crack this nut, we'll have to tinker with the regex within the `clean` method because the underlying `Namae` gem produces the same structure for both. Pre-processing the string prior to parsing it (it's the `Namae` gem that does the actual parsing) is likely to yield undesirable results.
The other issue is when namestrings like these are presented in a list such as:
Groom Q., A. Sagástegui A., Aznavour G. V.
...here's where it gets really complicated 😄
In the example `A. Sagástegui A.`, the desirable output would be:
[#<struct Namae::Name
family="Sagástegui A.",
given="A.",
suffix=nil,
particle=nil,
dropping_particle=nil,
nick=nil,
appellation=nil,
title=nil>]
But we'd be fighting with the underlying grammar in the `Namae` gem, which would refuse to produce such an output; it was not written for such a structure, interpreting "Sagástegui" as part of the given name just as we'd interpret "Peter" in "A. Peter Smith" as part of the given name. The only way to deal with this (with the exception of crafting our own racc-based grammar and dropping `Namae` as a dependency) is to hyphenate "Sagástegui A." in a pre-parse step, but this is ugly.
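For illustration only, the hyphenation workaround could look roughly like the sketch below. It is plain Ruby with no gems; `INITIAL_WORD_INITIAL`, `hyphenate_trailing_initial` and `restore_family` are hypothetical names, and what `Namae` actually returns for the hyphenated string is an assumption that would need checking.

```ruby
# Hypothetical pre-parse step: glue a trailing initial to the preceding
# word with a hyphen, but only for the shape "I. Word I.".
INITIAL_WORD_INITIAL = /\A(\p{Lu}\.)\s+(\p{Lu}\p{Ll}+)\s+(\p{Lu}\.)\z/

def hyphenate_trailing_initial(namestring)
  namestring.strip.sub(INITIAL_WORD_INITIAL, '\1 \2-\3')
end

# Hypothetical post-parse step: undo the hyphen on the family name.
def restore_family(family)
  family.sub(/-(\p{Lu}\.)\z/, ' \1')
end

puts hyphenate_trailing_initial("A. Sagástegui A.")
puts restore_family("Sagástegui-A.")
```

The ugliness is visible here: the string has to be altered before parsing and repaired afterwards, and a genuinely hyphenated surname that ends in an initial would be repaired incorrectly.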
The other alternative is to treat the `A.` as a suffix. While incorrect in principle, whoever uses this sort of syntax is effectively treating the abbreviated surname as some sort of suffix. Then it can be covered by a rule similar to how `G. Bush Jr.` gets parsed. Currently, only the suffix Jr. gets interpreted, but there might be more (e.g. the French fils as `f.` or the Scandinavian `d.ä.`). And would it make sense to treat any abbreviation that occurs after a full word that is itself preceded by an abbreviation as a suffix?
Still, this syntax may be quite rare. I'll keep you updated when I find more.
> The other alternative is to treat the `A.` as a suffix.
Interesting idea. That's certainly possible via an addition to https://github.com/bionomia/dwc_agent/blob/master/lib/dwc_agent/constants.rb#L358. But if we were to add `[A-Z]` here, I suspect we'd still be in a situation where parsing `Groom Q.` would interpret `Q.` as a suffix, and then we'd have to employ some other logic in the `clean` method to force it to be an abbreviated given name.
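The positional rule proposed earlier (an abbreviation after a full word that is itself preceded by an abbreviation) is one candidate for that "other logic". A minimal sketch in plain Ruby, assuming whitespace-separated tokens; `INITIAL_TOKEN`, `WORD_TOKEN` and `trailing_initial_role` are hypothetical names and nothing here is taken from `constants.rb` or `cleaner.rb`.

```ruby
INITIAL_TOKEN = /\A\p{Lu}\.\z/       # a single initial such as "A."
WORD_TOKEN    = /\A\p{Lu}\p{Ll}+\z/  # a full capitalised word

# Classifies a trailing initial:
#   :suffix when the shape is "I. Word I." (abbreviated second surname),
#   :given  for any other string that ends in an initial,
#   nil     when the string does not end in a single initial at all.
def trailing_initial_role(namestring)
  tokens = namestring.strip.split(/\s+/)
  return nil unless tokens.size >= 2 && tokens[-1].match?(INITIAL_TOKEN)

  if tokens.size >= 3 && tokens[-2].match?(WORD_TOKEN) && tokens[-3].match?(INITIAL_TOKEN)
    :suffix
  else
    :given
  end
end

["Groom Q.", "A. Sagástegui A.", "Aznavour G. V.", "G. Bush Jr."].each do |name|
  puts "#{name}: #{trailing_initial_role(name).inspect}"
end
```

Under this rule `Groom Q.` and `Aznavour G. V.` keep their trailing initials as given names, and only `A. Sagástegui A.` is flagged, so a blanket `[A-Z]` suffix entry would not be needed.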
I ran into this when trying to match parsed names to Wikidata labels. `Abundio Sagástegui Alva` seems to have signed his collected specimens at times as `A. Sagástegui A.` As is still the custom, he had two surnames (from both parents), but for a reason I do not know he sometimes abbreviated his maternal surname.
The gem parses this latter string with a given name equal to `A. Sagástegui`. The `clean` method reverses this and makes `A. Sagástegui` the family name. Both are in principle incorrect, but the gem currently typically treats the surname as a single entity, so the original uncleaned parsing (i.e. only the abbreviated `Alva` as the family name) would be consistent behavior. It also concatenates into the original string again. I don't know exactly why the `clean` method does this, but is there a way to stop the behavior without breaking something else?
Another parsing issue I encountered was with `Aznavour G. V.` (i.e. Georges Vincent Aznavour), which is parsed into `Aznavour G.` and `V.` and then, after `clean`, reversed as well, concatenated in the end as `V. Aznavour G.`
Is there a complication with treating all initials after a word as given names?