bionomia / dwc_agent

Ruby gem to cleanse Darwin Core terms containing people names prior to passing to its dependent parser. Comes with a command-line utility.
MIT License

Some experiences with the clean method and surnames #17

Open matdillen opened 1 year ago

matdillen commented 1 year ago

I ran into this when trying to match parsed names to Wikidata labels. Abundio Sagástegui Alva seems to have signed his collected specimens at times as A. Sagástegui A. As is still the custom, he had two surnames (from both parents), but for a reason I do not know he abbreviated his maternal surname sometimes.

The gem parses this latter string with A. Sagástegui as the given name. The clean method reverses this and makes A. Sagástegui the family name. Both are in principle incorrect, but the gem typically treats the surname as a single entity, so the original uncleaned parsing (i.e. only the abbreviated Alva as the family name) would be the consistent behavior. The cleaned result also concatenates back into the original string. I don't know exactly why the clean method does this, but is there a way to stop the behavior without breaking something else?

Another parsing issue I encountered was with Aznavour G. V. (i.e. Georges Vincent Aznavour), which is parsed into Aznavour G. and V., then reversed as well by clean, and concatenated in the end as V. Aznavour G. Is there a complication with treating all initials after a word as given names?

dshorthouse commented 1 year ago

These will be tricky ones. The clean method in the gem tries to accommodate the local practices often seen in the Paris MNHN and in the Meise Botanic Garden, where collector names are often presented in reverse order with no punctuation, e.g. Groom Q., where the surname is Groom and the given name is Q. In this and similar instances, the parse method produces:

[#<struct Namae::Name                             
  family="Q.",                                    
  given="Groom",                                  
  suffix=nil,
  particle=nil,
  dropping_particle=nil,
  nick=nil,
  appellation=nil,
  title=nil>] 

...as does the parse method of the underlying dependency, the Namae gem. You can see above that the family and given ought to be swapped. Here's the regex/logic in the clean method where this swap happens: https://github.com/bionomia/dwc_agent/blob/master/lib/dwc_agent/cleaner.rb#L67.

And so, with a namestring like A. Sagástegui A. the parse method produces:

[#<struct Namae::Name                 
  family="A.",                        
  given="A. Sagástegui",              
  suffix=nil,                         
  particle=nil,                       
  dropping_particle=nil,              
  nick=nil,                           
  appellation=nil,                    
  title=nil>]

...(as does the underlying Namae gem), which matches the same pattern produced when parsing "Groom Q.". The regex on that same L67 above merely looks for a single abbreviated family name, and it'd probably be an easy fix to look for more than one initial here to help satisfy "Aznavour G. V.". Interestingly, if the space between the "G." and "V." is removed, the clean method does its thing and swaps given and family. This'll be the easiest one to tackle.
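The distinction between these cases can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the code at cleaner.rb#L67: `ParsedName`, `unreverse` and the three patterns are hypothetical names, and the input values are the family/given pairs reported in this thread.

```ruby
# Hypothetical stand-in for Namae::Name, reduced to the two fields discussed here.
ParsedName = Struct.new(:family, :given)

# One or more initials, with or without spaces: "Q.", "G.V.", "G. V."
INITIALS = /\A(?:\p{Lu}\.\s*)+\z/
# A single capitalised word: "Groom"
SINGLE_WORD = /\A\p{Lu}\p{Ll}+\z/
# A capitalised word followed by initials: "Aznavour G."
WORD_THEN_INITIALS = /\A(?<word>\p{Lu}\p{Ll}+)\s+(?<initials>(?:\p{Lu}\.\s*)+)\z/

# Hypothetical helper: undo a reversed "Surname Initials" parse where it is unambiguous.
def unreverse(name)
  return name unless INITIALS.match?(name.family.to_s)

  if SINGLE_WORD.match?(name.given.to_s)
    # family "Q.", given "Groom" -> family "Groom", given "Q."
    ParsedName.new(name.given, name.family)
  elsif (m = WORD_THEN_INITIALS.match(name.given.to_s))
    # family "V.", given "Aznavour G." -> family "Aznavour", given "G. V."
    ParsedName.new(m[:word], "#{m[:initials].strip} #{name.family}")
  else
    # e.g. given "A. Sagástegui": ambiguous, so leave the parse untouched
    name
  end
end

unreverse(ParsedName.new("Q.", "Groom"))         # family "Groom", given "Q."
unreverse(ParsedName.new("V.", "Aznavour G."))   # family "Aznavour", given "G. V."
unreverse(ParsedName.new("A.", "A. Sagástegui")) # returned unchanged
```

The third branch is the design choice that matters: a given name that itself starts with an initial is left alone rather than swapped, which avoids turning "A. Sagástegui" into the family name.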

> Is there a complication with treating all initials after a word as given names?

Well, I guess in the example A. Sagástegui A. the second "A." is not a given name, but in the example Groom Q. it is. To crack this nut, we'll have to tinker with the regex within the clean method because the underlying Namae gem produces the same thing in both cases. Pre-processing the string prior to parsing it (it's the Namae gem that does the actual parsing) is likely to yield undesirable results.

The other issue is when namestrings like these are presented in a list such as:

Groom Q., A. Sagástegui A., Aznavour G. V.

...here's where it gets really complicated 😄

dshorthouse commented 1 year ago

In the example A. Sagástegui A., the desirable output would be:

[#<struct Namae::Name                 
  family="Sagástegui A.",                        
  given="A.",              
  suffix=nil,                         
  particle=nil,                       
  dropping_particle=nil,              
  nick=nil,                           
  appellation=nil,                    
  title=nil>]

But we'd be fighting the underlying grammar in the Namae gem, which would refuse to produce such an output. It was not built for such a structure and interprets "Sagástegui" as part of the given name, just as we'd interpret "Peter" in "A. Peter Smith" as part of the given name. The only way to deal with this (short of crafting our own racc-based grammar and dropping Namae as a dependency) is to hyphenate "Sagástegui A." in a pre-parse step, but this is ugly.
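For illustration, the hyphenating pre-parse step could look roughly like the sketch below. The helper name and pattern are hypothetical and not part of dwc_agent, and how Namae then parses the hyphenated string would still need to be checked.

```ruby
# Matches "<initial> <word> <initial>", e.g. "A. Sagástegui A."
ABBREVIATED_SECOND_SURNAME = /\A(\p{Lu}\.)\s+(\p{Lu}\p{Ll}+)\s+(\p{Lu}\.)\z/

# Hypothetical pre-parse helper: joins the surname and the trailing initial
# with a hyphen so that they travel together through the parser.
def hyphenate_second_surname(namestring)
  namestring.strip.sub(ABBREVIATED_SECOND_SURNAME, '\1 \2-\3')
end

hyphenate_second_surname("A. Sagástegui A.") # => "A. Sagástegui-A."
hyphenate_second_surname("Groom Q.")         # => "Groom Q."
hyphenate_second_surname("Aznavour G. V.")   # => "Aznavour G. V."
```

The ugliness shows up quickly: a string such as "A. Peter S." has the same shape and would be rewritten to "A. Peter-S.", whether or not "S." is an abbreviated surname.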

matdillen commented 1 year ago

The other alternative is to treat the A. as a suffix. While incorrect in principle, whoever uses this syntax is effectively treating the abbreviated surname as some sort of suffix. It could then be covered by a rule similar to how G. Bush Jr. gets parsed. Currently, only the suffix Jr. gets interpreted, but there might be more (e.g. the French fils as f. or the Scandinavian d.ä.). Would it make sense to treat as a suffix any abbreviation that follows a full word which is itself preceded by an abbreviation?

Still, this syntax may be quite rare. I'll keep you updated when I find more.

dshorthouse commented 1 year ago

> The other alternative is to treat the A. as a suffix.

Interesting idea. That's certainly possible via addition to https://github.com/bionomia/dwc_agent/blob/master/lib/dwc_agent/constants.rb#L358. But, if we were to add [A-Z] here, I suspect we'd still be in a situation where parsing Groom Q. would interpret Q. as a suffix & then we'd have to employ some other logic to force it in the clean method to make it an abbreviated given name.
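A rough sketch of that concern, using hypothetical patterns rather than the actual contents of constants.rb: a broad `[A-Z]` suffix rule also captures the "Q." in Groom Q., whereas a rule that requires a leading initial and a full word before the trailing initial (as proposed above) does not.

```ruby
# Broad rule: any trailing single capital initial counts as a suffix.
BROAD_SUFFIX = /\s(?<suffix>[A-Z]\.)\z/
# Narrower rule: leading initial(s), then a full word, then a trailing initial.
CONTEXTUAL_SUFFIX = /\A(?:\p{Lu}\.\s*)+\p{Lu}\p{Ll}+\s+(?<suffix>\p{Lu}\.)\z/

# Hypothetical helper: returns the matched suffix, or nil when the pattern does not apply.
def suffix_for(namestring, pattern)
  m = pattern.match(namestring.strip)
  m && m[:suffix]
end

suffix_for("A. Sagástegui A.", BROAD_SUFFIX)      # => "A."
suffix_for("Groom Q.", BROAD_SUFFIX)              # => "Q." (unwanted)
suffix_for("A. Sagástegui A.", CONTEXTUAL_SUFFIX) # => "A."
suffix_for("Groom Q.", CONTEXTUAL_SUFFIX)         # => nil
suffix_for("Aznavour G. V.", CONTEXTUAL_SUFFIX)   # => nil
```

Whether a contextual rule like this can be expressed through the gem's suffix handling, or only as extra logic in the clean method, is an open question here.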