bionomia / dwc_agent

Ruby gem to cleanse Darwin Core terms containing people names prior to passing to its dependent parser. Comes with a command-line utility.
MIT License

Parse more team members #15

Closed matdillen closed 1 year ago

matdillen commented 1 year ago

I've seen some teams provided that fail to get parsed using this gem. I think it might be possible to enable them to be parsed without any (big) collateral. Examples:

A. Aceby X. Villavicencio
G. Lettau in herb. V. J. Grummann
G. v. Reenen J. Aguirre C. C. Schulte F.
Anton Mayer und Franz Petzi
G. Fischer und K. E. Harz
J. Poelt & A. Buschardt in Hb. Buschardt

I wonder if these could not be systematically parsed as:

A. Aceby and X. Villavicencio
G. Lettau
G. Reenen and J. Aguirre and C. C. Schulte F.
Anton Mayer and Franz Petzi
G. Fischer and K. E. Harz
J. Poelt and A. Buschardt

That is, add rules so that:

I noticed the v. gets dropped from the third example in the current implementation. Not sure why that happens.

dshorthouse commented 1 year ago

Thanks for these! It's a great help. I've managed to accommodate some of these, but others will remain intractable.

The most difficult one to contend with is A. Aceby X. Villavicencio, because there are numerous examples of agent namestrings like M. Alex A. Smith that are not two agents but one: a shortened form of "Michael Alex Andrew Smith". The und was a missed separator; that was an oversight and easy to fix. The in herb. and in Hb. were likewise relatively easy to accommodate. The v. particle was erroneously stripped out in the cleaning routine and has since been restored. Version 3.0.4.0 was just pushed to RubyGems.org and should be available soon.

So...

A. Aceby X. Villavicencio and G. v. Reenen J. Aguirre C. C. Schulte F. remain too challenging to accommodate without breaking things elsewhere.
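For readers following along, the easier cases mentioned in this comment (the und separator and the trailing in herb. / in Hb. notes) can be illustrated with a minimal sketch. This is not the gem's actual cleaning routine: the constant and method names below are hypothetical, and the regular expressions only cover the examples in this thread.

```ruby
# Hypothetical pre-cleaning sketch; NOT dwc_agent's actual implementation.

# Trailing herbarium notes such as "in herb. V. J. Grummann" or "in Hb. Buschardt".
HERBARIUM_NOTE = /\s+in\s+(?:herb|hb)\.?\s.*\z/i

# Words or symbols that join two agents in the examples from this thread.
SEPARATORS = /\s+(?:und|&)\s+/i

# Strip the herbarium note and rewrite every separator as " and ".
def normalize_team(namestring)
  namestring
    .sub(HERBARIUM_NOTE, "")   # drop the note and everything after it
    .gsub(SEPARATORS, " and ") # unify "und" and "&"
    .strip
end

# Split a normalized namestring into individual agent strings.
def split_team(namestring)
  normalize_team(namestring).split(/\s+and\s+/)
end

split_team("G. Fischer und K. E. Harz")
# => ["G. Fischer", "K. E. Harz"]
split_team("J. Poelt & A. Buschardt in Hb. Buschardt")
# => ["J. Poelt", "A. Buschardt"]
```

Particles such as v. are left untouched by this sketch, so G. v. Reenen passes through unchanged.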

dshorthouse commented 1 year ago

Deployed to production and you can test the outcome in a UI at https://bionomia.net/parse

matdillen commented 1 year ago

Thanks for the quick reply and update! The M. Alex A. Smith example makes me wonder how many of those there are. I would expect it to be uncommon for initials and full-length name parts to be mixed like that.

I'll see if I can find another workaround.
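One possible downstream workaround, sketched here under assumptions, is a naive splitter that treats every run of initials followed by a single capitalized word as a separate agent. The pattern and the method name are hypothetical and not part of dwc_agent. The sketch also shows the ambiguity raised earlier in the thread: the rule that separates A. Aceby X. Villavicencio cuts M. Alex A. Smith in two as well.

```ruby
# Hypothetical heuristic, NOT part of dwc_agent: one or more initials
# ("A.", "C. C.") followed by a single capitalized word count as one agent.
INITIALS_SURNAME = /(?:\p{Lu}\.\s*)+\p{Lu}\p{L}+/

# Return every "initials + surname" chunk found in the namestring.
def naive_split(namestring)
  namestring.scan(INITIALS_SURNAME)
end

naive_split("A. Aceby X. Villavicencio")
# => ["A. Aceby", "X. Villavicencio"]   (two agents, as hoped)

naive_split("M. Alex A. Smith")
# => ["M. Alex", "A. Smith"]            (wrongly split: this is one person)
```

A rule like this would only be safe on a dataset where namestrings of the M. Alex A. Smith kind are known to be absent, which is the open question here.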