Closed matdillen closed 1 year ago
Thanks for these! It's a great help. I've managed to accommodate some of these, but others will remain intractable.
The most difficult one to contend with is `A. Aceby X. Villavicencio`, because there are numerous examples of agent namestrings like `M. Alex A. Smith` that are not two agents but one: a shortened form of "Michael Alex Andrew Smith". The `und` was a missed separator; that one was easy and was an oversight. The `in herb.` and `in Hb.` were likewise relatively easy to accommodate. The `v.` particle was erroneously stripped out in the cleaning routine and has since been restored. Version 3.0.4.0 was just pushed to RubyGems.org and should be available soon.
So... `A. Aceby X. Villavicencio` and `G. v. Reenen J. Aguirre C. C. Schulte F.` remain too challenging to accommodate without breaking things elsewhere.
Deployed to production and you can test the outcome in a UI at https://bionomia.net/parse
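To illustrate why the two-agent string and the one-agent string are hard to tell apart, here is a minimal sketch that reduces each namestring to a token shape. The helper `token_shape` is hypothetical and is not part of the gem; it only assumes that an "initial" is a single uppercase letter followed by a period.

```ruby
# Illustrative only: `token_shape` is a hypothetical helper, not the gem's API.
# "I" marks an initial (one uppercase letter plus a period), "W" any other token.
def token_shape(namestring)
  namestring.split.map { |token| token.match?(/\A\p{Lu}\.\z/) ? "I" : "W" }.join(" ")
end

puts token_shape("A. Aceby X. Villavicencio") # two agents
puts token_shape("M. Alex A. Smith")          # one agent
```

Both calls print the same shape, `I W I W`, so a rule based on token shape alone cannot separate the two cases.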
Thanks for the quick reply and update! The `M. Alex A. Smith` example makes me wonder how many of those there are. I would expect it's not that common to have initials and full-length name parts mixed like that.
I'll see if I can find another workaround.
I've seen some teams that fail to get parsed using this gem. I think it might be possible to enable them to be parsed without any (big) collateral damage. Examples:
A. Aceby X. Villavicencio
G. Lettau in herb. V. J. Grummann
G. v. Reenen J. Aguirre C. C. Schulte F.
Anton Mayer und Franz Petzi
G. Fischer und K. E. Harz
J. Poelt & A. Buschardt in Hb. Buschardt
I wonder if these could not be systematically parsed as:
`A. Aceby` and `X. Villavicencio`
`G. Lettau`
`G. Reenen` and `J. Aguirre` and `C. C. Schulte F.`
`Anton Mayer` and `Franz Petzi`
`G. Fischer` and `K. E. Harz`
`J. Poelt` and `A. Buschardt`
That is, add rules so that:
- `und` is recognized as another conjunction (similar to how it already works for `and`, `et` and `en`)
- `in herb` or variations of it is dropped
- `initials` `non-trivial string` `initials` `non-trivial string` and so on is split into elements, as long as the string does start with an initial. Sometimes it may end with another initial; this one would then be considered part of the last name string (e.g. the F. in the third example).

This is still not perfect, e.g. the correct second name in the third example is actually J. Aguirre C., but it should enable more teams to be parsed effectively. It may be a bit too aggressive an implementation, but I would presume (hopefully correctly) that unparsed teams are more undesirable than leftover errors in the resulting splits or poor splits of unparseable strings.

I noticed the `v.` gets dropped from the third example in the current implementation. Not sure why that happens.
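The three proposed rules could be sketched roughly as follows. This is a standalone illustration, not the gem's implementation: `split_team`, `split_on_initials` and the constants are hypothetical names, and two details are assumptions on my part: that everything after `in herb.` / `in Hb.` is dropped along with it, and that `&` counts as a conjunction.

```ruby
# Hypothetical sketch of the proposed rules; not the gem's actual code.
CONJUNCTIONS = /\s+(?:and|et|en|und|&)\s+/i           # rule 1: `und` joins the known conjunctions
HERBARIUM    = /\s+in\s+(?:herb|hb)\.?(?:\s.*)?\z/i  # rule 2: drop "in herb." and what follows (assumption)
INITIAL      = /\A\p{Lu}\.\z/                         # one uppercase letter plus a period

def initial?(token)
  token.to_s.match?(INITIAL)
end

# Rule 3: within one segment, start a new name at an initial that follows a
# non-initial token, unless no non-initial token comes after it (trailing initial).
def split_on_initials(part)
  tokens = part.split
  return [part] unless initial?(tokens.first)

  groups = [[]]
  tokens.each_with_index do |token, i|
    starts_new = initial?(token) &&
                 groups.last.any? { |t| !initial?(t) } &&
                 tokens.drop(i + 1).any? { |t| !initial?(t) }
    groups << [] if starts_new
    groups.last << token
  end
  groups.map { |group| group.join(" ") }
end

def split_team(namestring)
  cleaned = namestring.sub(HERBARIUM, "")
  cleaned.split(CONJUNCTIONS).flat_map { |part| split_on_initials(part) }
end

p split_team("G. v. Reenen J. Aguirre C. C. Schulte F.")
p split_team("J. Poelt & A. Buschardt in Hb. Buschardt")
p split_team("M. Alex A. Smith")
```

In this sketch the lowercase `v.` is not treated as an initial, so it stays attached to `Reenen`. The last call shows the cost of the heuristic: `M. Alex A. Smith` is split into `M. Alex` and `A. Smith`, even though it is a single agent.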