bionomia / dwc_agent

Ruby gem to cleanse Darwin Core terms containing people names prior to passing to its dependent parser. Comes with a command-line utility.
MIT License

Consider including original text in output for each parsed agent in recordedBy team #13

Open nickynicolson opened 2 years ago

nickynicolson commented 2 years ago

Input: Friis I., Getachew A., Rasmussen F. & Vollesen K.

Current output:

[{"family":"Friis","given":"I.","suffix":null,"particle":null,"dropping_particle":null,"nick":null,"appellation":null,"title":null}
,{"family":"Getachew","given":"A.","suffix":null,"particle":null,"dropping_particle":null,"nick":null,"appellation":null,"title":null}
,{"family":"Rasmussen","given":"F.","suffix":null,"particle":null,"dropping_particle":null,"nick":null,"appellation":null,"title":null}
,{"family":"Vollesen","given":"K.","suffix":null,"particle":null,"dropping_particle":null,"nick":null,"appellation":null,"title":null}]

Proposed: Add an extra property "original" to each agent entry in the list:

[{"family":"Friis","given":"I.","suffix":null,"particle":null,"dropping_particle":null,"nick":null,"appellation":null,"title":null,"original":"Friis I."}
,{"family":"Getachew","given":"A.","suffix":null,"particle":null,"dropping_particle":null,"nick":null,"appellation":null,"title":null,"original":"Getachew A."}
,{"family":"Rasmussen","given":"F.","suffix":null,"particle":null,"dropping_particle":null,"nick":null,"appellation":null,"title":null,"original":"Rasmussen F."}
,{"family":"Vollesen","given":"K.","suffix":null,"particle":null,"dropping_particle":null,"nick":null,"appellation":null,"title":null,"original":"Vollesen K."}]

It would then be possible to surmise that this string consists of "agent, agent, agent & agent". Coupled with metadata hints about the source of the recordedBy data, this could help direct the parse strategy for more difficult examples.
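As a rough illustration of that idea, the sketch below derives the "agent, agent, agent & agent" pattern from the input string and the proposed "original" values. It assumes the output already carries the "original" property; `team_template` is a hypothetical helper written for this example and is not part of dwc_agent.

```ruby
require "json"

# Hypothetical helper (not part of dwc_agent): walks the input left to right
# and swaps each agent's "original" text for a placeholder, leaving the
# separators between agents untouched.
def team_template(input, agents, placeholder: "agent")
  template = String.new
  cursor = 0
  agents.each do |agent|
    original = agent["original"]
    start = input.index(original, cursor)
    return nil if start.nil? # the original text was not found verbatim

    template << input[cursor...start] << placeholder
    cursor = start + original.length
  end
  template << input[cursor..-1]
end

input = "Friis I., Getachew A., Rasmussen F. & Vollesen K."

# Abbreviated version of the proposed output (only the relevant keys).
agents = JSON.parse(<<~AGENTS)
  [{"family":"Friis","given":"I.","original":"Friis I."},
   {"family":"Getachew","given":"A.","original":"Getachew A."},
   {"family":"Rasmussen","given":"F.","original":"Rasmussen F."},
   {"family":"Vollesen","given":"K.","original":"Vollesen K."}]
AGENTS

puts team_template(input, agents)
# prints "agent, agent, agent & agent"
```

The helper returns nil when an "original" value cannot be located in the input, which would itself be a useful signal that the fragment was altered during cleaning.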

dshorthouse commented 2 years ago

Thanks @nickynicolson for the suggestion. In most cases, this would work but then there are cases like this:

Input: A. & B. Smith

Output:

[{"family":"Smith","given":"A.","suffix":null,"particle":null,"dropping_particle":null,"nick":null,"appellation":null,"title":null},{"family":"Smith","given":"B.","suffix":null,"particle":null,"dropping_particle":null,"nick":null,"appellation":null,"title":null}]

So, the original for the first item in the array is merely "A.", even though its parsed family name is Smith.

I suppose this is acceptable, but I'll have to sort out how to pass the original fragments along to the output. Some of the fragments are actually parsed, whereas others flow through regex routines.
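To make the shared-surname case concrete, here is a minimal sketch of what the output might look like with an "original" property attached. The splitter is a deliberately naive assumption for illustration only: it is not how dwc_agent cleans or parses names, and it would mis-split inputs such as "Smith, A.". The `original_fragments` helper and the `SEPARATORS` pattern are hypothetical.

```ruby
require "json"

# Hypothetical, naive separator pattern: NOT dwc_agent's actual logic.
SEPARATORS = /\s*(?:,|&|;)\s*/

# Splits a recordedBy string into the raw text fragment for each agent.
def original_fragments(input)
  input.split(SEPARATORS).map(&:strip).reject(&:empty?)
end

fragments = original_fragments("A. & B. Smith")

# Abbreviated parser output from the example above (only the relevant keys).
parsed = [
  { "family" => "Smith", "given" => "A." },
  { "family" => "Smith", "given" => "B." }
]

# Pair each parsed agent with its raw fragment, in order.
with_original = parsed.zip(fragments).map do |agent, fragment|
  agent.merge("original" => fragment)
end

puts JSON.generate(with_original)
# prints [{"family":"Smith","given":"A.","original":"A."},{"family":"Smith","given":"B.","original":"B. Smith"}]
```

The first agent's "original" carries no surname, so a consumer could not rebuild the parsed name from it alone, but the fragments still line up with the "agent & agent" shape of the input.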