AtlasOfLivingAustralia / biocache-hubs

Biocache Hub UI grails plugin

"Recorded by" not processed if it contains an apostrophe and/or >2 names #578

Open sat01a opened 11 months ago

sat01a commented 11 months ago

A user has reported an issue with the processing of the "Recorded by" field. The record at https://biocache.ala.org.au/occurrences/6536f255-a49e-45b3-ba3c-83c62862102d appears to be correct: it shows Thomas Mesaglio as the original value and [Mesaglio, Thomas] as the processed value. By contrast, the record at https://biocache.ala.org.au/occurrences/04e8e8dd-c8ff-497a-ad48-93a631b373f6 shows Louis Gerald O'Neill as the original value and a blank processed value.

Data team investigation suggests it might be due to the apostrophe, or due to the name having more than 2 parts. It's also believed to be an issue in pipelines specifically.
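To illustrate the two suspected causes, here is a purely hypothetical sketch. It is not the pipelines parser, whose code is not shown in this thread. It assumes a pattern that only accepts exactly two plain alphabetic name parts, and the function name `naive_process_recorded_by` is invented for illustration. A pattern like this would give a blank result for a name with an apostrophe or with more than two parts:

```python
import re

# Hypothetical pattern: exactly two alphabetic parts, "First Last".
# This is an assumption for illustration, not the actual pipelines code.
TWO_PART_NAME = re.compile(r"^([A-Za-z]+) ([A-Za-z]+)$")


def naive_process_recorded_by(raw):
    """Return "Last, First" for a two-part name, or None if it does not match."""
    match = TWO_PART_NAME.match(raw.strip())
    if match is None:
        # Apostrophes and third name parts both fall through to here.
        return None
    first, last = match.groups()
    return f"{last}, {first}"
```

With this sketch, "Thomas Mesaglio" is reordered, while "Louis Gerald O'Neill" fails for both reasons at once, so a test name with only one of the two features would be needed to tell the causes apart.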

3,876,171 records have a supplied recordedBy but no processed value, so this is a very visible issue.

Raised in https://support.ehelp.edu.au/a/tickets/182572

Reported by @timhicks-ala

timhicks-ala commented 11 months ago

Another example has been shared. In the record at https://biocache.ala.org.au/occurrences/f6ee114b-1d76-4c4e-93f1-2becfa1e5ef4 the Recorded by field is supplied as Petra Holland, but our processed value is Petra, Petra.

adam-collins commented 11 months ago

Due to the large variety of delimiters, abbreviations and name formats in use by data providers, parsing Recorded By is unnecessarily difficult. Putting this into the backlog for now. When there is time it would be worth including a review of all records with unprocessed Recorded By.

adam-collins commented 9 months ago

My preference is to remove the processed version

adam-collins commented 7 months ago

@peggynewman as discussed, we are now using the raw_recordedBy as recordedBy. Pull request: https://github.com/gbif/pipelines/pull/987
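A minimal sketch of what this amounts to, assuming a record is a plain dictionary (a simplification for illustration; the actual change is in the linked pull request and is not reproduced here). The function name `use_raw_recorded_by` is hypothetical:

```python
def use_raw_recorded_by(record):
    """Return a copy of the record whose recordedBy is the supplied value, unparsed."""
    updated = dict(record)
    # Move the supplied value into recordedBy; no separate raw field remains.
    raw = updated.pop("raw_recordedBy", None)
    if raw is not None:
        updated["recordedBy"] = raw
    return updated
```

Because no parsing is attempted, names such as Louis Gerald O'Neill or Petra Holland pass through exactly as supplied.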

adam-collins commented 4 months ago

To test that this has been applied, https://biocache-test.ala.org.au/fields?filter=recordedBy lists no raw_recordedBy.