AtlasOfLivingAustralia / biocache-service

Occurrence & mapping webservices
https://biocache-ws.ala.org.au/ws/

DwC Download only shows one entry in recordedBy column #712

Closed nickdos closed 2 years ago

nickdos commented 3 years ago

@nickdos commented on Thu Sep 23 2021

Reported by a user - https://support.ehelp.edu.au/a/tickets/117223.

See the download at https://doi.ala.org.au/doi/10.26197/ala.c10ba085-7a89-42d7-b899-23af93b75858

Example record is row 5 with UUID f3bff74c-8966-41c0-b9a1-080b8a78143c, which shows (column X in CSV):

recordedBy: Barnett, A.M.

The record itself shows a different value:

Collector  |  Austin, A.F.|Barnett, A.M.
               Supplied as "A.F. AUSTIN, A.M. BARNETT"

@nielsklazenga commented on Thu Sep 23 2021

This is because collector used to be a string, but is now a multi-value string. The same thing causes the square brackets around the collectors in the search results. There is also a collectors field. I am not sure which one is the recordedBy field, but an API search with recordedBy in the field list includes both. I think it would be good if one of them could be a string. If the parsing is the only processing that is done, that could be the provided name string.
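
For reference, a minimal sketch of the kind of API search mentioned above, restricted to the example record and the recordedBy field. The `occurrences/search` path and the `q`, `fl` and `pageSize` parameters are assumptions here and should be checked against the biocache-service API documentation; the helper name is hypothetical. The sketch only builds the URL and does not send the request.

```python
from urllib.parse import urlencode

# Base URL as listed for this repository.
BASE_URL = "https://biocache-ws.ala.org.au/ws/"


def build_search_url(record_uuid, fields):
    """Build a search URL for one record, limited to the given fields.

    Hypothetical helper: the endpoint path and parameter names are assumed.
    """
    params = {
        "q": f"id:{record_uuid}",   # assumed query syntax for a single record
        "fl": ",".join(fields),     # assumed name of the field-list parameter
        "pageSize": 1,
    }
    return BASE_URL + "occurrences/search?" + urlencode(params)


url = build_search_url("f3bff74c-8966-41c0-b9a1-080b8a78143c", ["recordedBy"])
print(url)
```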


@brucehyslop commented on Thu Sep 23 2021

The collector and collectors are both mappings to the recordedBy (Solr) field. Searches on any of these fields will return the same results.

In biocache-service the endpoints:


@brucehyslop commented on Fri Sep 24 2021

The change made in PR AtlasOfLivingAustralia/biocache-service#698 will fix the issue of only one (the last) entry of the recordedBy multi-value field.
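
To illustrate the symptom (this is a sketch, not the biocache-service code): if each value of a multi-value field overwrites the previous one when a CSV cell is written, only the last value survives, whereas joining the values keeps them all. The function names and the `|` delimiter are assumptions for illustration.

```python
def cell_last_value(values):
    """Symptom: each value overwrites the previous one, so only the last is kept."""
    cell = ""
    for value in values:
        cell = value
    return cell


def cell_joined(values, delimiter="|"):
    """Expected behaviour: all values are kept, separated by a delimiter."""
    return delimiter.join(values)


collectors = ["Austin, A.F.", "Barnett, A.M."]
print(cell_last_value(collectors))  # only the last collector
print(cell_joined(collectors))      # both collectors
```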

Since the download fields are passed from biocache-hub when triggering the download, it is relatively easy to resolve via config:


@nickdos commented on Fri Sep 24 2021

~DwC download format (list of fields) is dynamically set by parsing the /ws/index/fields file and extracting rows with dwcTerm: "XXX", attribute. So I'd rather not hack in exceptions.~ Forget that - forgot about the downloads.dwcExtraFields config option - can use that.
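
A rough sketch of that field selection, assuming each entry in the /ws/index/fields response is an object with a `name` and an optional `dwcTerm`, and that the fields listed in downloads.dwcExtraFields are simply appended. The sample entries and the function name are hypothetical, not taken from the real index.

```python
def dwc_download_fields(index_fields, extra_fields=()):
    """Return the names of fields that carry a dwcTerm, plus any extra fields."""
    fields = [f["name"] for f in index_fields if f.get("dwcTerm")]
    for name in extra_fields:
        if name not in fields:  # avoid requesting the same field twice
            fields.append(name)
    return fields


# Hypothetical sample of index field descriptors.
sample_index_fields = [
    {"name": "recordedBy", "dwcTerm": "recordedBy"},
    {"name": "raw_recordedBy"},
    {"name": "assertions"},
    {"name": "scientificName", "dwcTerm": "scientificName"},
]

selected = dwc_download_fields(sample_index_fields, extra_fields=["raw_recordedBy"])
print(selected)
```

With this approach the raw field is requested through configuration rather than by special-casing recordedBy in the parsing step.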


@nickdos commented on Fri Sep 24 2021

Updated prod load-balanced hubs and ansible inventories. Tried a download and can see it's requesting the raw_recordedBy field now.


@nielsklazenga commented on Fri Sep 24 2021

My two-cents' worth is that, since dwc:recordedBy is a string and therefore the raw recordedBy is always a string, the processed recordedBy should be a string as well. If, for some internal purpose, it is necessary to index the parsed values, the multi-value field should be given a different name and should not co-opt the Darwin Core term.

Also, recordedBy, identifiedBy and georeferencedBy are the same type of field (have the same object / range / target), so I think they should be treated consistently in ALA. Yet, identifiedBy and georeferencedBy are not multi-value, while recordedBy is.

This is probably more something for stage 3 of the infrastructure project (?) and we are in the biocache-hubs repository, so I will shut up about this now.

nickdos commented 2 years ago

Tested on NCI test and looks good.

https://doi-test.ala.org.au/doi/10.80416/ala.581115cc-67af-4047-8b33-27cd64ce45c8

Second record has 2 collectors and the recordedBy field shows: "Austin, A.F. | Barnett, A.M."

timhicks-ala commented 2 years ago

Hi @nickdos - I'll close this one off as being complete historically - recordedBy now appears to contain all collectors in a downloaded set.