marcos-lg closed this issue 2 years ago
@dshorthouse we are changing some fields to be arrays instead of strings (see above), and some of these fields are included in the Bionomia downloads. I changed them to be arrays there too; you can see the changes here.
Is this OK with you? You can also test it in UAT if you want. It's not in production yet.
Thanks, @marcos-lg. I'm not sure what the implications are here, but it sounds like you have introduced a mechanism to explode a string into an array for recordedBy and identifiedBy, and that these will be expressed as arrays in the Avro exports. Correct? If so, this will be a severely breaking change for the Bionomia download format, which expects these to be verbatim strings, unless the arrays can be concatenated to be precisely the same as what the publisher sent. Instead of making use of these arrays, I'd be far more comfortable using verbatim fields. Exploding recordedBy or identifiedBy into an array is more complicated than for other fields in DwC. See https://github.com/bionomia/dwc_agent/blob/master/lib/dwc_agent/constants.rb#L130.
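To illustrate the round-trip concern above, here is a minimal Python sketch. The separator pattern and the `explode`/`rejoin` helpers are hypothetical and far simpler than what dwc_agent or the GBIF pipelines actually do; the point is only that once a string is split, the original separators and spacing are gone, so re-joining does not necessarily reproduce what the publisher sent.

```python
import re

# Hypothetical separator pattern for illustration only; real parsers
# (e.g. dwc_agent) apply many more rules than this.
SEPARATORS = re.compile(r"\s*(?:;|\||&|\band\b)\s*")


def explode(verbatim: str) -> list[str]:
    """Split a verbatim agent string into a list of non-empty names."""
    return [part for part in SEPARATORS.split(verbatim) if part]


def rejoin(names: list[str], delimiter: str = "|") -> str:
    """Concatenate names with a single, fixed delimiter (an assumption)."""
    return delimiter.join(names)


verbatim = "Smith, J. & Jones, A.; Lee, K."
names = explode(verbatim)
round_tripped = rejoin(names)

print(names)
print(round_tripped)
# The original mix of separators ("&", ";") and spacing is lost.
print(round_tripped == verbatim)
```

Because the array form discards the publisher's own separators, a consumer that needs the exact original string has to read the verbatim field rather than rebuild it from the array.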
Yes, @dshorthouse. We are now interpreting those fields, and we converted them into arrays because they sometimes contain more than one value; this way we can improve search in our portal and in downloads.
But it's OK, I'll change the Bionomia download to use the verbatim fields for recordedBy and identifiedBy. This way you shouldn't notice any difference.
I just took a closer look at how @MattBlissett had made the queries at https://github.com/gbif/occurrence/blob/dev/occurrence-download/src/main/resources/download-workflow/bionomia/hive-scripts/execute-bionomia-query.q#L89 and it looks like he uses v_recordedBy and v_identifiedBy (the verbatim equivalents), so I'm not sure the above changes will affect the Bionomia download at all.
Right. Then we just need to remove recordedBy and identifiedBy and leave only the verbatim ones. Until now the verbatim and the interpreted fields were the same, so it seems they were redundant.
Aha - I drop those two columns in the Spark queries at my end and use v_recordedBy and v_identifiedBy anyway, so it's unlikely that your changes above will matter to the processing of the Bionomia download format.
That said, we might one day work on an Elasticsearch plugin to properly contend with material in recordedBy or identifiedBy.
All the download formats are adapted and in PROD now.
The issue https://github.com/gbif/pipelines/issues/665 brought some new interpreted fields and changed typeStatus from a string to an array.
Some of the new fields added were used before as strings because they were being carried from the verbatim values. But now they are interpreted fields in the basic record.
You can see the changes done in the avro schemas here.
All the download formats and cloud exports need to be adapted to these changes to either use arrays or convert the arrays into strings.
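As a rough illustration of the "convert the arrays into strings" option, the sketch below flattens list-valued fields into delimited strings before writing a CSV row. The `|` delimiter, the `flatten_record` and `to_csv` helpers, and the sample record are assumptions made for this example; they are not taken from the actual download code.

```python
import csv
import io

# Assumed delimiter for joining array values; the real exports may differ.
ARRAY_DELIMITER = "|"


def flatten_record(record: dict) -> dict:
    """Return a copy of the record with list/tuple values joined into strings."""
    return {
        key: ARRAY_DELIMITER.join(str(v) for v in value)
        if isinstance(value, (list, tuple))
        else value
        for key, value in record.items()
    }


def to_csv(records: list[dict], fieldnames: list[str]) -> str:
    """Write flattened records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(flatten_record(record))
    return buffer.getvalue()


# Hypothetical record in which typeStatus and recordedBy are arrays.
record = {
    "gbifID": "1",
    "typeStatus": ["HOLOTYPE", "PARATYPE"],
    "recordedBy": ["Smith, J."],
}
csv_text = to_csv([record], ["gbifID", "typeStatus", "recordedBy"])
print(csv_text)
```

Formats that support nested types (such as Avro) can keep the arrays as they are; a flattening step like this is only needed for flat, text-based formats.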
The changes for ES search and the DwC and CSV downloads are here, but they should be reviewed too.