Adapt all download formats and exports to use the newly added multivalue fields in pipelines

gbif / occurrence

Occurrence store, download, search

Apache License 2.0

Adapt all download formats and exports to use the newly added multivalue fields in pipelines #283

Closed marcos-lg closed 2 years ago

marcos-lg commented 2 years ago

The issue https://github.com/gbif/pipelines/issues/665 introduced some new interpreted fields and changed typeStatus from a string to an array.

Some of the new fields were previously exposed as strings because they were carried over from the verbatim values, but now they are interpreted fields in the basic record.

You can see the changes made to the Avro schemas here.

All the download formats and cloud exports need to be adapted to these changes, either to use arrays or to convert the arrays into strings.

The changes for ES search and for the DwC and CSV downloads are here, but they should be reviewed too.
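To illustrate the "convert the arrays into strings" option for flat formats such as CSV, here is a minimal Python sketch. It is not the code used in the occurrence project: the `join_multivalue` helper, the `|` delimiter, and the sample record are all assumptions made for illustration, and the thread does not say which delimiter the downloads actually use.

```python
from typing import Optional, Sequence

# Assumed delimiter for illustration only; the thread does not name the real one.
DELIMITER = "|"


def join_multivalue(values: Optional[Sequence[str]], delimiter: str = DELIMITER) -> str:
    """Flatten an array-valued field into one string for a flat format such as CSV."""
    if not values:
        return ""
    # Skip null and blank entries so the output has no empty segments.
    cleaned = [v.strip() for v in values if v is not None and v.strip()]
    return delimiter.join(cleaned)


# Hypothetical interpreted record with one array field and one missing field.
record = {"typeStatus": ["HOLOTYPE", "PARATYPE"], "recordedBy": None}
row = {field: join_multivalue(value) for field, value in record.items()}
print(row)
```

Formats that support nested types, such as Avro, can keep the array as it is, so a helper like this would only be needed for flat outputs.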

marcos-lg commented 2 years ago

@dshorthouse we are changing some fields to be arrays instead of strings (see above) and some of these fields are included in the bionomia downloads. I changed them to be arrays too, you can see the changes here.

Is this OK with you? You can also test it in UAT if you want. It's not in production yet.

dshorthouse commented 2 years ago

Thanks, @marcos-lg. I'm not sure what the implications are here, but it sounds like you have introduced a mechanism to explode a string into an array for recordedBy and identifiedBy, and that these will be expressed as arrays in the Avro exports. Correct? If so, this will be a severely breaking change for the Bionomia download format, which expects these to be verbatim strings, unless the arrays can be concatenated to be precisely the same as the string sent by the publisher. Instead of making use of these arrays, I'd be far more comfortable using the verbatim fields. Exploding recordedBy or identifiedBy into an array is more complicated than it is for other fields in DwC. See https://github.com/bionomia/dwc_agent/blob/master/lib/dwc_agent/constants.rb#L130.
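The concern about concatenation can be made concrete with a small round-trip check. The sketch below is illustrative only: the separator pattern is a deliberately naive assumption, not the rule set used by pipelines or by dwc_agent, and the names `explode` and `round_trips` are hypothetical. It shows that once a verbatim string has been split, re-joining it with a single fixed delimiter only reproduces the publisher's string when the publisher happened to use that same delimiter.

```python
import re
from typing import List

# Hypothetical, deliberately naive separator pattern; real agent parsing
# has to handle many more cases than these three characters.
NAIVE_SEPARATORS = re.compile(r"\s*(?:;|\||&)\s*")


def explode(verbatim: str) -> List[str]:
    """Split a verbatim agent string into a list of non-empty parts."""
    return [part for part in NAIVE_SEPARATORS.split(verbatim) if part]


def round_trips(verbatim: str, delimiter: str = "|") -> bool:
    """Return True if exploding and re-joining reproduces the verbatim string exactly."""
    return delimiter.join(explode(verbatim)) == verbatim


print(round_trips("A. Smith|B. Jones"))    # same delimiter as the join
print(round_trips("A. Smith & B. Jones"))  # separator and spacing are lost
```

Because the original separator and spacing are discarded by the split, a consumer that needs the exact publisher string has to read the verbatim field rather than rebuild it from the array.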

marcos-lg commented 2 years ago

Yes, @dshorthouse. We are now interpreting those fields, and we converted them into arrays because they sometimes contain more than one value. This way we can improve the search in our portal and in downloads.

But that's OK, I'll change the Bionomia download to use the verbatim fields for recordedBy and identifiedBy. This way you shouldn't notice any difference.

dshorthouse commented 2 years ago

I just took a closer look at how @MattBlissett had made the queries at https://github.com/gbif/occurrence/blob/dev/occurrence-download/src/main/resources/download-workflow/bionomia/hive-scripts/execute-bionomia-query.q#L89 and it looks like he's using v_recordedBy and v_identifiedBy (the verbatim equivalents), so I'm not sure the above changes will affect the Bionomia download at all.

marcos-lg commented 2 years ago

Right. Then we just need to remove the interpreted recordedBy and identifiedBy fields and keep only the verbatim ones. Until now the verbatim and the interpreted fields held the same values, so it seems they were redundant.
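As a rough sketch of that change, the snippet below drops the two interpreted columns from a row and keeps the verbatim ones. The column names v_recordedBy and v_identifiedBy come from the query linked above; the `keep_verbatim_only` helper, the gbifID column, and the sample values are assumptions for illustration, and the real export is defined in a Hive script rather than in Python.

```python
from typing import Any, Dict

# Interpreted columns that would be removed from the Bionomia export.
INTERPRETED_TO_DROP = {"recordedBy", "identifiedBy"}


def keep_verbatim_only(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row without the interpreted agent columns."""
    return {name: value for name, value in row.items() if name not in INTERPRETED_TO_DROP}


# Hypothetical row: interpreted fields are arrays, verbatim fields are strings.
row = {
    "gbifID": "1",
    "recordedBy": ["A. Smith", "B. Jones"],
    "v_recordedBy": "A. Smith & B. Jones",
    "identifiedBy": ["C. Doe"],
    "v_identifiedBy": "C. Doe",
}
print(keep_verbatim_only(row))
```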

dshorthouse commented 2 years ago

Aha - I drop those two columns in the spark queries at my end and use v_recordedBy and v_identifiedBy anyway so it's unlikely that your changes above will matter to the processing of the Bionomia download format.

That said, we might one day work on an Elasticsearch plugin to properly contend with material in recordedBy or identifiedBy.

marcos-lg commented 2 years ago

All the download formats have been adapted and are in PROD now.