geonetwork / core-geonetwork

GeoNetwork is a catalog application to manage spatially referenced resources. It provides powerful metadata editing and search functions as well as an interactive web map viewer. It is currently used in numerous Spatial Data Infrastructure initiatives across the world.
http://geonetwork-opensource.org/
GNU General Public License v2.0
ODS harvesting via simple URL not working #6962

Closed tkohr closed 1 year ago

tkohr commented 1 year ago

Describe the bug: Harvesting the following ODS catalog via the simple URL harvester (which works on version 4.2.2) no longer seems to work. I suspect this is related to the change whereby the recordIdPath input now expects a path such as /datasets/datasetid (from the document root?). Or am I just specifying the wrong path? In version 4.2.2, only the property key datasetid is given here.

To Reproduce Steps to reproduce the behavior:

  1. In the admin UI, go to the harvester settings
  2. Add a new harvester of type simple URL with the following parameters
  3. Save and harvest

Expected behavior Harvest ~208 records from the catalog.

Log file harvester_simpleUrl_MEL_ODS_GN_main_202303301528.log

jahow commented 1 year ago

Related to the breaking change introduced in https://github.com/geonetwork/core-geonetwork/pull/6677

fxprunayre commented 1 year ago

I'm definitely not an ODS expert, but it looks like the dataset ID path differs depending on whether you request API version 1 or version 2.

In the PR you're pointing at, ODS API v2 support was added (https://github.com/geonetwork/core-geonetwork/pull/6677/commits/a3db440527f72767d02db326e48cec2324e55d78), and it should have preserved compatibility with the version 1 API.

So before 4.2.3, harvesting from an ODS API v2 endpoint did not work.

Running on main the following harvester config

{"@id":"228","@type":"simpleurl","owner":["1"],"ownerGroup":["2"],"ownerUser":["undefined"],"site":{"name":"6962","uuid":"d3e54543-097d-4bb8-bfe6-0fa9c04bb73d","account":{"use":false,"username":[],"password":[]},"url":"https://opendata.lillemetropole.fr/api/datasets/1.0/search?refine.publisher=M%C3%A9tropole+Europ%C3%A9enne+de+Lille&start=0&rows=20","icon":"blank.png","loopElement":"/datasets","numberOfRecordPath":"/nhits","recordIdPath":"/datasetid","pageSizeParam":"rows","pageFromParam":"start","toISOConversion":"schema:iso19115-3.2018:convert/fromJsonOpenDataSoft"},"content":{"validate":"NOVALIDATION","importxslt":"none","batchEdits":"[]"},"options":{"every":"0 0 0 ? * *","oneRunOnly":false,"overrideUuid":"SKIP","status":"active"},"privileges":[{"@id":"1","operation":[{"@name":"view"},{"@name":"dynamic"},{"@name":"download"}]}],"ifRecordExistAppendPrivileges":false,"info":{"lastRun":"2023-05-05T05:25:19.923Z","running":false,"result":{"added":"224","atomicDatasetRecords":"0","badFormat":"0","collectionDatasetRecords":"0","datasetUuidExist":"0","privilegesAppendedOnExistingRecord":"0","doesNotValidate":"0","xpathFilterExcluded":"0","duplicatedResource":"0","fragmentsMatched":"0","fragmentsReturned":"0","fragmentsUnknownSchema":"0","incompatible":"0","recordsBuilt":"0","recordsUpdated":"0","removed":"0","serviceRecords":"0","subtemplatesAdded":"0","subtemplatesRemoved":"0","subtemplatesUpdated":"0","total":"224","unchanged":"0","unknownSchema":"0","unretrievable":"0","updated":"0","thumbnails":"0","thumbnailsFailed":"0"}}}

for v1 API collects 224 records.

and running

{"@id":"373","@type":"simpleurl","owner":["1"],"ownerGroup":["2"],"ownerUser":["undefined"],"site":{"name":"6962 v2","uuid":"cc6c2ae1-34a8-4ac6-bd19-8df33098f61b","account":{"use":false,"username":[],"password":[]},"url":"https://opendata.lillemetropole.fr/api/explore/v2.0/catalog/datasets?rows=100","icon":"blank.png","loopElement":"/datasets","numberOfRecordPath":"/nhits","recordIdPath":"/dataset/dataset_id","pageSizeParam":"rows","pageFromParam":"start","toISOConversion":"schema:iso19115-3.2018:convert/fromJsonOpenDataSoft"},"content":{"validate":"NOVALIDATION","importxslt":"none","batchEdits":"[]"},"options":{"every":"0 0 0 ? * *","oneRunOnly":false,"overrideUuid":"SKIP","status":"active"},"privileges":[{"@id":"1","operation":[{"@name":"view"},{"@name":"dynamic"},{"@name":"download"}]}],"ifRecordExistAppendPrivileges":false,"info":{"lastRun":"2023-05-05T05:46:25.882Z","running":false,"result":{"added":"10","atomicDatasetRecords":"0","badFormat":"0","collectionDatasetRecords":"0","datasetUuidExist":"0","privilegesAppendedOnExistingRecord":"0","doesNotValidate":"0","xpathFilterExcluded":"0","duplicatedResource":"0","fragmentsMatched":"0","fragmentsReturned":"0","fragmentsUnknownSchema":"0","incompatible":"0","recordsBuilt":"0","recordsUpdated":"0","removed":"1","serviceRecords":"0","subtemplatesAdded":"0","subtemplatesRemoved":"0","subtemplatesUpdated":"0","total":"10","unchanged":"0","unknownSchema":"0","unretrievable":"0","updated":"0","thumbnails":"0","thumbnailsFailed":"0"}}}

collects 100 records.

So this seems fine to me, no?

fxprunayre commented 1 year ago

So your issue was related to

String uuid = this.extractUuidFromIdentifier(record.get(params.recordIdPath).asText());

which only works if the property you need is a direct property of the loopElement node. That is not the case for every JSON source, and in particular not for ODS API v2. So it was indeed changed to

String uuid = this.extractUuidFromIdentifier(record.at(params.recordIdPath).asText());

This explains why your 4.2.2 config stopped working in 4.2.3. By the way, the harvester log reports a fairly clear error:

2023-05-05T13:42:40,976 ERROR [geonetwork.harvester] -
 Failed to collect record UUID at path datasetid. 
 Error is: Invalid input:
 JSON Pointer expression must start with '/': "datasetid"
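
To make the difference between the two lookups concrete, here is a minimal sketch using only the Java standard library. It is a simplified stand-in for Jackson's `get` and `at`, working on nested Maps rather than a real JSON tree; the class name, the helper methods, and the abridged v1 and v2 record shapes are assumptions for illustration, not GeoNetwork code.

```java
import java.util.Map;

// Simplified stand-in for Jackson's JsonNode lookups, using nested Maps
// instead of a real JSON tree. All names here are illustrative only.
class RecordIdLookupSketch {

    // Analogue of record.get(name): only sees direct properties of the record.
    static Object get(Map<String, Object> record, String name) {
        return record.get(name);
    }

    // Analogue of record.at(pointer): walks a JSON Pointer such as
    // "/dataset/dataset_id" one segment at a time from the record root.
    // This sketch returns null where Jackson would return a "missing" node.
    static Object at(Map<String, Object> record, String pointer) {
        if (pointer.isEmpty()) {
            return record; // the empty pointer refers to the whole record
        }
        if (!pointer.startsWith("/")) {
            throw new IllegalArgumentException(
                "JSON Pointer expression must start with '/': \"" + pointer + "\"");
        }
        Object current = record;
        for (String segment : pointer.substring(1).split("/", -1)) {
            if (!(current instanceof Map)) {
                return null; // cannot descend into a non-object value
            }
            // Undo RFC 6901 escaping: "~1" means "/", "~0" means "~".
            String key = segment.replace("~1", "/").replace("~0", "~");
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        // Assumed, abridged shape of one v1 record and one v2 record.
        Map<String, Object> v1 = Map.of("datasetid", "abc");
        Map<String, Object> v2 = Map.of("dataset", Map.of("dataset_id", "abc"));

        System.out.println(get(v1, "datasetid"));           // direct property lookup
        System.out.println(at(v1, "/datasetid"));           // same value via a pointer
        System.out.println(get(v2, "/dataset/dataset_id")); // null: not a direct property
        System.out.println(at(v2, "/dataset/dataset_id"));  // nested value via a pointer
    }
}
```

Calling `at(v1, "datasetid")` in this sketch throws an `IllegalArgumentException`, which mirrors the error in the log above: a bare key is no longer accepted once the value is treated as a pointer.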

tkohr commented 1 year ago

Thanks for looking into this @fxprunayre. Indeed, in the end, it's just the missing leading / that breaks the ODS config when moving from GN 4.2.2 to later versions.

I hadn't noticed that the mentioned PR targets ODS v2, which has a different hierarchy and different keys than v1 (datasetid in v1, dataset_id in v2). That obscured the problem a little, despite the rather clear error message.
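
For anyone carrying an old harvester config forward, the fix can be sketched as a small normalization step. This is a hypothetical helper, not something GeoNetwork provides: it assumes that a recordIdPath without a leading / is a plain property key from a pre-4.2.3 config, and turns it into the equivalent JSON Pointer.

```java
// Hypothetical migration helper, not part of GeoNetwork.
class RecordIdPathMigration {

    // Turns a pre-4.2.3 style property key (e.g. "datasetid") into a
    // JSON Pointer (e.g. "/datasetid"). Values that already start with '/'
    // are assumed to be pointers and are returned unchanged.
    static String toJsonPointer(String recordIdPath) {
        if (recordIdPath == null || recordIdPath.isEmpty() || recordIdPath.startsWith("/")) {
            return recordIdPath;
        }
        // Escape per RFC 6901 ("~" first, then "/") so that a key containing
        // these characters remains a single pointer segment.
        String escaped = recordIdPath.replace("~", "~0").replace("/", "~1");
        return "/" + escaped;
    }

    public static void main(String[] args) {
        System.out.println(toJsonPointer("datasetid"));           // /datasetid
        System.out.println(toJsonPointer("/dataset/dataset_id")); // /dataset/dataset_id
    }
}
```

In practice, editing the recordIdPath field by hand in the harvester settings achieves the same result; the sketch only spells out the rule.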

jahow commented 1 year ago

Just to clarify, this had nothing to do with ODS API v2 (which we don't use). It was an error on our side; it does work with the new format for the record ID pointer.

Thanks @fxprunayre

tkohr commented 1 year ago

FYI, I opened https://github.com/geonetwork/doc/pull/240 regarding this.