geonetwork / core-geonetwork

GeoNetwork is a catalog application to manage spatially referenced resources. It provides powerful metadata editing and search functions as well as an interactive web map viewer. It is currently used in numerous Spatial Data Infrastructure initiatives across the world.
http://geonetwork-opensource.org/
GNU General Public License v2.0

OGC CSW 2.0.2 Harvesting / Performance / Configuration of getRecords-Value #7995

Open rime1014 opened 6 months ago

rime1014 commented 6 months ago

Is your feature request related to a problem? Please describe.

This is a suggestion for improving harvesting performance by configuring the maxRecords value for the getRecords request per harvester.

The impact of the maxRecords value on performance was noticed through the following observation in one harvester.

[!WARNING] After the response size of a CSW harvester was reduced to 10 data records (instead of 20), the harvesting time doubled from 13 hours to 26 hours.

We used the profiling tool VisualVM to analyze which methods require the most time during the harvesting process:

The analysis showed that the align method, including the initialization of the UUIDMapper class, was called twice as often. For every 10 data records (instead of 20), a DB query on the metadata table, filtered by harvester, is executed. With 259,188 metadata records, this corresponds to 25,918 DB queries, which is evident from the number of GeoNetwork warnings

Declared number of returned records (10) does not match requested record count (20)

in the harvester log file.

Before the switch to 10 records, only 12,959 DB queries would have been necessary. Additionally, the local metadata is matched against the remote metadata for every 10 data records: the 10 metadata records of each CSW response are compared to all 259,188 metadata records of the harvester stored in the DB. This matching process is repeated 25,918 times (instead of 12,959 times with 20 metadata records per CSW response). In total about 3.3 billion metadata records were compared during one harvesting process of 259,188 metadata records.
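The scaling described above can be checked with a little arithmetic. The sketch below assumes a simplified cost model in which every getRecords response triggers one DB query that loads all local records of the harvester. The function name `harvest_cost` is made up for illustration, and rounding up the last partial page gives response counts one higher than the figures quoted above.

```python
def harvest_cost(total_records: int, records_per_response: int) -> tuple[int, int]:
    """Return (responses, local_rows_loaded) under the simplified model."""
    # Integer ceiling division: the last page may be only partially filled.
    responses = (total_records + records_per_response - 1) // records_per_response
    # Assumption: each response loads every local record of the harvester once.
    return responses, responses * total_records


if __name__ == "__main__":
    for size in (10, 20):
        responses, rows = harvest_cost(259_188, size)
        print(f"{size} per response: {responses:,} responses, {rows:,} local rows loaded")
```

Under this model, halving the number of records per response doubles both the number of DB queries and the number of local rows loaded.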

The database queries and the matching are a bottleneck because some of the methods involved are time-consuming (e.g. setDateAndTime). In addition, more getRecords queries against the CSW interface are necessary to retrieve all data.

Describe the solution you'd like

Some CSW interfaces might support a maxRecords value higher than 20.

For each response to the getRecords query, the align method is called, which creates a new instance of the UUIDMapper. When the UUIDMapper is instantiated, the findAllSimple method is called, which uses a DB query to determine all metadata records already available in GeoNetwork for the given harvester.

With fewer getRecords queries due to a higher maxRecords value, the align method is called less often and therefore fewer DB queries are required.
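The relationship between page size and DB queries can be illustrated with a small simulation. This is only a sketch of the behavior described in this issue, not GeoNetwork's code: `FakeCatalog`, `find_all_for_harvester`, `harvest` and the `server_cap` parameter are hypothetical names, and the one-lookup-per-response rule is taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class FakeCatalog:
    """Stand-in for the local metadata table; counts full lookups."""
    uuids: set
    queries: int = 0

    def find_all_for_harvester(self) -> set:
        self.queries += 1
        return set(self.uuids)


def harvest(remote: list, catalog: FakeCatalog, max_records: int, server_cap: int) -> int:
    """Page through `remote` and return the number of responses processed."""
    responses = 0
    start = 0
    while start < len(remote):
        # Assumption: the server may return fewer records than requested.
        page = remote[start:start + min(max_records, server_cap)]
        responses += 1
        # One lookup of all local records per response, as described above.
        local = catalog.find_all_for_harvester()
        _already_known = [uuid for uuid in page if uuid in local]
        start += len(page)
    return responses


if __name__ == "__main__":
    remote = [f"uuid-{i}" for i in range(100)]
    for cap in (10, 20, 200):
        catalog = FakeCatalog(set(remote))
        harvest(remote, catalog, max_records=cap, server_cap=cap)
        print(f"page size {cap}: {catalog.queries} DB queries")
```

In this toy model the number of DB queries equals the number of responses, so it falls in proportion as the effective page size grows.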

An additional setting in the harvester settings to set this value per harvester might significantly improve harvesting performance. Default value: 20
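As a rough illustration of such a setting, the sketch below builds a CSW 2.0.2 GetRecords request in key-value form with a configurable maxRecords value. The function name, the example URL and the choice of `elementSetName` are assumptions made for this example. How GeoNetwork's harvester actually builds its requests is not shown in this issue.

```python
from urllib.parse import urlencode

DEFAULT_MAX_RECORDS = 20  # default value proposed in this issue


def get_records_url(base_url: str, start_position: int = 1,
                    max_records: int = DEFAULT_MAX_RECORDS) -> str:
    """Build a GetRecords KVP URL for one page (hypothetical helper)."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "resultType": "results",
        "elementSetName": "summary",  # assumption: any valid element set would do
        "startPosition": start_position,
        "maxRecords": max_records,
    }
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # https://example.org/csw is a placeholder endpoint.
    print(get_records_url("https://example.org/csw"))
    print(get_records_url("https://example.org/csw", start_position=201, max_records=200))
```

A remote server may still cap the number of records it returns, so the harvester would need to advance startPosition by the number of records actually received rather than by the requested value.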

Additional context

Result of VisualVM analysis: [image]

fxprunayre commented 6 months ago

Maybe we could even default to a higher number (e.g. 200) to also reduce HTTP calls. 200 was used in the INSPIRE monitoring exercise in the past and worked fine. Also, to improve performance, we could perhaps use only the GetRecords operation with results, instead of requesting each record with GetRecordById.
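To give a feel for the effect on HTTP traffic, here is a back-of-the-envelope sketch. It assumes one GetRecords request per page plus, optionally, one GetRecordById request per record. The function `http_calls` is hypothetical, and the record count is the one reported earlier in this thread.

```python
def http_calls(total_records: int, max_records: int, fetch_each_by_id: bool) -> int:
    """Estimate HTTP requests for one harvest run under the stated assumptions."""
    # Ceiling division: number of GetRecords pages.
    pages = -(-total_records // max_records)
    # Optionally one GetRecordById call per record on top of the paging calls.
    return pages + (total_records if fetch_each_by_id else 0)


if __name__ == "__main__":
    total = 259_188
    print(http_calls(total, 20, fetch_each_by_id=True))
    print(http_calls(total, 200, fetch_each_by_id=True))
    print(http_calls(total, 200, fetch_each_by_id=False))
```

In this estimate the per-record requests dominate, so dropping them would save far more HTTP calls than raising maxRecords alone.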