Open rime1014 opened 6 months ago
Maybe we could even default to a higher number (e.g. 200) to also reduce HTTP calls. 200 was used in the INSPIRE monitoring exercise in the past and worked fine. Also, to improve performance, we could maybe use only the GetRecords operation with results instead of requesting each record with GetRecordById.
Is your feature request related to a problem? Please describe.
This is a suggestion for improving harvesting performance by configuring the `maxRecords` value for the `getRecords` request per harvester.

An impact of the `getRecord` value on performance was noticed by the following observation in a harvester. We used the profiling tool VisualVM to analyze which methods require the most time during the harvesting process:
The analysis showed that the `align` method, with the initialization of the `UUIDMapper` class, was called twice as often. Therefore, for every 10 data records (instead of 20), a DB query on the metadata table, filtered to the records of the harvester, is executed. With 259,188 metadata records, this corresponds to 25,918 DB queries, which is evident from the number of geonetwork warnings in the harvester log file.
Before the switch to 10 records, only 12,959 DB queries would have been necessary. Additionally, a matching of the local metadata with the remote metadata is performed for every 10 data records. Therefore, 10 metadata records of the CSW response are compared to all 259,188 metadata records of the harvester stored in the DB. This matching process is repeated 25,918 times (instead of 12,959 times with 20 metadata records within the CSW response). In total about 3.3 billion metadata records were compared during one harvesting process of 259,188 metadata records.
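The arithmetic above can be reproduced with a few lines of Python. This is only a rough sketch: it assumes, as described in this issue, one `align` call and one DB query per `getRecords` response, and the function name `align_calls` is made up for illustration. It rounds up to count the final partial page, so its results are one higher than the figures quoted above. The value 200 is the one suggested in the comment.

```python
import math

TOTAL_RECORDS = 259_188  # harvested record count quoted in this issue


def align_calls(total_records: int, max_records: int) -> int:
    """Number of getRecords responses (and so align() calls) for one run.

    Assumes one align() call and one DB query per response, rounding up
    so that the last, partially filled page is counted as well.
    """
    return math.ceil(total_records / max_records)


for max_records in (10, 20, 200):
    calls = align_calls(TOTAL_RECORDS, max_records)
    print(f"maxRecords={max_records:>3}: {calls:>6,} responses / DB queries")
```

Under these assumptions, going from 10 to 20 records per response halves the number of queries, and 200 records per response would cut it by roughly another factor of ten.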
The database queries and matching represent a bottleneck due to partially time-consuming methods (`setDateAndTime`). In addition, more `getRecords` queries against the CSW interface are necessary to retrieve all data.

Describe the solution you'd like
CSW interfaces might support a higher response value than 20 for `maxRecords`.

For each response to the `getRecords` query, the `align` method is called, which creates a new instance of the `UUIDMapper`. When the `UUIDMapper` is instantiated, the `findAllSimple` method is called, which determines all metadata records already available in the GN for the given harvester with a DB query.

With fewer `getRecords` queries due to a higher `maxRecords` value, the `align` method is called less often and therefore fewer DB queries are required.

An additional setting in the harvester settings to set this value per harvester might significantly improve harvesting performance. Default value: 20
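To illustrate how `maxRecords` drives the number of requests, here is a minimal Python sketch that only builds paged CSW 2.0.2 GetRecords URLs (KVP encoding) and sends nothing. It is not how GeoNetwork builds its requests; the endpoint URL, the function name `get_records_urls` and the choice of `csw:Record` with `elementSetName=full` are assumptions for the example.

```python
from urllib.parse import urlencode


def get_records_urls(endpoint: str, total_records: int, max_records: int = 20):
    """Yield one GetRecords URL per page of results.

    `endpoint` is a placeholder for a CSW base URL. `max_records` stands in
    for the per-harvester setting proposed in this issue (default 20).
    """
    # CSW startPosition is 1-based, so pages start at 1, 1 + max_records, ...
    for start in range(1, total_records + 1, max_records):
        params = {
            "service": "CSW",
            "version": "2.0.2",
            "request": "GetRecords",
            "typeNames": "csw:Record",
            "resultType": "results",
            "elementSetName": "full",
            "startPosition": start,
            "maxRecords": max_records,
        }
        yield f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint, 45 records: three requests at the default of 20.
for url in get_records_urls("https://example.org/csw", 45):
    print(url)
```

With `max_records=200` the same 45 records would fit into a single request, which is the effect the proposed setting is meant to achieve, provided the remote CSW accepts that page size.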
Additional context
Result of the VisualVM analysis: