datagovuk / ckanext-dgu

CKAN extension for data.gov.uk
http://data.gov.uk/

Duplicate UK records appearing on European INSPIRE geoportal #441

Closed davidread closed 8 years ago

davidread commented 8 years ago

e.g. guid cff79fcb-3ca4-3d84-8c30-a28075371fe5

It appears in three metadata documents that seem to have the same content.

They were returned in batches 6421-6440, 16601-16620, 19601-19620.

They look to have the same content:

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160622-223120/services/1/PullResults/6421-6440/datasets/1/

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160622-223120/services/1/PullResults/16601-16620/datasets/4/

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160622-223120/services/1/PullResults/19601-19620/datasets/1/

davidread commented 8 years ago

Initial findings:

2016-06-28 08:07:54,257 ERROR [ckanext.dgu.lib.reports] data.gov.uk: Duplicate guids 3 e.g. {2a9cfb16-de9a-4ed9-8f97-f2582f7a4485} strategic-flood-risk-assessment-zone-3ai
2016-06-28 08:07:54,285 DEBUG [ckanext.dgu.lib.reports] data.gov.uk: 21793 records, 21790 unique
2016-06-28 08:07:54,601 ERROR [ckanext.dgu.lib.reports] data.gov.uk CSW: Duplicate guids 6 e.g. 7cddc839-8e7c-4fd2-b9b4-85e59f1ae463 producers-leasing-direct-sales-milk-quota-by-county-2000-to-2001
2016-06-28 08:07:54,613 DEBUG [ckanext.dgu.lib.reports] data.gov.uk CSW: 21790 records, 21784 unique
2016-06-28 08:07:54,625 ERROR [ckanext.dgu.lib.reports] OS CSW: No duplicate guids
2016-06-28 08:07:54,629 DEBUG [ckanext.dgu.lib.reports] OS CSW: 21113 records, 21113 unique
2016-06-28 08:07:54,889 ERROR [ckanext.dgu.lib.reports] Europe: Duplicate guids 1 e.g. 141f4451-e028-37b3-908c-b2a531a434c7 bathymetric-survey-2003-10-02-alfred-dock-entrance
2016-06-28 08:07:54,897 DEBUG [ckanext.dgu.lib.reports] Europe: 20924 records, 20923 unique
2016-06-28 08:07:54,965 DEBUG [ckanext.dgu.lib.reports] dgu->dgu_csw: Records reduced 21790->21784
2016-06-28 08:07:55,233 ERROR [ckanext.dgu.lib.reports] dgu->dgu_csw: Records missing 6 e.g. ea528945-a6cc-4a6e-86dc-2d304aa3d950 agricultural-land-classification-detailed-post-1988-survey-alcb09294
2016-06-28 08:07:55,265 DEBUG [ckanext.dgu.lib.reports] dgu_csw->os_csw: Records reduced 21784->21113
2016-06-28 08:07:55,522 ERROR [ckanext.dgu.lib.reports] dgu_csw->os_csw: Records added 5 e.g. CEFAS9fa8b61e-5e5e-486f-9714-9495d3613b10 1986-1986-centre-for-environment-fisheries-aquaculture-science-cefas-survey-ecst-1-86-part-of-i
2016-06-28 08:07:55,786 ERROR [ckanext.dgu.lib.reports] dgu_csw->os_csw: Records missing 676 e.g. 5bb29aeb-77ea-4a5a-902a-733217b1a6fd foot-and-mouth-disease-2001-daily-overview-maps-week-commencing-24-09-2001 (created today)
2016-06-28 08:07:55,834 DEBUG [ckanext.dgu.lib.reports] os_csw->europe: Records reduced 21113->20923
2016-06-28 08:07:56,058 ERROR [ckanext.dgu.lib.reports] os_csw->europe: Records missing 190 e.g. ced1ef64-cc56-4384-a679-23dfb5c10070 allerdale-disabled-facilities-grant-land-charge
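
For readers following the figures above, the checks amount to counting repeated guids within each stage and diffing the guid sets between consecutive stages. A minimal sketch, assuming each stage is available as a plain list of guid strings; this is not the actual ckanext.dgu.lib.reports code, and the function names and sample guids are made up for illustration:

```python
from collections import Counter


def duplicate_guids(guids):
    """Return the guids that occur more than once, with their counts."""
    counts = Counter(guids)
    return {guid: n for guid, n in counts.items() if n > 1}


def compare_stages(upstream, downstream):
    """Compare two stages of the harvest chain by guid.

    Returns (missing, added): guids dropped on the way downstream, and
    guids that appear downstream without being present upstream.
    """
    up, down = set(upstream), set(downstream)
    return sorted(up - down), sorted(down - up)


# Illustrative data only, not real records.
dgu = ["a", "b", "b", "c", "d"]
dgu_csw = ["a", "b", "c", "x"]

dupes = duplicate_guids(dgu)
missing, added = compare_stages(dgu, dgu_csw)

print("%d records, %d unique" % (len(dgu), len(set(dgu))))  # 5 records, 4 unique
print("Duplicate guids %d" % len(dupes))                    # Duplicate guids 1
print("Records missing %d, added %d" % (len(missing), len(added)))
```

Comparing sets rather than lists means a duplicate in one stage is not also reported as a missing or added record in the next.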

davidread commented 8 years ago

OS's harvesting of our CSW seems a little erratic, but it is happening every 1 to 4 days:

co@prod3 ~ () $ zcat /var/log/ckan/ckan-apache.custom.log.* | egrep '^54.78.12.43' | grep GetCapabilities
54.78.12.43 - - [04/Jun/2016:09:00:53 +0100] "GET /csw?service=CSW&version=2.0.0&request=GetCapabilities HTTP/1.1" 200 6503 "-" "Jakarta Commons-HttpClient/3.0.1"
54.78.12.43 - - [04/Jun/2016:18:18:40 +0100] "GET /csw?service=CSW&version=2.0.0&request=GetCapabilities HTTP/1.1" 200 6503 "-" "Jakarta Commons-HttpClient/3.0.1"
54.78.12.43 - - [26/Jun/2016:09:00:52 +0100] "GET /csw?service=CSW&version=2.0.0&request=GetCapabilities HTTP/1.1" 200 6503 "-" "Jakarta Commons-HttpClient/3.0.1"
54.78.12.43 - - [26/Jun/2016:19:26:17 +0100] "GET /csw?service=CSW&version=2.0.0&request=GetCapabilities HTTP/1.1" 200 6503 "-" "Jakarta Commons-HttpClient/3.0.1"
54.78.12.43 - - [20/Jun/2016:18:52:56 +0100] "GET /csw?service=CSW&version=2.0.0&request=GetCapabilities HTTP/1.1" 200 6503 "-" "Jakarta Commons-HttpClient/3.0.1"
54.78.12.43 - - [19/Jun/2016:09:00:55 +0100] "GET /csw?service=CSW&version=2.0.0&request=GetCapabilities HTTP/1.1" 200 6503 "-" "Jakarta Commons-HttpClient/3.0.1"
54.78.12.43 - - [19/Jun/2016:18:53:45 +0100] "GET /csw?service=CSW&version=2.0.0&request=GetCapabilities HTTP/1.1" 200 6503 "-" "Jakarta Commons-HttpClient/3.0.1"
54.78.12.43 - - [16/Jun/2016:18:48:39 +0100] "GET /csw?service=CSW&version=2.0.0&request=GetCapabilities HTTP/1.1" 200 6503 "-" "Jakarta Commons-HttpClient/3.0.1"
54.78.12.43 - - [15/Jun/2016:18:31:14 +0100] "GET /csw?service=CSW&version=2.0.0&request=GetCapabilities HTTP/1.1" 200 6503 "-" "Jakarta Commons-HttpClient/3.0.1"
54.78.12.43 - - [14/Jun/2016:18:25:35 +0100] "GET /csw?service=CSW&version=2.0.0&request=GetCapabilities HTTP/1.1" 200 6503 "-" "Jakarta Commons-HttpClient/3.0.1"
54.78.12.43 - - [12/Jun/2016:09:00:53 +0100] "GET /csw?service=CSW&version=2.0.0&request=GetCapabilities HTTP/1.1" 200 6503 "-" "Jakarta Commons-HttpClient/3.0.1"
54.78.12.43 - - [12/Jun/2016:18:25:11 +0100] "GET /csw?service=CSW&version=2.0.0&request=GetCapabilities HTTP/1.1" 200 6503 "-" "Jakarta Commons-HttpClient/3.0.1"
54.78.12.43 - - [08/Jun/2016:18:21:48 +0100] "GET /csw?service=CSW&version=2.0.0&request=GetCapabilities HTTP/1.1" 200 6503 "-" "Jakarta Commons-HttpClient/3.0.1"
54.78.12.43 - - [07/Jun/2016:18:25:59 +0100] "GET /csw?service=CSW&version=2.0.0&request=GetCapabilities HTTP/1.1" 200 6503 "-" "Jakarta Commons-HttpClient/3.0.1"

It fetches records 20 at a time, and a harvest takes 4 hours.
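
The spacing between harvests can be read off the access log mechanically rather than by eye. A rough sketch, assuming Apache combined-format lines like those above and an English locale for the month abbreviation; the helper names are hypothetical and the sample is three lines taken from the log excerpt:

```python
import re
from datetime import datetime

# client IP, two ignored fields, [timestamp], then "METHOD path
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)')


def harvest_days(lines, ip="54.78.12.43"):
    """Return the sorted dates on which `ip` made a GetCapabilities request."""
    days = set()
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group(1) != ip:
            continue
        if "request=GetCapabilities" not in m.group(4):
            continue
        # Timestamp looks like 04/Jun/2016:09:00:53 +0100
        stamp = datetime.strptime(m.group(2), "%d/%b/%Y:%H:%M:%S %z")
        days.add(stamp.date())
    return sorted(days)


def gaps_in_days(days):
    """Days elapsed between consecutive harvest dates."""
    return [(b - a).days for a, b in zip(days, days[1:])]


TAIL = ' HTTP/1.1" 200 6503 "-" "Jakarta Commons-HttpClient/3.0.1"'
REQ = '"GET /csw?service=CSW&version=2.0.0&request=GetCapabilities'
sample = [
    "54.78.12.43 - - [04/Jun/2016:09:00:53 +0100] " + REQ + TAIL,
    "54.78.12.43 - - [08/Jun/2016:18:21:48 +0100] " + REQ + TAIL,
    "54.78.12.43 - - [07/Jun/2016:18:25:59 +0100] " + REQ + TAIL,
]

print(gaps_in_days(harvest_days(sample)))  # [3, 1]
```

Dates are collected into a set and sorted because the zcat output is not in chronological order and some days have two requests.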

davidread commented 8 years ago

Measurement shows the rate is of the order of 1 duplicate record per day.

After an exchange, the conclusion is that the version of GeoNetwork in use is a bit old and occasionally repeats IDs in responses to GetRecords requests.

We said we'd move to PyCSW in the next few months.