GSA / datagov-ckan-multi

DOI Harvest Source Not Harvesting Properly #472

Closed thejuliekramer closed 3 years ago

thejuliekramer commented 3 years ago

How to reproduce

  1. Visit harvest source page in staging - observe number of failing datasets
Screen Shot 2020-10-02 at 10 49 48 AM

Expected behavior

Dataset & Error count should be comparable to production catalog:

Screen Shot 2020-10-02 at 10 49 55 AM

Actual behavior

Error count is much higher and dataset count is much lower than expected:

Screen Shot 2020-10-02 at 10 49 48 AM
avdata99 commented 3 years ago

Error reproduced locally

2020-10-02 16:05:21,355 INFO  [ckanext.harvest.queue] Received harvest object id: d8f49f28-5068-44ab-bdc4-2119f7624602
2020-10-02 16:05:21,384 DEBUG [ckanext.harvest.harvesters.ckanharvester] In CKANHarvester import_stage
2020-10-02 16:05:21,384 DEBUG [ckanext.harvest.harvesters.ckanharvester] Using config: {u'private_datasets': u'False'}
2020-10-02 16:05:21,476 INFO  [ckanext.harvest.harvesters.base] Package with GUID 14a9bcd1-d610-472a-9133-58928ab5a659 does not exist, let's create it
2020-10-02 16:05:21,491 INFO  [ckanext.geodatagov.logic] chained package_create 2.8.4 {'ignore_auth': True, 'session': <sqlalchemy.orm.scoping.scoped_session object at 0x7f055ea06690>, 'user': u'default', '__auth_audit': [('package_create', 139660928259888)], 'model': <module 'ckan.model' from '/srv/app/src/ckan/ckan/model/__init__.pyc'>, 'api_version': 2, 'schema': {'__before': [<function duplicate_extras_key at 0x7f055dc226e0>, <function ignore at 0x7f055b3981b8>], 'maintainer': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], '__extras': [<function ignore at 0x7f055b3981b8>], 'relationships_as_object': {'comment': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'object': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'state': [<function ignore at 0x7f055b3981b8>], 'type': [<function not_empty at 0x7f055b37bf50>, <OneOf object 4524 list=[u'depends_on', u'dependency_of', u'derives_from', u'has_derivation', u'links_to', u'linked_from', u'child_of', u'parent_of']>], 'id': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'subject': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>]}, 'tag_string': [<function ignore_missing at 0x7f055b3982a8>, <function tag_string_convert at 0x7f055dc22938>], 'private': [<function ignore_missing at 0x7f055b3982a8>, <function boolean_validator at 0x7f055deb3e60>, <function datasets_with_no_organization_cannot_be_private at 0x7f055dc232a8>], 'maintainer_email': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>, <function email_validator at 0x7f055dc236e0>], '__junk': [<function ignore at 0x7f055b3981b8>], 'id': [<function ignore_missing at 0x7f055b3982a8>, <type 'unicode'>], 'owner_org': [<function owner_org_validator at 0x7f055deb3c08>, <function unicode_safe at 0x7f055b398488>], 'title': 
[<function callable at 0x7f05574c1b18>, <function unicode_safe at 0x7f055b398488>], 'author_email': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>, <function email_validator at 0x7f055dc236e0>], 'state': [<function ignore_not_package_admin at 0x7f055dc22a28>, <function ignore_missing at 0x7f055b3982a8>], 'version': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>, <function package_version_validator at 0x7f055dc22668>], 'license_id': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'save': [<function ignore at 0x7f055b3981b8>], 'type': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'resources': {'__extras': [<function ignore_missing at 0x7f055b3982a8>, <function extras_unicode_convert at 0x7f055b2f0938>, <function keep_extras at 0x7f055b37be60>], 'package_id': [<function ignore at 0x7f055b3981b8>], 'datastore_active': [<function ignore_missing at 0x7f055b3982a8>], 'id': [<function ignore_empty at 0x7f055b398320>, <function unicode_safe at 0x7f055b398488>], 'size': [<function ignore_missing at 0x7f055b3982a8>, <function int_validator at 0x7f055deb3cf8>], 'cache_last_updated': [<function ignore_missing at 0x7f055b3982a8>, <function isodate at 0x7f055deb3ed8>], 'state': [<function ignore at 0x7f055b3981b8>], 'hash': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'description': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'format': [<function if_empty_guess_format at 0x7f055dc23398>, <function ignore_missing at 0x7f055b3982a8>, <function clean_format at 0x7f055dc23410>, <function unicode_safe at 0x7f055b398488>], 'tracking_summary': [<function ignore_missing at 0x7f055b3982a8>], 'last_modified': [<function ignore_missing at 0x7f055b3982a8>, <function isodate at 0x7f055deb3ed8>], 'url_type': [<function ignore_missing at 
0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'mimetype': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'cache_url': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'name': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'created': [<function ignore_missing at 0x7f055b3982a8>, <function isodate at 0x7f055deb3ed8>], 'url': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>, <function remove_whitespace at 0x7f055b2f0d70>], 'mimetype_inner': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'position': [<function ignore at 0x7f055b3981b8>], 'revision_id': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'resource_type': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>]}, 'tags': {'revision_timestamp': [<function ignore at 0x7f055b3981b8>], 'vocabulary_id': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>, <function vocabulary_id_exists at 0x7f055dc22f50>], 'state': [<function ignore at 0x7f055b3981b8>], 'display_name': [<function ignore at 0x7f055b3981b8>], 'name': [<function not_missing at 0x7f055b37bed8>, <function not_empty at 0x7f055b37bf50>, <function unicode_safe at 0x7f055b398488>, <function tag_length_validator at 0x7f055dc227d0>, <function tag_name_validator at 0x7f055dc22848>]}, 'groups': {'__extras': [<function ignore at 0x7f055b3981b8>], 'title': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'id': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'name': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>]}, 'log_message': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 
0x7f055b398488>, <function no_http at 0x7f055deb3f50>], 'return_to': [<function ignore at 0x7f055b3981b8>], 'relationships_as_subject': {'comment': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'object': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'state': [<function ignore at 0x7f055b3981b8>], 'type': [<function not_empty at 0x7f055b37bf50>, <OneOf object 4525 list=[u'depends_on', u'dependency_of', u'derives_from', u'has_derivation', u'links_to', u'linked_from', u'child_of', u'parent_of']>], 'id': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'subject': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>]}, 'name': [<function not_empty at 0x7f055b37bf50>, <function unicode_safe at 0x7f055b398488>, <function name_validator at 0x7f055dc22578>, <function package_name_validator at 0x7f055dc225f0>], 'url': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'notes': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'author': [<function ignore_missing at 0x7f055b3982a8>, <function unicode_safe at 0x7f055b398488>], 'extras': {'__extras': [<function ignore at 0x7f055b3981b8>], 'deleted': [<function ignore_missing at 0x7f055b3982a8>], 'value': [<function not_missing at 0x7f055b37bed8>], 'revision_timestamp': [<function ignore at 0x7f055b3981b8>], 'state': [<function ignore at 0x7f055b3981b8>], 'key': [<function not_empty at 0x7f055b37bf50>, <function extra_key_not_in_root_schema at 0x7f055dc235f0>, <function unicode_safe at 0x7f055b398488>], 'id': [<function ignore at 0x7f055b3981b8>]}, 'revision_id': [<function ignore at 0x7f055b3981b8>]}} {u'license_title': None, u'maintainer': None, u'relationships_as_object': [], u'private': False, u'maintainer_email': None, u'num_tags': 0, u'id': u'14a9bcd1-d610-472a-9133-58928ab5a659', 
u'metadata_created': u'2020-08-16T15:55:49.213087', u'metadata_modified': u'2020-08-18T18:46:52.608587', u'author': None, u'author_email': None, u'state': u'active', u'version': None, u'creator_user_id': u'c2d33090-9c8f-4a64-9b09-27cc8fb68d62', u'type': u'dataset', u'resources': [{u'cache_last_updated': None, u'package_id': u'14a9bcd1-d610-472a-9133-58928ab5a659', u'webstore_last_updated': None, u'id': u'4e13fe61-a50f-42db-bd3c-4ddb3421d809', u'size': None, u'state': u'active', u'resource_locator_function': u'', u'hash': u'', u'description': u'The .zip file for the TIFF image includes the image (.tif), the world registration file (.tfw), other associated files, and the metadata for the Backscatter data collected by Fugro Pelagos, Inc. in the Offshore of Gaviota map area, California.', u'format': u'ZIP', u'tracking_summary': {u'total': 0, u'recent': 0}, u'mimetype_inner': None, u'resource_locator_protocol': u'', u'mimetype': None, u'cache_url': None, u'name': u'TIFF', u'created': u'2020-08-18T18:46:54.413007', u'url': u'https://www.sciencebase.gov/catalog/file/get/589b8970e4b0efcedb72d4d2?facet=Backscatter_[Fugro]_OffshoreGaviota.zip', u'webstore_url': None, u'last_modified': None, u'position': 0, u'resource_type': None}, {u'cache_last_updated': None, u'package_id': u'14a9bcd1-d610-472a-9133-58928ab5a659', u'webstore_last_updated': None, u'id': u'672e5b87-5c90-43b6-aa3e-3c1dc02f3e74', u'size': None, u'state': u'active', u'resource_locator_function': u'', u'hash': u'', u'description': u'The .zip file for the TIFF image includes the image (.tif), the world registration file (.tfw), other associated files, and the metadata for the Backscatter data collected by Fugro Pelagos, Inc. 
in the Offshore of Gaviota map area, California.', u'format': u'TIFF', u'tracking_summary': {u'total': 0, u'recent': 0}, u'mimetype_inner': None, u'resource_locator_protocol': u'', u'mimetype': None, u'cache_url': None, u'name': u'TIFF', u'created': u'2020-08-18T18:46:54.413049', u'url': u'https://doi.org/10.5066/F7TH8JWJ', u'webstore_url': None, u'last_modified': None, u'position': 1, u'resource_type': None}, {u'cache_last_updated': None, u'package_id': u'14a9bcd1-d610-472a-9133-58928ab5a659', u'webstore_last_updated': None, u'id': u'2256931a-1be3-432a-87a7-3d67e3dba88f', u'size': None, u'state': u'active', u'resource_locator_function': u'', u'hash': u'', u'description': u'The .zip file for the TIFF image includes the image (.tif), the world registration file (.tfw), other associated files, and the metadata for the Backscatter data collected by Fugro Pelagos, Inc. in the Offshore of Gaviota map area, California.', u'format': u'TIFF', u'tracking_summary': {u'total': 0, u'recent': 0}, u'mimetype_inner': None, u'resource_locator_protocol': u'', u'mimetype': None, u'cache_url': None, u'name': u'TIFF', u'created': u'2020-08-18T18:46:54.413056', u'url': u'http://pubs.usgs.gov/ds/781/', u'webstore_url': None, u'last_modified': None, u'position': 2, u'resource_type': None}, {u'cache_last_updated': None, u'package_id': u'14a9bcd1-d610-472a-9133-58928ab5a659', u'webstore_last_updated': None, u'id': u'110bcfd3-d0d2-43e7-99a7-06de2cc22ccd', u'size': None, u'state': u'active', u'resource_locator_function': u'', u'hash': u'', u'description': u'', u'format': u'', u'tracking_summary': {u'total': 0, u'recent': 0}, u'mimetype_inner': None, u'resource_locator_protocol': u'', u'mimetype': None, u'cache_url': None, u'name': u'Web Resource', u'created': u'2020-08-18T18:46:54.413061', u'url': u'https://doi.org/10.5066/F7TH8JWJ', u'webstore_url': None, u'last_modified': None, u'position': 3, u'no_real_name': u'True', u'resource_type': None}, {u'cache_last_updated': None, u'package_id': 
u'14a9bcd1-d610-472a-9133-58928ab5a659', u'webstore_last_updated': None, u'id': u'8c1682a1-4902-46c7-90de-c2f9ead1ba8e', u'size': None, u'state': u'active', u'resource_locator_function': u'', u'hash': u'', u'description': u'', u'format': u'', u'tracking_summary': {u'total': 0, u'recent': 0}, u'mimetype_inner': None, u'resource_locator_protocol': u'', u'mimetype': None, u'cache_url': None, u'name': u'Web Resource', u'created': u'2020-08-18T18:46:54.413066', u'url': u'http://pubs.usgs.gov/ds/781/', u'webstore_url': None, u'last_modified': None, u'position': 4, u'no_real_name': u'True', u'resource_type': None}, {u'cache_last_updated': None, u'package_id': u'14a9bcd1-d610-472a-9133-58928ab5a659', u'webstore_last_updated': None, u'id': u'e107488b-fa41-4d16-aa79-f76154fad7d7', u'size': None, u'state': u'active', u'resource_locator_function': u'', u'hash': u'', u'description': u'', u'format': u'Esri REST', u'tracking_summary': {u'total': 0, u'recent': 0}, u'mimetype_inner': None, u'resource_locator_protocol': u'', u'mimetype': None, u'cache_url': None, u'name': u'Esri REST API Endpoint', u'created': u'2020-08-18T18:46:54.413070', u'url': u'https://www.sciencebase.gov/arcgis/rest/services/Catalog/58a34164e4b0c82512869be3/MapServer', u'webstore_url': None, u'last_modified': None, u'position': 5, u'no_real_name': u'True', u'resource_type': None}, {u'cache_last_updated': None, u'package_id': u'14a9bcd1-d610-472a-9133-58928ab5a659', u'webstore_last_updated': None, u'id': u'6c01e6e5-487b-42ae-8009-53261330177c', u'size': None, u'state': u'active', u'resource_locator_function': u'', u'hash': u'', u'description': u'', u'format': u'WMS', u'tracking_summary': {u'total': 0, u'recent': 0}, u'mimetype_inner': None, u'resource_locator_protocol': u'', u'mimetype': None, u'cache_url': None, u'name': u'ArcGIS Web Mapping Service', u'created': u'2020-08-18T18:46:54.413075', u'url': 
u'https://www.sciencebase.gov/arcgis/services/Catalog/58a34164e4b0c82512869be3/MapServer/WMSServer?request=GetCapabilities&service=WMS', u'webstore_url': None, u'last_modified': None, u'position': 6, u'no_real_name': u'True', u'resource_type': None}, {u'cache_last_updated': None, u'package_id': u'14a9bcd1-d610-472a-9133-58928ab5a659', u'webstore_last_updated': None, u'id': u'eb5633a8-bab3-4e73-8acd-1a48cc8adb4e', u'size': None, u'state': u'active', u'resource_locator_function': u'', u'hash': u'', u'description': u'', u'format': u'', u'tracking_summary': {u'total': 0, u'recent': 0}, u'mimetype_inner': None, u'resource_locator_protocol': u'', u'mimetype': None, u'cache_url': None, u'name': u'Web Resource', u'created': u'2020-08-18T18:46:54.413079', u'url': u'https://doi.org/10.3133/ofr20181023', u'webstore_url': None, u'last_modified': None, u'position': 7, u'no_real_name': u'True', u'resource_type': None}], u'num_resources': 8, u'tags': [], u'tracking_summary': {u'total': 0, u'recent': 0}, u'license_id': None, u'relationships_as_subject': [], u'organization': {u'description': u'We provide science about the natural hazards that threaten lives and livelihoods, the water, energy, minerals, and other natural resources we rely on, the health of our ecosystems and environment, and the impacts of climate and land-use change. Our scientists develop new methods and tools to enable timely, relevant, and useful information about the Earth and its processes.', u'created': u'2017-11-01T17:01:34.613268', u'title': u'U.S. 
Geological Survey', u'name': u'u-s-geological-survey', u'is_organization': True, u'state': u'active', u'image_url': u'https://upload.wikimedia.org/wikipedia/commons/0/08/USGS_logo.png', u'revision_id': u'4be6fb48-86a0-41ee-b4bb-c6b32aee20e1', u'type': u'organization', u'id': u'af69efa6-baa3-454a-9e8e-2d417aa0cc00', u'approval_status': u'approved'}, u'name': 'backscatter-fugro-offshore-of-gaviota-map-area-california2e3e3', u'isopen': False, u'url': None, u'notes': u'This part of DS 781 presents 2-m-resolution data collected by Fugro Pelagos for the acoustic-backscatter map of the Offshore of Gaviota Map Area, California. The GeoTiff is included in "Backscatter_[Fugro]_OffshoreGaviota.zip," which is accessible from https://doi.org/10.5066/F7TH8JWJ. These data accompany the pamphlet and map sheets of Johnson, S.Y., Dartnell, P., Cochrane, G.R., Hartwell, S.R., Golden, N.E., Kvitek, R.G., and Davenport, C.W. (S.Y. Johnson and S.A. Cochran, eds.), 2018, California State Waters Map Series\xe2\x80\x94Offshore of Gaviota, California: U.S. Geological Survey Open-File Report 2018\xe2\x80\x931023, pamphlet 41 p., 9 sheets, scale 1:24,000, https://doi.org/10.3133/ofr20181023. The acoustic-backscatter map of the Offshore of Gaviota map area in southern California was generated from acoustic-backscatter data collected by the U.S. Geological Survey (USGS) and by Fugro Pelagos Inc. Acoustic mapping was completed between 2007 and 2008 using a combination of 400-kHz Reson 7125, 240-kHz Reson 8101, and 100-kHz Reson 8111 multibeam echosounders, as well as a 234-kHz SEA SWATHplus bathymetric sidescan-sonar system. 
These mapping missions combined to collect acoustic-backscatter data from about the 10-m isobath to beyond the limit of California\'s State Waters.', u'owner_org': u'f08012cc-f758-4127-a28b-bd48a0165eae', u'extras': [{u'key': u'bbox-east-long', u'value': u'-120.183012'}, {u'key': u'resource-type', u'value': u'dataset'}, {u'key': u'bbox-north-lat', u'value': u'34.475240'}, {u'key': u'coupled-resource', u'value': u'[]'}, {u'key': u'graphic-preview-type', u'value': u'JPEG'}, {u'key': u'guid', u'value': u''}, {u'key': u'graphic-preview-file', u'value': u'https://www.sciencebase.gov/catalog/file/get/5898ecd9e4b0efcedb70779e?name=Backscatter_[Fugro]_OffshoreGaviota.jpg&allowOpen=true'}, {u'key': u'metadata-language', u'value': u''}, {u'key': u'spatial-reference-system', u'value': u''}, {u'key': u'spatial_harvester', u'value': True}, {u'key': u'spatial', u'value': u'{"type": "Polygon", "coordinates": [[[-120.418584, 34.39471], [-120.183012, 34.39471], [-120.183012, 34.47524], [-120.418584, 34.47524], [-120.418584, 34.39471]]]}'}, {u'key': u'progress', u'value': u'completed'}, {u'key': u'access_constraints', u'value': u'["Use Constraints: USGS-authored or produced data and information are in the public domain from the U.S. Government and are freely redistributable with proper metadata and source attribution. Please recognize and acknowledge the U.S. Geological Survey as the originator(s) of the dataset and in products derived from these data. This information is not intended for navigation purposes.", "Access Constraints: None"]'}, {u'key': u'temporal-extent-begin', u'value': u'2007-01-01'}, {u'key': u'contact-email', u'value': u'pcmsc_data@usgs.gov'}, {u'key': u'bbox-west-long', u'value': u'-120.418584'}, {u'key': u'metadata-date', u'value': u'2018-08-31'}, {u'key': u'dataset-reference-date', u'value': u'[{"type": "publication", "value": "2017-01-01"}]'}, {u'key': u'graphic-preview-description', u'value': u'Backscatter collected by Fugro Pelagos, Inc. 
in the Offshore of Gaviota map area'}, {u'key': u'frequency-of-update', u'value': u'notPlanned'}, {u'key': u'licence', u'value': u'["Unless otherwise stated, all data, metadata and related materials are considered to satisfy the quality standards relative to the purpose for which the data were collected. Although these data and associated metadata have been reviewed for accuracy and completeness and approved for release by the U.S. Geological Survey (USGS), no warranty expressed or implied is made regarding the display or utility of the data on any other system or for general or scientific purposes, nor shall the act of distribution constitute any such warranty."]'}, {u'key': u'metadata_type', u'value': u'geospatial'}, {u'key': u'responsible-party', u'value': u'[{"name": "U.S. Geological Survey, Pacific Coastal and Marine Science Center", "roles": ["pointOfContact"]}]'}, {u'key': u'temporal-extent-end', u'value': u'2008-01-01'}, {u'key': u'spatial-data-service-type', u'value': u''}, {u'key': u'bbox-south-lat', u'value': u'34.394710'}, {u'key': u'tags', u'value': u'marine geophysics, pacific ocean, state of california, backscatter/seafloortopography, oceans, santa barbara county, sea-floor acoustic reflectivity, gaviota, seafloor topography, backscatter'}, {u'key': u'harvest_object_id', u'value': u'e92924a5-e7ba-493d-b4e0-36ca058fcf76'}, {u'key': u'harvest_source_id', u'value': u'00f54869-3a43-480d-a67c-0ba115f5d739'}, {u'key': u'harvest_source_title', u'value': u'USGS-Harvest'}], u'title': u'Backscatter [Fugro]--Offshore of Gaviota Map Area, California', u'revision_id': u'118588c3-e41e-4670-90ed-2c0a2a70406b'}
2020-10-02 16:05:21,504 ERROR [ckanext.harvest.harvesters.base] {u'extras': [{'key': ['There is a schema field with the same name']}, {}]}
Traceback (most recent call last):
  File "/srv/app/src_extensions/ckanext-harvest/ckanext/harvest/harvesters/base.py", line 372, in _create_or_update_package
    else 'package_create_rest')(context, package_dict)
  File "/srv/app/src/ckan/ckan/logic/__init__.py", line 466, in wrapped
    result = _action(context, data_dict, **kw)
  File "/srv/app/src/ckanext-geodatagov/ckanext/geodatagov/logic.py", line 382, in package_create
    return up_func(context, data_dict)
  File "/srv/app/src/ckan/ckan/logic/action/create.py", line 177, in package_create
    raise ValidationError(errors)
ValidationError: {u'extras': [{'key': ['There is a schema field with the same name']}, {}]}
2020-10-02 16:05:21,527 DEBUG [ckanext.harvest.model] Invalid package with GUID 14a9bcd1-d610-472a-9133-58928ab5a659: {u'extras': [{'key': ['There is a schema field with the same name']}, {}]}
avdata99 commented 3 years ago

Maybe this commit in the CKAN fork is related. Applying it to CKAN 2.8 makes the error disappear.

To capture this in a simple test case (right now we have to run a 20K-dataset source to trigger the error), we need to harvest a CKAN source that contains a dataset with the tags extra.
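
A minimal sketch of the collision behind this error, assuming the failure comes from an extras entry whose key matches a top-level schema field (the schema in the log lists an `extra_key_not_in_root_schema` validator on extras keys). The field set and the helper name below are hypothetical and only illustrate the idea; this is not CKAN's validator code.

```python
# Hypothetical subset of the root package schema fields; the names are taken
# from the schema printed in the log, but the set itself is an assumption.
ROOT_SCHEMA_FIELDS = {
    "title", "name", "notes", "url", "tags",
    "groups", "resources", "extras", "owner_org",
}


def conflicting_extra_keys(package_dict, schema_fields=ROOT_SCHEMA_FIELDS):
    """Return the extras keys that share a name with a root schema field."""
    return [
        extra["key"]
        for extra in package_dict.get("extras", [])
        if extra.get("key") in schema_fields
    ]


# Small stand-in for a harvested dataset: one harmless extra and one
# extra named "tags", which collides with the schema field of the same name.
sample_package = {
    "name": "sample-dataset",
    "title": "Sample dataset",
    "extras": [
        {"key": "guid", "value": ""},
        {"key": "tags", "value": "oceans, backscatter"},
    ],
}

print(conflicting_extra_keys(sample_package))
```

A test fixture along these lines (one dataset carrying a tags extra) might be enough to reproduce the error without running the full 20K-dataset source.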

avdata99 commented 3 years ago

How to QA:

avdata99 commented 3 years ago

New error found while harvesting the DOI source (after harvesting 13907 datasets):

[ckanext.harvest.model] Invalid package with {u'tags': [u'Tag "magnetic field (earth)" must be alphanumeric characters or symbols: -_.']}

Context:

[ckanext.harvest.queue] Received harvest object id: c93128dd-bb66-4cec-84d1-b365bdf9253a                       
2020-10-06 19:40:51,809 DEBUG [ckanext.harvest.harvesters.ckanharvester] In CKANHarvester import_stage                                       
2020-10-06 19:40:51,810 DEBUG [ckanext.harvest.harvesters.ckanharvester] Using config: {u'private_datasets': u'False'}                       
2020-10-06 19:40:52,066 INFO  [ckanext.harvest.harvesters.base] Package with GUID ffff6374-2008-4a8f-9f89-5358f7855fa1 does not exist, let's create it                    
2020-10-06 19:40:52,091 INFO  [ckanext.geodatagov.logic] chained package_create 2.8.4 Magnetotelluric sounding data, station 7, Taos Plateau Volcanic Field, New Mexico, 2009
2020-10-06 19:40:52,091 INFO  [ckanext.geodatagov.logic] dataset tag found         
        []                                                                         
Update with                                                                        
        [{'display_name': u'tipper', 'name': u'tipper'}, {'display_name': u'aquifer', 'name': u'aquifer'}, {'display_name': u'cggsc', 'name': u'cggsc'}, {'display_name': u'sounding', 'name': u'sounding'}, {'display_name': u'apparent resistivity', 'name': u'apparent resistivity'}, {'display_name': u'crustal geophysics and geochemistry science center', 'name': u'crustal geophysics and geochemistry science center'}, {'display_name': u'electromagnetic surveying', 'name': u'electromagnetic surveying'}, {'display_name': u'magnetic field (earth)', 'name': u'magnetic field (earth)'}, {'display_name': u'cerro', 'name': u'cerro'}, {'display_name': u'magnetic surveying', 'name': u'magnetic surveying'}, {'display_name': u'taos', 'name': u'taos'}, {'display_name': u'impedance phase', 'name': u'impedance phase'}, {'display_name': u'mineral resources program', 'name': u'mineral resources program'}, {'display_name': u'rocky mountains', 'name': u'rocky mountains'}, {'display_name': u'magnetotelluric', 'name': u'magnetotelluric'}, {'display_name': u'geophysics', 'name': u'geophysics'}, {'display_name': u'new mexico', 'name': u'new mexico'}, {'display_name': u'impedance', 'name': u'impedance'}, {'display_name': u'hydrogeology', 'name': u'hydrogeology'}, {'display_name': u'taos county', 'name': u'taos county'}, {'display_name': u'questa', 'name': u'questa'}, {'display_name': u'gps measurement', 'name': u'gps measurement'}, {'display_name': u'tres piedras', 'name': u'tres piedras'}, {'display_name': u'san luis valley', 'name': u'san luis valley'}, {'display_name': u'mrp', 'name': u'mrp'}, {'display_name': u'mt', 'name': u'mt'}, {'display_name': u'groundwater', 'name': u'groundwater'}, {'display_name': u'impedance strike', 'name': u'impedance strike'}]                       
2020-10-06 19:40:52,114 ERROR [ckanext.harvest.harvesters.base] {u'tags': [{}, {}, {}, {}, {}, {}, {}, u'Tag "magnetic field (earth)" must be alphanumeric characters or symbols: -_.', {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}]}
Traceback (most recent call last):                                                 
  File "/srv/app/src/ckanext-harvest/ckanext/harvest/harvesters/base.py", line 372, in _create_or_update_package                             
    else 'package_create_rest')(context, package_dict)                             
  File "/srv/app/src/ckan/ckan/logic/__init__.py", line 466, in wrapped            
    result = _action(context, data_dict, **kw)                                     
  File "/srv/app/src/ckanext-geodatagov/ckanext/geodatagov/logic.py", line 394, in package_create                                            
    return up_func(context, data_dict)                                             
  File "/srv/app/src/ckan/ckan/logic/action/create.py", line 177, in package_create
    raise ValidationError(errors)                                                  
ValidationError: {u'tags': [{}, {}, {}, {}, {}, {}, {}, u'Tag "magnetic field (earth)" must be alphanumeric characters or symbols: -_.', {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}]}
2020-10-06 19:40:52,139 DEBUG [ckanext.harvest.model] Invalid package with GUID ffff6374-2008-4a8f-9f89-5358f7855fa1: {u'tags': [{}, {}, {}, {}, {}, {}, {}, u'Tag "magnetic field (earth)" must be alphanumeric characters or symbols: -_.', {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}]}
avdata99 commented 3 years ago

This is the dataset that fails: https://catalog.data.gov/api/3/action/package_show?id=magnetotelluric-sounding-data-station-7-taos-plateau-volcanic-field-new-mexico-2009

with an extra such as:

{
key: "tags",
value: "tipper, aquifer, cggsc, sounding, apparent resistivity, 
     crustal geophysics and geochemistry science center,
     electromagnetic surveying, magnetic field (earth), cerro, magnetic surveying, 
     taos, impedance phase, mineral resources program, rocky mountains, magnetotelluric, 
    geophysics, new mexico, impedance, hydrogeology, taos county, questa, 
    gps measurement, tres piedras, san luis valley, mrp, mt, groundwater, impedance strike"
},
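
A rough sketch of how a tags extra like this one could be split and checked before it reaches validation. The pattern is an assumption modeled on the error message (alphanumeric characters, spaces, and the symbols -_.), and the helper names are hypothetical. Replacing the disallowed characters is just one possible cleanup and is not necessarily what the fix PR does.

```python
import re

# Assumed rule, derived from the error message: word characters, spaces,
# hyphens and dots are allowed (underscore is already a word character).
VALID_TAG = re.compile(r"^[\w \-.]*$", re.UNICODE)


def split_tags(value):
    """Split a comma-separated tags extra into a list of stripped tag names."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def invalid_tags(tags):
    """Return the tags that contain characters outside the allowed set."""
    return [tag for tag in tags if not VALID_TAG.match(tag)]


def sanitize_tag(tag):
    """Replace disallowed characters with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^\w \-.]", " ", tag, flags=re.UNICODE)
    return " ".join(cleaned.split())


# Shortened version of the extra value shown above.
tags_extra = "tipper, aquifer, magnetic field (earth), taos county"
tags = split_tags(tags_extra)

print(invalid_tags(tags))
print([sanitize_tag(tag) for tag in tags])
```

Only the tag with parentheses is flagged here; the other tags pass through the cleanup unchanged.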
thejuliekramer commented 3 years ago

Fix tags PR confirmed locally - will deploy to sandbox

Screen Shot 2020-10-07 at 7 18 49 PM
thejuliekramer commented 3 years ago

We are getting closer: we are at 75% of the datasets in Production. It looks like we are able to import and are getting the same number of errors, but we are receiving a harvest object update error, as you can see below.

Screen Shot 2020-10-12 at 5 29 47 PM
thejuliekramer commented 3 years ago

@jbrown-xentity do the above errors look familiar to you? I think I have deleted, re-added, and re-harvested this source too many times, and now I am seeing this conflict. I am wondering if there is a change I can make using the harvester extension commands or directly in the DB to fix this. CC @avdata99

jbrown-xentity commented 3 years ago

We have some documentation. This is our standard "the harvest is stuck and won't finish" type fix. This is documentation that was created when the system got really messed up and there were multiple copies of harvest sources and datasets and cross-relationships between the duplicates. Cleaning them up required some manual edits...

thejuliekramer commented 3 years ago

@jbrown-xentity thank you so much for that info - we are now able to harvest

thejuliekramer commented 3 years ago

@hkdctol this is now harvesting properly - it is running right now but you can check the progress here https://admin-catalog-next-datagov.dev-ocsit.bsp.gsa.gov/harvest/doi-open-data/

thejuliekramer commented 3 years ago

Created https://github.com/GSA/datagov-ckan-multi/issues/497 to track why we need the script to clear out old DOI harvest sources when there is more than one

thejuliekramer commented 3 years ago
Screen Shot 2020-10-22 at 8 28 11 AM
hkdctol commented 3 years ago

Harvest completed.