GSA / data.gov

Main repository for the data.gov service
https://data.gov

Catalog Fetch DCAT-US fails on URL identifier #4040

Open jbrown-xentity opened 2 years ago

jbrown-xentity commented 2 years ago

We get errors when checking whether datasets already exist in the fetch harvester for DCAT-US objects whose unique identifiers contain : in the string, such as https://something-something.gov. You can see this class of errors in the logs by using space_name:prod app_name:catalog-fetch missing:gauge "Reason: org.apache.solr.search.SyntaxError" as the filter. As of now, there have been 74 instances of this in the last 24 hours.
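For illustration, here is a minimal sketch of why a : in the identifier can trip Solr's query parser and how backslash-escaping avoids it. The helper name escape_solr_value and the field name identifier are hypothetical, not taken from the harvester code; the character list follows the special characters documented for Solr's standard query parser.

```python
import re

# Characters with special meaning in Solr's standard query parser.
# Escaping a lone & or | is harmless, so both are included for simplicity.
_SOLR_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape_solr_value(value: str) -> str:
    """Backslash-escape Solr query metacharacters in a literal value.

    Hypothetical helper for illustration; not part of CKAN or ckanext-datajson.
    """
    return _SOLR_SPECIAL.sub(r"\\\1", value)


identifier = "https://something-something.gov"

# Unescaped, the parser sees "identifier:https:" followed by stray tokens,
# because the second ":" is read as another field separator.
broken_query = "identifier:" + identifier

# Escaped, the whole URL is treated as a single literal term.
# "identifier" is an assumed field name.
safe_query = "identifier:" + escape_solr_value(identifier)
print(safe_query)
```

Whether the actual fix belongs in ckanext-datajson or in CKAN's own query handling is still open; this only shows the escaping itself.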

How to reproduce

  1. Harvest a record with a unique identifier that has https:// in the string

Expected behavior

Successful harvest (using an https URL as the identifier is actually what the spec recommends)

Actual behavior

Harvest fails

Sketch

Add a new test covering this use case in ckanext-datajson, then investigate how to make the CKAN search call correctly. If there is a bug in how CKAN requests this data from Solr, that will need to be raised. It's also possible that this function call only occurs on a re-harvest, when checking whether the data already exists in the current system (versus a first harvest, which doesn't check anything). We've done this in the past (see here). The error is occurring here.
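A rough sketch of what such a test could look like, assuming the lookup goes through a package_search-style call with an fq filter. build_identifier_search and the identifier field name are hypothetical stand-ins, not the real ckanext-datajson code. This variant wraps the value in a quoted phrase, where only backslashes and double quotes need escaping.

```python
def build_identifier_search(identifier: str) -> dict:
    """Build a package_search-style data_dict matching one exact identifier.

    Hypothetical helper: the real query in ckanext-datajson may use a
    different field name or parameters.
    """
    # Inside a quoted phrase, ":" and "/" are literal; only "\" and '"'
    # must be escaped.
    escaped = identifier.replace("\\", "\\\\").replace('"', '\\"')
    return {"fq": f'identifier:"{escaped}"', "rows": 1}


def test_url_identifier_is_quoted():
    params = build_identifier_search("https://something-something.gov")
    assert params["fq"] == 'identifier:"https://something-something.gov"'
    assert params["rows"] == 1


def test_embedded_quote_is_escaped():
    params = build_identifier_search('abc"def')
    assert params["fq"] == 'identifier:"abc\\"def"'


test_url_identifier_is_quoted()
test_embedded_quote_is_escaped()
```

A real test would also need to run a harvest twice against a test Solr instance, since the lookup may only happen on re-harvest.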

jbrown-xentity commented 2 years ago

In trying to write a test case to cover this use case, we ran into problems...

  1. We currently throw an error if the child record is harvested before the parent record. Is that desired behavior?
  2. We modify data values on the fly during gather so that we can "notice" them and treat them differently when importing. We don't add new values, we modify existing ones. This seems wrong.
  3. Why do we need the CKAN ID for the parent record? Why can't we use the already created identifier extra to link objects?

The open item should patch this issue for now, but other problems will keep arising from the extremely complex code that handles collections and parents. Ideally we should simplify where we can.