GSA / data.gov

Main repository for the data.gov service
https://data.gov

Geometry not valid JSON breaks solr index batch process #4494

Open FuhuXia opened 10 months ago

FuhuXia commented 10 months ago

We fixed the root cause of this issue in harvesting in https://github.com/GSA/data.gov/issues/4373, but bad Geometry data that entered before the fix still causes errors when the affected datasets are reindexed (during db-solr-sync, or possibly other batch Solr index processes).

[ckan.lib.search] Indexing just package 'rail-equipment-accidents-accident-causes'...
[ckanext.spatial.search] Geometry not valid JSON Expecting value: line 1 column 1 (char 0), not indexing :: Latitude/Longitude, County, State
[ckanext.geodatagov] Error while rebuild index 3a6edb7b-4432-4722-9d29-22f7867244a9: AttributeError("'NoneType' object has no attribute 'get'")
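
The "Expecting value: line 1 column 1 (char 0)" text in the second log line is the message Python's `json` module produces when the input does not begin with a JSON value. A minimal reproduction, using the spatial string from the log above (the variable names are illustrative only; this is not the ckanext-spatial code). The `AttributeError` in the third line is presumably a follow-on failure where some later step calls `.get` on a `None` result, but the log does not show which call.

```python
import json

# Spatial value copied from the log line above; it is free text, not GeoJSON.
bad_spatial = "Latitude/Longitude, County, State"

try:
    json.loads(bad_spatial)
    message = None
except json.JSONDecodeError as exc:
    # The decoder fails at the very first character.
    message = str(exc)

print(message)  # Expecting value: line 1 column 1 (char 0)
```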

How to reproduce

Look at the output of db-solr-sync https://github.com/GSA/catalog.data.gov/actions/runs/6541962407/job/17764268731

Expected behavior

db-solr-sync ignores the error and continues.

Actual behavior

db-solr-sync halts.

Sketch

Three approaches:

  1. handle the invalid Geometry JSON error during indexing so it does not raise an error, or
  2. make db-solr-sync more robust, like tracking-update, so that it ignores this error and continues, or
  3. manually identify those leftover datasets and update their spatial fields to valid JSON.
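
A rough sketch of what approaches 1 and 2 might look like, under stated assumptions: `parse_geometry`, `reindex_all`, and the `index_one` callback are hypothetical names invented for illustration, not functions from CKAN, ckanext-spatial, or ckanext-geodatagov. The idea is to treat an unparseable spatial value as "no geometry" rather than an error, and to catch per-dataset failures in the batch loop so one bad record does not halt the run.

```python
import json
import logging

log = logging.getLogger(__name__)


def parse_geometry(spatial):
    """Approach 1 (hypothetical helper): return a geometry dict, or None if unusable."""
    if not spatial:
        return None
    try:
        geometry = json.loads(spatial)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return None
    # Valid JSON that is not an object (e.g. a bare string) is also unusable.
    return geometry if isinstance(geometry, dict) else None


def reindex_all(package_ids, index_one):
    """Approach 2 (hypothetical loop): index each package, log failures, keep going.

    index_one is a placeholder for whatever call actually indexes one package.
    Returns the list of package ids that failed.
    """
    failed = []
    for package_id in package_ids:
        try:
            index_one(package_id)
        except Exception as exc:
            log.error("Error while rebuilding index %s: %r", package_id, exc)
            failed.append(package_id)
    return failed
```

The list returned by `reindex_all` could also feed approach 3, since it names the datasets whose spatial fields still need a manual fix; likewise, running a check such as `parse_geometry` over stored spatial values is one possible way to find them ahead of time.
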
FuhuXia commented 10 months ago

This error is holding up the daily db-solr-sync process.