## 🐛 Summary

The `places` collection has non-ASCII (Unicode) characters in document contents. This causes issues when those documents are accessed by code that either breaks or throws errors on attempting to process non-ASCII strings. This most noticeably happens when the `cyhy-simple` script is used to prepare a JSON for import using the `cyhy-import` script. The `cyhy-simple` script will pull in data from the `places` collection here, and when the generated file is imported in `cyhy-import` it runs afoul of the encoding checks being done in both import functions.

The non-ASCII data is in the source dataset, and the `GNIS_data_import.py` script does not try to remove non-ASCII characters.

## To reproduce
Steps to reproduce the behavior:

1. Populate the `places` collection using the `load_places.sh` script (which calls the `GNIS_data_import.py` script).
2. Create a request document with `cyhy-simple` that uses a feature ID of a place containing non-ASCII information.
3. Import the generated file with `cyhy-import`.

## Expected behavior
Any request document JSONs created by `cyhy-simple` should not fail ASCII encoding checks in `cyhy-import`.
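The failure mode above can be sketched in a few lines. This is only an illustration, not the actual check performed in `cyhy-import`: it assumes the check amounts to an ASCII encode of string fields, and the helper names (`is_ascii`, `strip_non_ascii`) and the sample place name are hypothetical. The stripping approach shown (NFKD decomposition, then dropping non-ASCII bytes) is one possible mitigation `GNIS_data_import.py` could apply, not the project's chosen fix:

```python
import unicodedata


def is_ascii(text):
    """Return True if text encodes cleanly as ASCII (a stand-in for the
    kind of encoding check that cyhy-import performs)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def strip_non_ascii(text):
    """Decompose accented characters (NFKD), then drop any remaining
    non-ASCII bytes; one possible normalization for place names."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")


# Hypothetical GNIS place name containing a non-ASCII character.
place = "Española"
print(is_ascii(place))                    # False: this is what trips the import
print(strip_non_ascii(place))             # Espanola
print(is_ascii(strip_non_ascii(place)))   # True
```

Note that `"ignore"` silently discards characters that do not decompose to an ASCII base letter, so names could lose information; a stricter variant could raise instead and flag the record for review.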