cisagov / cyhy-system

Cyber Hygiene system and overall documentation/issue tracking
Creative Commons Zero v1.0 Universal

The `places` collection in the MongoDB database has non-ASCII characters in document contents #109

Open mcdonnnj opened 11 months ago

mcdonnnj commented 11 months ago

🐛 Summary

The places collection contains non-ASCII (Unicode) characters in its document contents. This causes problems when those documents are processed by code that breaks or raises errors on non-ASCII strings. It is most noticeable when the cyhy-simple script is used to prepare a JSON file for import with the cyhy-import script: cyhy-simple pulls data from the places collection, and when the generated file is passed to cyhy-import, it fails the encoding checks performed in both import functions.

The non-ASCII data is in the source dataset and the GNIS_data_import.py script does not try to remove non-ASCII characters.
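
One possible way to handle this at import time is to fold place names to ASCII before they are stored. The following is a minimal sketch only, assuming Python 3; `to_ascii` is a hypothetical helper and is not part of GNIS_data_import.py, and the sample place names are illustrative rather than taken from the database.

```python
import unicodedata


def to_ascii(text):
    """Fold text to ASCII (hypothetical helper, not part of the cyhy scripts).

    Accented characters are decomposed into a base letter plus combining
    marks (NFKD), then every character that is still non-ASCII is dropped.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Illustrative inputs only
print(to_ascii("Doña Ana"))    # Dona Ana
print(to_ascii("Cañon City"))  # Canon City
```

This folding is lossy: characters without an ASCII decomposition are removed entirely rather than transliterated, so whether it is acceptable depends on how the place names are used downstream.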

To reproduce

Steps to reproduce the behavior:

  1. Populate the places collection using the load_places.sh script (which calls GNIS_data_import.py).
  2. Use cyhy-simple to create a request document JSON that references the feature ID of a place containing non-ASCII characters.
  3. Try to import that request document JSON using cyhy-import.

Expected behavior

Any request document JSONs created by cyhy-simple should not fail ASCII encoding checks in cyhy-import.
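
To check a generated request document against this expectation before importing it, something like the sketch below could be used. It assumes Python 3.7+ (for `str.isascii`); `find_non_ascii` is a hypothetical helper, and the field names in the sample document are made up for illustration and do not reflect the actual request document schema.

```python
import json


def find_non_ascii(value, path="$"):
    """Yield (path, string) for each non-ASCII string value nested in value."""
    if isinstance(value, str):
        if not value.isascii():
            yield path, value
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from find_non_ascii(item, f"{path}.{key}")
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from find_non_ascii(item, f"{path}[{index}]")


# Hypothetical document layout, for illustration only
request = json.loads(
    '{"agency": {"name": "Example", "location": {"name": "Do\\u00f1a Ana"}}}'
)
offending = list(find_non_ascii(request))
for path, text in offending:
    print(f"{path}: {text!r}")
```

An empty result would mean every string value in the document is plain ASCII. Note that this sketch inspects values only, not dictionary keys.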