cisagov / cyhy-system

Cyber Hygiene system and overall documentation/issue tracking

Migrate to latest GNIS data #110

Open mcdonnnj opened 11 months ago

mcdonnnj commented 11 months ago

💡 Summary

We should migrate to the latest GNIS data for government units and populated places information. Additionally, we should stop checking whether the data has already been imported (and skipping the import when the check believes it has).

Motivation and context

The government units and populated places information in the CyHyDB is currently sourced from the old GOVT_UNITS and POP_PLACES datasets. These sources have not been updated since 2021-08-25.

GNIS offers current data downloads at https://prd-tnm.s3.amazonaws.com/index.html?prefix=StagedProducts/GeographicNames/Topical/ which can be reached from the following GNIS page: https://www.usgs.gov/us-board-on-geographic-names/download-gnis-data. That page also links to a mirror of the legacy data we source, in an archive section with the following description:

Static legacy domestic and Antarctica names data last updated August 2021. This data will not be refreshed. Includes unmaintained administrative features no longer included in GNIS.

As an additional wrinkle, we have not been pulling in any GNIS data when the database instance is redeployed, because of this check in the GNIS_data_import.py script. Therefore, in addition to sourcing current data, we should probably stop checking whether the data is already imported. That check has prevented us from importing new records (due to how the check functions) as well as from updating existing records.
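
To illustrate the alternative to a skip-if-imported check, here is a minimal sketch of an import that upserts every record on each run. It is not the code in GNIS_data_import.py: the `upsert_places` function, the `feature_id` key, and the use of a plain dict as a stand-in for the database collection are all assumptions made for illustration.

```python
def upsert_places(store, records, key="feature_id"):
    """Insert new records and overwrite changed ones, keyed by a stable ID.

    `store` is a plain dict standing in for the database collection
    (hypothetical; the real CyHyDB access code is not shown here).
    Returns a tuple of (inserted_count, updated_count).
    """
    inserted = 0
    updated = 0
    for record in records:
        record_id = record[key]
        existing = store.get(record_id)
        if existing is None:
            inserted += 1
        elif existing != record:
            updated += 1
        else:
            # Identical record already present; nothing to write.
            continue
        # Store a copy so later changes to the input do not leak in.
        store[record_id] = dict(record)
    return inserted, updated
```

Because every run walks the full input, rerunning the import after a redeploy would pick up both new records and changes to existing ones, which is what the current check prevents.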

Implementation notes

There are some records in the old data set that are no longer in the current data set. When I checked the production CyHyDB, I found only a single stakeholder with GNIS data falling into that category. We will need to migrate that stakeholder to a new GNIS location just to be safe. Please see this file for set comparisons between legacy and current datasets.
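
A comparison of this kind can be sketched as follows. This is not the script that produced the linked file; it assumes both datasets are pipe-delimited text with a header row, and the ID column name is passed in by the caller because it may differ between the legacy and current files. The function names are hypothetical.

```python
import csv


def load_ids(text_stream, id_column, delimiter="|"):
    """Collect the set of feature IDs from a delimited text stream.

    Assumes a header row that names `id_column` (an assumption about
    the file layout, not something confirmed in this issue).
    """
    reader = csv.DictReader(text_stream, delimiter=delimiter)
    return {row[id_column].strip() for row in reader}


def compare_ids(legacy_ids, current_ids):
    """Split IDs into those only in legacy, only in current, and in both."""
    return {
        "only_legacy": legacy_ids - current_ids,
        "only_current": current_ids - legacy_ids,
        "in_both": legacy_ids & current_ids,
    }
```

The `only_legacy` set is the one that matters for migration: any stakeholder whose GNIS ID appears there would need to be moved to a location present in the current data.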

[!NOTE] This issue should be rolled into the work for #111.

Acceptance criteria