City Planning Facilities Database (FacDB)

This repo contains all the scripts used to build the City Planning Facilities Database (FacDB), which is used by the NYC Facilities Explorer: https://capitalplanning.nyc.gov/facilities

Please note this database and its build and maintenance processes are still in active development.

Documentation for Building and Maintaining FacDB

Contents

Prerequisites

  1. Install Node.js
  2. Install PostgreSQL
  3. Install PostGIS

Getting Started

  1. Clone this repo and run npm install inside of it.
  2. Clone Data Loading Scripts repo and run npm install inside of it.
  3. Create a database in your PostgreSQL instance to use for this project (see the SQL sketch after this list).
  4. Create an environment variable in your bash profile that provides your DATABASE_URL. This gets used in all the .sh scripts.
    • Open ~/.bash_profile in a text editor (e.g. Sublime) and add the following line:
    • export DATABASE_URL=postgres://{User}:{Password}@{Host}:{Port}/{Database}
    • Reload your profile with source ~/.bash_profile and check that the variable was set with printenv
  5. Enter your database information in dbconfig.sample.js and save it as dbconfig.js.
  6. Generate an API ID and Key for Geoclient. Directions here. Plug these values into apiCredentials.sample.js and save it as apiCredentials.js.
  7. Run sh 1_download.sh
  8. Run sh 2_assembly.sh
  9. Run sh 3_geoprocessing.sh
  10. Run sh 4_deduping.sh
  11. Run sh 5_export.sh
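
For step 3, a minimal sketch of creating the project database and enabling PostGIS from psql; the database name facdb is a placeholder and should match whatever you put in DATABASE_URL:

    -- Create the project database and enable PostGIS (run from psql).
    -- "facdb" is a placeholder name; use the database named in DATABASE_URL.
    CREATE DATABASE facdb;
    \c facdb
    CREATE EXTENSION postgis;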

Summary of Build Process and Stages

High Level

1. Obtaining Data

The build follows an Extract -> Load -> Transform sequence rather than an ETL (Extract-Transform-Load) sequence. All the source datasets are first loaded into PostgreSQL using the Data Loading Scripts repo. After the required source data is loaded into the PostGIS database, the build or update process begins.
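
As a rough illustration of the load-then-transform order (this is not the actual loading code; the file and table names are hypothetical):

    -- Extract/Load: pull a raw source file into PostgreSQL essentially as-is.
    CREATE TABLE raw_source_dataset (
        facility_name text,
        address       text,
        borough       text,
        latitude      numeric,
        longitude     numeric
    );
    \copy raw_source_dataset FROM 'source_dataset.csv' CSV HEADER
    -- Transform: all reshaping into the FacDB schema happens afterwards, in SQL, inside the database.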

Over 50% of the data sources used for FacDB are available in a machine-readable format via the NYC Open Data Portal, NYS Open Data Portal, or an agency's website. Other sources that are not published in this format are generally shared with DCP over email; DCP then puts these files on an FTP for internal use.

The 1_download.sh script downloads and loads all the necessary source datasets for the Facilities Database (FacDB).

2. Assembly

When building the database from scratch, this stage begins with creating the empty FacDB table. Then, for each source data table, a 'config' script is used to insert the records. The desired columns in the source data get mapped to the columns in the FacDB schema. Many values also need to be recoded, and the facility records then need to be classified. The facilities are classified using categories or descriptions provided by the agency. In general, the final Facility Type categories in FacDB are formatted versions of the original, most granular classification provided by the agency, but there are also cases where the source description was too specific and records were grouped into broader type categories using keywords in the description.
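
A simplified sketch of what one config script's insert might look like; the FacDB table name and all source column names and recode values here are hypothetical:

    -- Map source columns onto FacDB columns, recoding values and assigning type categories.
    INSERT INTO facilities (facname, address, boro, facsubgrp, factype)
    SELECT
        initcap(site_name),
        street_address,
        CASE boro_code
            WHEN '1' THEN 'Manhattan'
            WHEN '2' THEN 'Bronx'
            WHEN '3' THEN 'Brooklyn'
            WHEN '4' THEN 'Queens'
            WHEN '5' THEN 'Staten Island'
        END,
        'Hypothetical Subgroup',
        initcap(agency_facility_type)
    FROM raw_source_dataset;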

Critically, during the insert process, a hash of the full row in the source data table is created and stored in each FacDB record in the 'hash' field. This hash is the unique identifier that allows records in FacDB to be modified and improved while still linking the modified record back to its original form in the source data table. This is essential for the FacDB update process of reconciling refreshed source data against the existing contents of FacDB. The hash is then converted to an integer 'uid', which is used as the true unique identifier for each facility and is used in the Facilities Explorer.
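
A minimal sketch of the hash and uid idea, assuming an MD5 digest of the row's text representation; the actual hash function and uid derivation in the build scripts may differ, and the table and column names are placeholders:

    -- Store a fingerprint of the complete source row with each inserted record.
    INSERT INTO facilities (facname, hash)
    SELECT site_name, md5(r::text)
    FROM raw_source_dataset r;

    -- Derive an integer uid from the hash (one possible approach).
    UPDATE facilities
    SET uid = ('x' || substr(hash, 1, 8))::bit(32)::int
    WHERE uid IS NULL;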

The end product of this Assembly stage is all the records and available attributes from the source datasets formatted, recoded (if necessary), and inserted into the FacDB table. At this point many records are still missing geometries and other location-related attributes such as addresses, ZIP codes, BBLs, and BINs.

A diagram of the Assembly process is provided below.

The 2_assembly.sh script runs all the steps in the Assembly process and is annotated to describe each of the scripts used.

3. Geoprocessing

Many of the source datasets only provide addresses without coordinates. Records without coordinates are geocoded with the Geoclient API, using the Address and either the Borough or ZIP Code, to get the BIN, BBL, and other standardized location details. Records that come with both a geometry and an address are also run through Geoclient in a separate batch to standardize the addresses and fill in other missing information like BIN and BBL.
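
A rough sketch of how the batch of records needing geocoding might be selected before being sent to Geoclient; the column names are assumptions:

    -- Records with an address but no geometry are candidates for Geoclient geocoding.
    SELECT uid, address, boro, zipcode
    FROM facilities
    WHERE geom IS NULL
      AND address IS NOT NULL
      AND (boro IS NOT NULL OR zipcode IS NOT NULL);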

Records that are provided in the source data with only coordinates and no addresses are processed by doing a spatial join with MapPLUTO to get the BBL and other location related details like Address, Borough, ZIP Code, and BIN when there is a 1-1 BIN-BBL relationship.
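
A simplified version of that spatial join; the MapPLUTO table and column names are assumptions:

    -- Point-in-polygon join: take the BBL and related fields from the tax lot the point falls in.
    UPDATE facilities f
    SET bbl     = p.bbl,
        boro    = p.borough,
        zipcode = p.zipcode
    FROM mappluto p
    WHERE f.bbl IS NULL
      AND f.geom IS NOT NULL
      AND ST_Within(f.geom, p.geom);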

The coordinates provided by Geoclient are generally not inside tax lots. There are also many cases where an agency provides coordinates, but the coordinates fall in the road bed rather than inside a BBL boundary, due to the geocoding technique used by the source agency. For these reasons, the BIN centroid is used to overwrite geometries. If a record does not have a BIN but has been assigned a BBL, the BBL centroid is used to overwrite the geometry instead.

Each record in the database is flagged with a code for the geoprocessing technique that was used to complete all of its information in the processingflag field.
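
A sketch of the geometry overwrite and the processing flag, assuming a building footprints table keyed on BIN; the table names and flag values are placeholders:

    -- Prefer the building (BIN) centroid; fall back to the tax lot (BBL) centroid.
    UPDATE facilities f
    SET geom = ST_Centroid(b.geom),
        processingflag = 'bin_centroid'   -- placeholder flag value
    FROM building_footprints b
    WHERE f.bin = b.bin;

    UPDATE facilities f
    SET geom = ST_Centroid(p.geom),
        processingflag = 'bbl_centroid'   -- placeholder flag value
    FROM mappluto p
    WHERE f.bin IS NULL
      AND f.bbl = p.bbl;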

A diagram of the Geoprocessing steps is provided below.

The 3_geoprocessing.sh script runs all the Geoprocessing steps and is annotated to describe each of the scripts used.

4. Deduping

Several of the source datasets have content which overlaps with other datasets. Duplicate records are identified by querying for all the records that fall on the same BIN or BBL as a record with the same Facility Subgroup or Type, the same Facility Name, or the same Oversight Agency. The values from the duplicate records are merged before the duplicate records are dropped from the database.
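
An illustrative query for finding candidate duplicate pairs; the column names are assumptions and the actual matching logic is more involved:

    -- Pairs of records on the same BIN or BBL that share a subgroup/type, name, or oversight agency.
    SELECT a.uid, b.uid AS duplicate_uid
    FROM facilities a
    JOIN facilities b
      ON a.uid < b.uid
     AND (a.bin = b.bin OR a.bbl = b.bbl)
     AND (a.facsubgrp = b.facsubgrp
          OR a.facname = b.facname
          OR a.overagency = b.overagency);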

A diagram of the Deduping process is provided below.

The 4_deduping.sh script runs all the steps in the Deduping process and is annotated to describe each of the scripts used.

5. Exporting

The 5_export.sh script runs the scripts that export the data views that get published and are used for the NYC Facilities Explorer and other documentation.
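
As an illustration, a single export step might resemble the following psql command; the view name and output path are placeholders:

    -- Export a published data view to CSV from psql.
    \copy (SELECT * FROM facilities_export_view) TO 'facdb_export.csv' CSV HEADER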

Updating and Maintaining the Database

This process is still in development.

See the diagram of the proposed Maintenance workflow provided below.

Process Diagrams

Assembly

Assembly Diagram

Geoprocessing

Geoprocessing Diagram

Deduping

Deduping Diagram

Maintenance

Maintenance Diagram