NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

FacDB 24v1 #461

Closed damonmcc closed 4 months ago

damonmcc commented 10 months ago

Product Name

facilities

Build Version

24Q1

Status of Update

sf-dcp commented 10 months ago

Update steps (old issue template)

FacDB Source Data Updates

Like most of our data products, source data must be updated in data library before FacDB is run. As there are many source datasets with varied update processes, this issue template should be opened to track progress towards updating all source data.

All source data listed is to be uploaded as .csv files

Geosupport

Version Env

Scraped by data library

Source data from OpenData

To see if a dataset needs to be uploaded, check date last updated in open data against version in data library
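The date check described above could be sketched roughly as follows. This is an illustration only: it assumes the Open Data timestamp is available as epoch seconds (for example, the `rowsUpdatedAt` value in Socrata view metadata) and that the data library version is a `YYYYMMDD` string. The function name `needs_update` is hypothetical and not part of data library.

```python
from datetime import datetime, timezone


def needs_update(rows_updated_at: int, library_version: str) -> bool:
    """Return True when the Open Data copy is newer than the archived version.

    rows_updated_at: epoch seconds reported by Open Data (assumed input).
    library_version: archived version, assumed to be formatted as YYYYMMDD.
    """
    source_date = datetime.fromtimestamp(rows_updated_at, tz=timezone.utc).date()
    archived_date = datetime.strptime(library_version, "%Y%m%d").date()
    return source_date > archived_date
```

A dataset whose source date is the same as or older than the archived version would be skipped.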

Manually check data for updates

These don't report the date updated as neatly as the open datasets, so we have to look at the data itself

Manual download

Will receive via email or FTP

Last step

TODO

sf-dcp commented 10 months ago

Notes for source data:

dfta_contracts:

usnps_parks:

doe_universalprek:

fvankrieken commented 10 months ago

@AmandaDoyle - dycd_afterschoolprograms seems to not exist in socrata, and we haven't updated it since 2017.

This seems like a potential replacement, but looks potentially similar to dfta_contracts in that the dataset has changed dramatically

fvankrieken commented 10 months ago

@alexrichey what did you do for usdot_airports, usdot_ports, and nysdec_lands in #51 ?

sf-dcp commented 9 months ago

Following up on dfta_contracts source data.

As mentioned previously, there have been changes in the new dfta_contracts version.

1) Column names have changed and there are more columns in the new version overall. We use only three columns in FacDB builds, all of which are present in the new version. The column mapping (old --> new) is the following:

* `contract_type` --> `providertype`
* `provider_id` --> `dfta_id`
* `program_address` --> `programaddress`
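One way to absorb the rename is to map the new column names back to the old ones before the build reads them. The sketch below is an assumption about how this could be handled, not a description of the actual FacDB code. `COLUMN_MAP` and `to_old_schema` are hypothetical names, and rows are represented as plain dicts.

```python
# New column name -> old column name, taken from the mapping above.
COLUMN_MAP = {
    "providertype": "contract_type",
    "dfta_id": "provider_id",
    "programaddress": "program_address",
}


def to_old_schema(row: dict) -> dict:
    """Keep only the three columns the build uses, renamed to their old names."""
    missing = [new for new in COLUMN_MAP if new not in row]
    if missing:
        # Fail loudly so a future schema change is noticed right away.
        raise KeyError(f"missing expected columns: {missing}")
    return {old: row[new] for new, old in COLUMN_MAP.items()}
```

The alternative is to update the build queries to use the new names directly. Either way, the extra columns in the new version are ignored.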

2) contract_type column values have changed. This column is used in our build to classify dfta_contracts records with our own categories for the factype column: Senior Center, Senior Services, and Home Delivered Meals. Note that the Home Delivered Meals category used to come from the contract_type column (i.e. we just kept the value from the table instead of creating our own).

There are 15 categories in the old version and 12 in the new version. Below are the categories with their corresponding total record counts:

image

Questions for @AmandaDoyle :

cc: @fvankrieken

damonmcc commented 9 months ago

I'm investigating how we'll update use of dycd_afterschoolprograms

The original Socrata dataset is gone. In the likely replacement dataset, named DYCD Program Sites (here), afterschool programs are a subset of the records. I'm likely to suggest we consider this a new dataset to start archiving and using.

damonmcc commented 9 months ago

noting that usnps_parks was archived today here so that the latest folder has all three default formats: parquet, csv, pgdump

and then archived again using a non-zipped FileGeodatabase here

sf-dcp commented 9 months ago

> noting that usnps_parks was archived today here so that the latest folder has all three default formats: parquet, csv, pgdump
>
> and then archived again using a non-zipped FileGeodatabase here

The gdb file was converted with data-library, correct? @damonmcc

damonmcc commented 9 months ago

@fvankrieken

yup! @sf-dcp figured it out locally so then I made changes in #546 and ran in CI

edit: to clarify, the gdb was ingested by data-library and converted to our 3 default formats

noting though that a *.gdb file is pretty much a folder with lots of files in it. I'm not sure if it works with a zipped version of that folder, but that'd be nice
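For reference, zipping a `.gdb` folder so that the archive keeps the folder itself at its top level could look like the sketch below. Whether data-library can ingest the zipped result is not confirmed in this thread, and `zip_gdb` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def zip_gdb(gdb_dir: str) -> str:
    """Zip a FileGeodatabase folder and return the path of the new archive.

    The archive is written next to the folder as <name>.gdb.zip and contains
    the .gdb folder itself as its single top-level entry.
    """
    gdb_path = Path(gdb_dir)
    if not gdb_path.is_dir():
        raise NotADirectoryError(f"{gdb_dir} is not a folder")
    return shutil.make_archive(
        base_name=str(gdb_path),        # make_archive appends ".zip"
        format="zip",
        root_dir=str(gdb_path.parent),  # archive paths are relative to this
        base_dir=gdb_path.name,         # only include the .gdb folder
    )
```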

AmandaDoyle commented 9 months ago

@sf-dcp

In the new version, there are 2 meal categories and both of them are named differently when compared to the old one. Do we want to assign them to the old value Home Delivered Meals or do we want to name it something different?

The current logic is `WHEN contract_type LIKE '%MEALS%' THEN initcap(contract_type)`. If the record with provider type that is now CITY MEALS ADMINISTRATIVE SERVICES CONTRACTS was HOME DELIVERED MEALS in the previous version of the data, then yes, assign both "HOME DELIVERED MEAL SERVICE CONTRACTS" and "CITY MEALS ADMINISTRATIVE SERVICES CONTRACTS" to Home Delivered Meals.

"OLDER ADULT CENTER CONTRACTS " should be categorized as Senior Center. Tell me more about the records that are "NATURALLY OCCURING RETIREMENT COMMUNITY CONTRACTS." (We can talk about this when we meet). What function do they provide, are they a location where people go?

sf-dcp commented 9 months ago

> @sf-dcp
>
> In the new version, there are 2 meal categories and both of them are named differently when compared to the old one. Do we want to assign them to the old value Home Delivered Meals or do we want to name it something different?
>
> The current logic is `WHEN contract_type LIKE '%MEALS%' THEN initcap(contract_type)`. If the record with provider type that is now CITY MEALS ADMINISTRATIVE SERVICES CONTRACTS was HOME DELIVERED MEALS in the previous version of the data, then yes, assign both "HOME DELIVERED MEAL SERVICE CONTRACTS" and "CITY MEALS ADMINISTRATIVE SERVICES CONTRACTS" to Home Delivered Meals.
>
> "OLDER ADULT CENTER CONTRACTS " should be categorized as Senior Center. Tell me more about the records that are "NATURALLY OCCURING RETIREMENT COMMUNITY CONTRACTS." (We can talk about this when we meet). What function do they provide, are they a location where people go?

Per further discussion with Amanda, there is only 1 CITY MEALS ADMINISTRATIVE SERVICES CONTRACTS record which appears to be a home delivery service. We will hardcode this contract to be classified as Home Delivered Meals and add a note to the FacDB build template to validate additional CITY MEALS ADMINISTRATIVE SERVICES CONTRACTS records in the future. We don't want to classify the whole category as Home Delivered Meals to avoid the future risk of misclassifying these records as the category is not 100% clear.

Regarding NORC, these programs seem to be a service type rather than a center type. See the description of NORC programs here. We will classify NORC programs accordingly.
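The classification decisions in this thread can be summarized in a rough sketch. This is not the actual build logic, which lives in SQL. The function name `classify_factype` is hypothetical, the ID in `HARDCODED_HOME_DELIVERED_IDS` is a placeholder rather than the real contract ID, and the new categories not discussed in this thread are deliberately left unclassified.

```python
from typing import Optional

# Placeholder for the single CITY MEALS contract confirmed as home delivery.
HARDCODED_HOME_DELIVERED_IDS = {"EXAMPLE-DFTA-ID"}


def classify_factype(providertype: str, dfta_id: str) -> Optional[str]:
    """Map a new-schema providertype to a factype, or None if undecided."""
    ptype = providertype.strip().upper()
    if ptype == "HOME DELIVERED MEAL SERVICE CONTRACTS":
        return "Home Delivered Meals"
    if ptype == "CITY MEALS ADMINISTRATIVE SERVICES CONTRACTS":
        # Only the known record is classified; new ones need manual validation.
        if dfta_id in HARDCODED_HOME_DELIVERED_IDS:
            return "Home Delivered Meals"
        return None
    if ptype == "OLDER ADULT CENTER CONTRACTS":
        return "Senior Center"
    if "RETIREMENT COMMUNITY" in ptype:
        # NORC programs are treated as a service rather than a center.
        return "Senior Services"
    return None  # categories not covered in this thread
```

Returning `None` for an unrecognized CITY MEALS record makes it easy to surface new records for review instead of silently classifying them.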

cc: @fvankrieken

sf-dcp commented 9 months ago

Shared with GIS team for QA.

sf-dcp commented 6 months ago

@croswell81 & @jackrosacker, just checking in to see if there’s any update on the review?

croswell81 commented 6 months ago

@sf-dcp No, but we plan to do this next sprint

sf-dcp commented 5 months ago

Update: addressed issues from the second QAQC review and shared corrected outputs with GIS. Pending their review.

damonmcc commented 4 months ago

it's on Bytes and I'll distribute to Open Data

damonmcc commented 4 months ago

distributed to Open Data page here

updated metadata in https://github.com/NYCPlanning/product-metadata/pull/6

ran it locally so no action to link to, but could run again using the github action for tracking purposes

update: used github action here to push the same data and test the metadata PR