NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

FacDB 24v2 #989

Closed damonmcc closed 6 days ago

damonmcc commented 4 months ago

FacDB page on QA app here


Source Data Updates

**All source data listed is to be uploaded as .csv files**

## Scraped by data library

- [x] bpl_libraries
  Source: Scraped from BPL website
  Source url: https://www.bklynlibrary.org/locations/json
- [x] nypl_libraries
  Source: Scraped from NYPL website
  Source url: https://www.nypl.org/locations/list
- [x] uscourts_courts
  Source: Court locator for NY state
  Source url: http://www.uscourts.gov/court-locator/city/New%20York/state/NY
- [x] dcp_colp
  Source: Bytes
  Source url: https://www1.nyc.gov/site/planning/data-maps/open-data.page#city_facilities
  Need to specify version when archiving the data (check on the website for date last updated).
- [x] nysoasas_programs
  Source: Scraped from OASAS website
  Source url: https://webapps.oasas.ny.gov/providerDirectory/index.cfm?search_type=2
  Need to set version to today’s date as DD-Mon-YY. Example: 13-Nov-20

## Source data from OpenData

To see if a dataset needs to be uploaded, check date last updated in open data/bytes against version in data library.

- [x] dca_operatingbusinesses https://data.cityofnewyork.us/Business/Legally-Operating-Businesses/w7w3-xahh
- [x] dcp_facilities_with_unmapped https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-selfac.page
- [x] dcla_culturalinstitutions https://data.cityofnewyork.us/Recreation/DCLA-Cultural-Organizations/u35m-9t32
- [x] dfta_contracts https://data.cityofnewyork.us/Social-Services/Department-for-the-Aging-NYC-Aging-All-Contracted-/cqc8-am9x
  NOTE: `providertype` column has two types of meal categories. One of them, **CITY MEALS ADMINISTRATIVE SERVICES CONTRACTS**, should only have one record (city meals). If there are more records in this category, need to revise code in `dfta_contracts.sql`.
- [x] doe_busroutesgarages https://data.cityofnewyork.us/Transportation/Routes/8yac-vygm
- [x] dot_pedplazas https://data.cityofnewyork.us/Transportation/Routes/k5k6-6jex
- [x] sca_enrollment_capacity https://data.cityofnewyork.us/Education/Enrollment-Capacity-And-Utilization-Reports-Target/8b9a-pywy
- [x] dohmh_daycare https://data.cityofnewyork.us/Health/DOHMH-Childcare-Center-Inspections/dsg6-ifza
- [x] dpr_parksproperties https://nycopendata.socrata.com/Recreation/Parks-Properties/enfh-gkve
  NOTE: DPR open data table URLs are not consistent. Be sure to double-check before running from the recipes app.
- [x] dsny_garages https://data.cityofnewyork.us/Environment/DSNY-Garages/xw3j-2yxf
- [x] dsny_specialwastedrop https://data.cityofnewyork.us/Environment/DSNY-Special-Waste-Drop-off-Sites/242c-ru4i
- [x] dsny_donatenycdirectory https://data.cityofnewyork.us/Environment/DSNY-DonateNYC-Directory/gkgs-za6m
- [x] dsny_leafdrop https://data.cityofnewyork.us/Environment/Leaf-Drop-Off-Locations-in-NYC/8i9k-4gi5
- [x] dsny_fooddrop https://data.cityofnewyork.us/Environment/Food-Scrap-Drop-Off-Locations-in-NYC/if26-z6xq
- [x] dsny_electronicsdrop https://data.cityofnewyork.us/Environment/Electronics-Drop-Off-Locations-in-NYC/wshr-5vic
- [x] fdny_firehouses https://data.cityofnewyork.us/Public-Safety/FDNY-Firehouse-Listing/hc8x-tcnd
- [x] hhc_hospitals https://data.cityofnewyork.us/Health/Health-and-Hospitals-Corporation-HHC-Facilities/f7b6-v6v3
- [x] hra_jobcenters https://data.cityofnewyork.us/Business/Directory-Of-Job-Centers/9d9t-bmk7
- [x] hra_medicaid https://data.cityofnewyork.us/City-Government/Medicaid-Offices/ibs4-k445
- [x] hra_snapcenters https://data.cityofnewyork.us/Social-Services/Directory-of-SNAP-Centers/tc6u-8rnp
- [x] moeo_socialservicesitelocations https://data.cityofnewyork.us/City-Government/Verified-Locations-for-NYC-City-Funded-Social-Serv/2bvn-ky2h
- [x] nycha_communitycenters https://data.cityofnewyork.us/Social-Services/Directory-of-NYCHA-Community-Facilities/crns-fw6u
- [x] nycha_policeservice https://data.cityofnewyork.us/Housing-Development/NYCHA-PSA-Police-Service-Areas-/72wx-vdjr
  NOTE: this data is shown as a map on the website.
- [x] nysdec_solidwaste https://data.ny.gov/Energy-Environment/Solid-Waste-Management-Facilities/2fni-raj8
- [x] nysdoh_healthfacilities https://health.data.ny.gov/Health/Health-Facility-General-Information/vn5v-hh5r
- [x] nysdoh_nursinghomes https://health.data.ny.gov/Health/Nursing-Home-Weekly-Bed-Census-Last-Submission/izta-vnpq
- [x] nysed_nonpublicenrollment http://www.p12.nysed.gov/irs/statistics/nonpublic/
- [x] nysomh_mentalhealth https://data.ny.gov/Human-Services/Local-Mental-Health-Programs/6nvr-tbv8
- [x] nysopwdd_providers https://data.ny.gov/Human-Services/Directory-of-Developmental-Disabilities-Service-Pr/ieqx-cqyk
- [x] nysparks_historicplaces https://data.ny.gov/Recreation/National-Register-of-Historic-Places/iisn-hnyv
- [x] nysparks_parks https://data.ny.gov/Recreation/State-Park-Facility-Points/9uuk-x7vh
- [x] qpl_libraries https://data.cityofnewyork.us/Education/Queens-Library-Branches/kh3d-xhq7
- [x] sbs_workforce1 https://data.cityofnewyork.us/dataset/Center-Service-Locations/6smc-7mk6
- [x] usdot_airports https://geodata.bts.gov/datasets/aviation-facilities/explore?location=50.755910%2C-117.686932%2C22.58&showTable=true
  Use data-library to extract and archive data. Note, data-library uses a different link to extract the data. This link is provided as a reference for last date updated.
- [x] usdot_ports https://data-usdot.opendata.arcgis.com/datasets/usdot::docks/about
  Use data-library to extract and archive data

## Manually check data for updates

The source datasets don't currently provide an API or export option on their websites. The data engineering team initially created the datasets and continues to maintain them internally, relying on the information available on the websites. Also, the source data websites don't report the date updated as neatly as the open datasets do, so we have to look at the data itself.

- [x] fbop_corrections https://www.bop.gov/locations/list.jsp
  When searching by state, there should be 5 NY prisons, 3 of which are in NYC (Brooklyn/New York)
- [x] nycdoc_corrections https://www1.nyc.gov/site/doc/about/facilities-locations.page
  Source: NYCDOC locations directory
- [x] nycourts_courts http://www.nycourts.gov/courts/nyc/criminal/generalinfo.shtml#BRONX_COUNTY
- [x] nysdoccs_corrections https://doccs.ny.gov/find-facility
  Hand check for 1 facility in Queens, 1 facility in Manhattan, 0 in the other 3 boros. Only look at the correctional facility locations, not the offices.

## Manual download

**`TODO`**: update this section to replace any documentation (or just put a TODO) around any "local machine" steps to indicate that we should put these files in edm-recipes/inbox on S3, and change the library template path to point there, e.g. what we do [here](https://github.com/NYCPlanning/data-engineering/blob/cd5c71319b0c093154baf8885b3badf6709451fd/dcpy/library/templates/panynj_jfk_65db.yml#L6).

Manually download the following datasets to your local machine and make tweaks if needed per individual instructions. After downloading, use the data library CLI to archive the data to S3. Refer to the dataset templates in data library to see where it expects the data, since that's where it will search when archiving.

Sample CLI command run locally to archive the `foodbankny_foodbanks` dataset to S3:

```bash
library archive --s3 --name foodbankny_foodbanks --latest --version 20240108 --output-format csv
```

- [x] nysdec_lands https://data.gis.ny.gov/datasets/84b4cce8a8974c31a1c5584540f3aaae_0/about
- [x] doe_lcgms https://data.cityofnewyork.us/Education/LCGMS-DOE-School-Information-Report/3bkj-34v2
  This dataset is updated for CEQR
- [x] foodbankny_foodbanks http://www.foodbanknyc.org/get-help/
  1. Head to http://www.foodbanknyc.org/get-help/
  2. Navigate to the map and make a copy of the map. The map can be found in the "Find Food Near You" drop-down menu. Note, you need to be logged in with a Google account in order to have an option to copy the map.
  3. After making a copy, click on the three dots next to the target layer, click "Export Data", and export as a csv.
  4. Rename the file (still as a csv) to match `Food_Bank_For_NYC_Open_Members_as_of_DATE(YYYYMMDD)`. You will need to convert the existing date format MMDDYY to YYYYMMDD so that the version matches the existing date format standard in data library. Example: `Food_Bank_For_NYC_Open_Members_as_of_20240108.csv`
  5. Place it in the library/tmp folder.
  6. Then run `library archive --name foodbankny_foodbanks` with the `--version` flag set to the DATE in the file path.

  `url: "http://www.foodbanknyc.org/get-help/"`
  `dependents: []`
- [x] nysed_activeinstitutions https://eservices.nysed.gov/sedreports/list?id=1
  Active Institutions with GIS coordinates and OITS Accuracy Code - Select by County__ CSV. Note that .csv data is automatically downloaded without comma delimiter. Exporting to csv from numbers is one way to get around this issue. (Exporting as an xls and converting to a csv is also an option)
- [x] usnps_parks https://irma.nps.gov/DataStore/Reference/Profile/2302064
  NOTE: the final number in the URL (2302064) is not always stable. If the data is missing, search through the home.

### Will receive via email or FTP

- [x] dot_bridgehouses
- [x] dot_ferryterminals
- [x] dot_mannedfacilities
- [x] dot_publicparking
- [ ] doe_universalprek (https://maps.nyc.gov/prek/data/pka/pka.csv)

## Last step

- [x] dcp_pops
  Source: Download from POPs app, available on DCP Commons. Be sure to only take the public version.
  *Be sure to do this source last, as the OpenData release of POPs needs to be in sync*

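
The checklist above uses two version conventions: DD-Mon-YY for `nysoasas_programs` and YYYYMMDD for the renamed `foodbankny_foodbanks` file. A minimal Python sketch of both conversions follows. The function names are hypothetical and are not part of data library; the month abbreviations are spelled out so the result does not depend on the machine's locale.

```python
from datetime import date, datetime

# English month abbreviations, hard-coded so output is locale-independent.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def oasas_version(day: date) -> str:
    """Format a date as DD-Mon-YY, e.g. 13-Nov-20 (hypothetical helper)."""
    return f"{day.day:02d}-{MONTHS[day.month - 1]}-{day.year % 100:02d}"


def foodbank_version(mmddyy: str) -> str:
    """Convert an MMDDYY date string to YYYYMMDD (hypothetical helper)."""
    return datetime.strptime(mmddyy, "%m%d%y").strftime("%Y%m%d")


def foodbank_filename(mmddyy: str) -> str:
    """Build the csv file name expected by the checklist's naming pattern."""
    return f"Food_Bank_For_NYC_Open_Members_as_of_{foodbank_version(mmddyy)}.csv"


if __name__ == "__main__":
    print(oasas_version(date(2020, 11, 13)))
    print(foodbank_filename("010824"))
```

The value returned by `foodbank_version` is the same string that would be passed to the `--version` flag when archiving.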
damonmcc commented 3 months ago

sent email to request DOT data

sf-dcp commented 3 months ago

Would be nice to add the standard checklist for data loaded, successfully built, etc...

damonmcc commented 3 months ago

requested latest Universal Pre-K data from DOE on 8/12. the last time we got new data was January 2024

will proceed without new data

damonmcc commented 3 months ago

promoted a build to draft, created QA issue, assigned it to GIS

damonmcc commented 1 month ago

running notes on significant changes in 24v2 draft 1 QA

conclusion

I should rebuild with new nysoasas_programs source data

damonmcc commented 1 month ago

notes on 24v2 latest build (may be promoted to draft 3)

queries and data

query for duplicate UNDEVELOPED records

```sql
-- records where UID is null were dropped on their way to the table facdb
with facdb as (
    select * from dm_facdb_qa.facdb where "FACSUBGRP" = 'UNDEVELOPED'
),
parks as (
    select * from dm_facdb_qa."_dpr_parksproperties" where facsubgrp = 'Undeveloped'
)
select
    facdb."UID",
    parks.*
from parks
left join facdb on facdb."UID" = parks.uid
order by facname, boro, wkb_geometry, "UID", uid;
```
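
The left join in the query keeps every Undeveloped parks record and leaves "UID" null where the record never reached `facdb`. The same logic can be sketched in plain Python on a few made-up rows. All values below are hypothetical stand-ins, not real FacDB data, and the geometry column is left out of the sort for brevity.

```python
# Hypothetical rows standing in for dm_facdb_qa."_dpr_parksproperties".
parks = [
    {"uid": "a1", "facname": "Lot A", "boro": "BK"},
    {"uid": "a2", "facname": "Lot A", "boro": "BK"},
    {"uid": "b1", "facname": "Lot B", "boro": "QN"},
]
# Hypothetical set of UIDs that made it into dm_facdb_qa.facdb.
facdb_uids = {"a1", "b1"}


def left_join(parks_rows, kept_uids):
    """Keep every parks row; "UID" is None when the row is absent from facdb."""
    rows = []
    for park in parks_rows:
        facdb_uid = park["uid"] if park["uid"] in kept_uids else None
        rows.append({"UID": facdb_uid, **park})
    # Sort so likely duplicates (same name and borough) sit next to each other.
    return sorted(rows, key=lambda r: (r["facname"], r["boro"], r["uid"]))


joined = left_join(parks, facdb_uids)
dropped = [row["uid"] for row in joined if row["UID"] is None]

if __name__ == "__main__":
    for row in joined:
        print(row)
    print("dropped before facdb:", dropped)
```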

Scrap Metal Processor records and the most recent license expiration dates in the versions used in FacDB 24v1 and these builds

this shows that the latest version we have (20240809) doesn't have any expiration dates after the dates of our build

| version | license_expiration_date | count_ |
| --- | --- | --- |
| 20230714 | 2021-06-30 | 5 |
| 20240809 | 2021-06-30 | 5 |
| 20230714 | 2022-06-30 | 7 |
| 20240809 | 2022-06-30 | 7 |
| 20230714 | 2023-06-30 | 14 |
| 20240809 | 2023-06-30 | 13 |
| 20230714 | 2024-06-30 | 64 |
| 20240809 | 2024-06-30 | 65 |
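
The comparison above can be checked mechanically. This short Python sketch takes the rows exactly as listed and computes, for each source version, the latest license expiration date and the total record count. It is only an illustration of the check, not the query that produced the table.

```python
from collections import defaultdict

# (version, license_expiration_date, count_) rows copied from the table above.
rows = [
    ("20230714", "2021-06-30", 5),
    ("20240809", "2021-06-30", 5),
    ("20230714", "2022-06-30", 7),
    ("20240809", "2022-06-30", 7),
    ("20230714", "2023-06-30", 14),
    ("20240809", "2023-06-30", 13),
    ("20230714", "2024-06-30", 64),
    ("20240809", "2024-06-30", 65),
]

latest = {}                 # version -> most recent expiration date
totals = defaultdict(int)   # version -> total number of records

for version, expiration, count in rows:
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    latest[version] = max(latest.get(version, expiration), expiration)
    totals[version] += count

if __name__ == "__main__":
    for version in sorted(latest):
        print(version, latest[version], totals[version])
```

Both versions top out at the same expiration date, which matches the observation that the newer version adds no later expiration dates.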
damonmcc commented 1 month ago

@caseysmithpgh

the latest dm-facdb-qa build looks good! promoted to draft 3 for QA and opened a QA issue

here's a summary of what I did or found about your draft 1 notes:

caseysmithpgh commented 2 weeks ago

@damonmcc @alexrichey @sf-dcp @fvankrieken

Good morning, can someone please promote FacDB Draft 3 to /Publish/latest

damonmcc commented 2 weeks ago

@caseysmithpgh done with this action run

caseysmithpgh commented 2 weeks ago

@damonmcc

Could you please update the source data dates in the Data Sources tab in the FacDB Data Dictionary?

caseysmithpgh commented 2 weeks ago

@damonmcc

Also--just noticed that the POPS csv and shp zips that our scripts expect are not part of the existing latest/ contents here. See 1/31/24 version for expected zip files.

damonmcc commented 2 weeks ago

@caseysmithpgh updated the Data Sources tab!

damonmcc commented 2 weeks ago

> Also--just noticed that the POPS csv and shp zips that our scripts expect are not part of the existing latest/ contents here. See 1/31/24 version for expected zip files.

looks like we stopped archiving source data as zipped shapefiles and zipped csvs and didn't account for your need for those formats in edm-recipes/!

I'll generate those files asap and put them in edm-recipes/datasets/dcp_pops/latest

separate from FacDB, DE and GIS should see what other shapefiles your scripts expect to find in edm-recipes/
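
For the zipped files mentioned above, here is a minimal sketch using Python's standard `zipfile` module. The function name and the file names in the usage note are hypothetical, and this is not the team's actual archiving code. A shapefile is made up of several sidecar files (such as .shp, .shx, .dbf, .prj), so the helper accepts a list of paths.

```python
import zipfile
from pathlib import Path
from typing import Iterable


def zip_files(paths: Iterable[Path], zip_path: Path) -> Path:
    """Write the given files into one zip archive, stored flat by file name."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            # arcname drops the directory part so the archive has no nested folders.
            archive.write(path, arcname=Path(path).name)
    return zip_path
```

Assuming a local `dcp_pops.csv` exists, `zip_files([Path("dcp_pops.csv")], Path("dcp_pops.csv.zip"))` would produce the zipped csv; the shapefile zip would pass all of its component files in the same call.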

damonmcc commented 1 week ago

has been distributed to Bytes, I think still needs to be distributed to Open Data