AmandaDoyle opened 1 year ago
Script is breaking for bpl_libraries, will investigate
I contacted DOT for the 5 datasets we receive via email, will update when I hear back from them
dcla_culturalinstitutions hasn't been updated since 20210625
dfta_contracts hasn't been updated since 20210423
sca_enrollment_capacity not updated since 20191230
dsny_textiledrop link no longer works, might need to take this from a kml file. See map here: https://www1.nyc.gov/assets/dsny/site/services/donate-goods/textiles
dycd_afterschoolprograms not updated since 20170916
Need to wait for updated dcp_colp data to appear on bytes before updating
usdot_ports URL no longer works
@td928 Why did you change the link to the doe_lcgms? I thought it was the same link as before
updated dcp_colp today
Roadblock: Waiting on UPK data from DOE
Just commenting for posterity - DOE UPK data had an error essentially the same as #592. While trying to figure out how to find an odd byte in the csv file in VSCode, I came across a post suggesting searching the text with the regex [^\x00-\x7f]. It did the trick. Looking at the line in Excel, it seemed to just be a space before the number in a cell, but clearly something weird was copied in.
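The same regex check can be scripted so it doesn't depend on a VSCode search. A minimal sketch (the function name and inputs are illustrative, not part of the actual pipeline):

```python
import re

# Same pattern as the VSCode search: anything outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")

def find_non_ascii(lines):
    """Yield (line_number, character) for every non-ASCII character found."""
    for lineno, line in enumerate(lines, start=1):
        for match in NON_ASCII.finditer(line):
            yield lineno, match.group()
```

Running this over the csv lines flags oddities that look like plain spaces in Excel - e.g. a non-breaking space (U+00A0) pasted in from a spreadsheet would be reported with its line number.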
Generated data from the dev branch (minor tweaks on the branch, mainly docker config). QAQC looks relatively normal I think - the main thing is a near doubling of colp facilities; do we have any context on whether that should be expected?
On the QAQC branch, select branch 583-data-loading-and-maintenance to see top changes by datasource, facgroup, facsubgroup, and factype.
@fvankrieken that's pretty frustrating with the bytes issue. Do we receive the doe_upk data as an excel file and then convert it to a csv and ingest that csv via data-library? It would be nice if we could handle these issues more robustly
That's currently the workflow, yes
The differences in the categories you highlighted are kind of alarming, specifically when looking at something like emergency services in facgroup or FIREHOUSE in factype. We definitely didn't build twice as many firehouses in the last year or so.
I do have an explanation for the facsubgroup Chemical Dependency changes and the addition of Substance Use Disorder Treatment - this was an update of language, so they are a "like" for "like" change. See this PR for clarification: https://github.com/NYCPlanning/db-facilities/pull/598
So it seems to be a column formatting issue. For colp processing, facdb expects usecode as four-digit strings. That is generally how they are formatted in downloads from bytes, but in the latest one on bytes they are simply ints, meaning where we would see ... ,"0520", ... in the csv we instead have ... ,520, ... So filtering based on usecode only works when the code is 1000 or greater.
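A minimal illustration of the mismatch, using the code from the example above:

```python
# "0520" is the four-digit string facdb expects; the latest bytes
# download stores the same code as the int 520.
padded = "0520"
unpadded = str(520)

assert unpadded != padded     # sub-1000 codes no longer match the expected format
assert str(1234) == "1234"    # codes of 1000 or more look the same either way

# Zero-padding restores the expected format:
assert f"{520:04d}" == padded
```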
I reloaded the latest public colp in data library, using script as the source instead of url:
import pandas as pd
from zipfile import ZipFile
import requests

from . import df_to_tempfile


class Scriptor:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    @property
    def version(self):
        return self.config["dataset"]["version"]

    def ingest(self) -> pd.DataFrame:
        url = f"https://s-media.nyc.gov/agencies/dcp/assets/files/zip/data-tools/bytes/nyc_colp_csv_{self.version}.zip"
        r = requests.get(url, stream=True)
        with open(f"nyc_colp_csv_{self.version}.zip", "wb") as fd:
            for chunk in r.iter_content(chunk_size=128):
                fd.write(chunk)
        with ZipFile(f"nyc_colp_csv_{self.version}.zip", "r") as zf:
            zf.extract(f"colp_{self.version}.csv")
        df = pd.read_csv(f"colp_{self.version}.csv")
        # Restore the four-digit zero-padded strings facdb expects
        df["USECODE"] = df["USECODE"].apply(lambda i: f"{i:04d}")
        return df

    def runner(self) -> str:
        df = self.ingest()
        local_path = df_to_tempfile(df)
        return local_path
Not sure if we want to actually use this in general, as this is more a data issue that needs to be fixed in my mind (and therefore should give us some sort of error), but maybe we should explicitly throw an error in the data library step if this column is misformatted
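A sketch of the kind of loud failure being suggested, assuming the USECODE column name from the script above (the function name and error message are placeholders, not the actual data library step):

```python
import pandas as pd

def validate_usecode(df: pd.DataFrame) -> None:
    """Raise if USECODE is not a column of four-digit zero-padded strings."""
    bad = df["USECODE"][~df["USECODE"].astype(str).str.fullmatch(r"\d{4}")]
    if not bad.empty:
        raise ValueError(
            f"USECODE misformatted in {len(bad)} rows, e.g. {bad.iloc[0]!r}; "
            "expected four-digit zero-padded strings"
        )
```

Called right after ingest, this would have surfaced the int-formatted usecodes immediately instead of silently dropping sub-1000 codes downstream.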
But new issues now:
So that's resolved - there were issues with formatting when reading in files via script rather than simply url in data library (a couple of columns got coerced to float, leading "1" to be saved as "1.0", which still got read in as a string by facdb, causing similar issues).
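The coercion is easy to reproduce: any blank cell makes pandas infer a float column. One way to avoid it is reading everything as strings and only converting the columns that genuinely need numeric handling. A small demonstration (column names are just examples):

```python
import io

import pandas as pd

csv = io.StringIO("BIN,USECODE\n1000000,0520\n,0210\n")

# Default inference: the blank BIN makes the column float, so 1000000
# would round-trip through a csv as "1000000.0".
inferred = pd.read_csv(csv)
assert inferred["BIN"].dtype == "float64"

# dtype=str keeps every value exactly as written, leading zeros included.
csv.seek(0)
as_str = pd.read_csv(csv, dtype=str)
assert as_str["BIN"].iloc[0] == "1000000"
assert as_str["USECODE"].iloc[0] == "0520"
```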
So at this point looking better
A lot of additions in prek, which seems potentially a bit odd - would love others' thoughts here. A huge gain in scrap metal processing seems a bit odd as well.
> Not sure if we want to actually use this in general as this is more a data issue that needs to be fixed in my mind (and therefore should give us some sort of error), but maybe we should explicitly throw an error on data library step if this column is misformatted
I'm all for making things fail loudly upstream. We can build this into data library or whatever would be easiest for you to make build processes easier. Other ideas are to make the filtering more flexible so that it converts the field in COLP to an integer and filters on numeric values. Upstream data changes are a chronic issue.
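The more flexible filtering could look something like this - a sketch under the assumption that we compare codes numerically, not the actual facdb code; names are illustrative:

```python
import pandas as pd

def usecode_matches(usecodes: pd.Series, wanted: set) -> pd.Series:
    """Boolean mask of rows whose use code is in `wanted`, compared numerically.

    Treats "0520", "520", 520, and 520.0 all as the code 520, so upstream
    formatting changes don't silently drop rows.
    """
    numeric = pd.to_numeric(usecodes, errors="coerce")
    return numeric.isin(list(wanted))
```

Usage would be something like `df[usecode_matches(df["USECODE"], {520, 1234})]`. The trade-off versus failing loudly: this tolerates the formatting drift instead of surfacing it, so pairing it with an upstream check is probably still worthwhile.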
Summarizing what we discussed for posterity. Next steps and things to look into.
Scrap metal I think is okay - from spot checking, it seems that many of the new ones were classified as something else in the second-latest dca_operatingbusinesses ("Electronics store", "Secondhand dealer - general", etc., with facnames like "Scrap King"). We're not seeing the corresponding missing rows (as in -50 secondhand dealers) because those get filtered out in the python processing part of the step and aren't included.
EMERGENCY -> EMERGENCY/CRISIS change is from the nysomh data source, column program_category_description:
https://data.ny.gov/Human-Services/Local-Mental-Health-Programs/6nvr-tbv8
I don't see any information there about the change. Should we do anything in this case?
Re prek - the last build used data from 2021, so it's been a little while. This query (run locally against the built facdb) returns new sites based on bins not being in the last facdb (from the doe_universalprek datasource at least):
select a.facname
from facdb a
left join dcp_facilities_with_unmapped b
on a.bin::text = b.bin and b.datasource = 'doe_universalprek'
where a.datasource='doe_universalprek' and b.bin is null;
Glancing through, they all seem real. Of the 2027 rows from doe_universalprek, we have 1985 distinct bins - it seems sometimes there are multiple schools at the same location.
> EMERGENCY -> EMERGENCY/CRISIS change is from the nysomh data source, column program_category_description https://data.ny.gov/Human-Services/Local-Mental-Health-Programs/6nvr-tbv8 I don't see any information there about the change. Should we do anything in this case?
As long as this new facility type is being assigned to a subgroup, there's nothing for us to do in this case besides take note of the new facility types. If a new facility type is not being assigned to a subgroup, we'll need to assign it to one. Given that facility types are meant to reflect how agencies refer to a facility, we don't like to change them and prefer to take whatever is in the source data.
Gotcha. Groups and subgroups are consistent for these from previous builds
outputs currently in review
FacDB Source Data Updates
Like most of our data products, source data must be updated in data library before FacDB is run. As there are many source datasets with varied update processes, this issue template should be opened to track progress towards updating all source data.
All source data listed is to be uploaded as .csv files
Scraped by data library
[x] bpl_libraries Source: Scraped from BPL website Source url: https://www.bklynlibrary.org/locations/json
[x] nypl_libraries Source: Scraped from NYPL website Source url: https://www.nypl.org/locations/list
[x] uscourts_courts Source: Court locator for NY state Source url: http://www.uscourts.gov/court-locator/city/New%20York/state/NY
Source data from OpenData
To see if a dataset needs to be uploaded, check date last updated in open data against version in data library
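For the Socrata-hosted datasets this check can be scripted against the views metadata API (the endpoint pattern and rowsUpdatedAt field are standard Socrata; the helper names here are illustrative):

```python
from datetime import datetime, timezone

def version_from_metadata(metadata: dict) -> str:
    """Turn Socrata's rowsUpdatedAt unix timestamp into a YYYYMMDD version string."""
    ts = datetime.fromtimestamp(metadata["rowsUpdatedAt"], tz=timezone.utc)
    return ts.strftime("%Y%m%d")

def open_data_version(domain: str, dataset_id: str) -> str:
    """Fetch a dataset's last-updated date, e.g. ("data.cityofnewyork.us", "w7w3-xahh")."""
    import requests  # third-party; only needed when actually fetching

    r = requests.get(f"https://{domain}/api/views/{dataset_id}.json", timeout=30)
    r.raise_for_status()
    return version_from_metadata(r.json())
```

Compare the returned string against the version stored in data library to decide whether a dataset needs an update run.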
[x] dca_operatingbusinesses https://data.cityofnewyork.us/Business/Legally-Operating-Businesses/w7w3-xahh
[x] dcp_colp https://www1.nyc.gov/site/planning/data-maps/open-data.page#city_facilities
[x] dcla_culturalinstitutions https://data.cityofnewyork.us/Recreation/DCLA-Cultural-Organizations/u35m-9t32
[x] dfta_contracts https://data.cityofnewyork.us/Social-Services/DFTA-Contracts/6j6t-3ixh
[x] doe_busroutesgarages https://data.cityofnewyork.us/Transportation/Routes/8yac-vygm
[x] sca_enrollment_capacity https://data.cityofnewyork.us/Education/Enrollment-Capacity-And-Utilization-Reports-Target/8b9a-pywy
[x] dohmh_daycare https://data.cityofnewyork.us/Health/DOHMH-Childcare-Center-Inspections/dsg6-ifza
[x] dpr_parksproperties https://nycopendata.socrata.com/Recreation/Parks-Properties/enfh-gkve NOTE: DPR open data table URLs are not consistent. Be sure to double-check before running from the recipes app.
[x] dsny_garages https://data.cityofnewyork.us/Environment/DSNY-Garages/xw3j-2yxf
[x] dsny_specialwastedrop https://data.cityofnewyork.us/Environment/DSNY-Special-Waste-Drop-off-Sites/242c-ru4i
[ ] dsny_textiledrop https://data.cityofnewyork.us/Environment/Textile-Drop-Off-Locations-in-NYC/qnjm-wvu5
[x] dsny_donatenycdirectory https://data.cityofnewyork.us/Environment/DSNY-DonateNYC-Directory/gkgs-za6m
[x] dsny_leafdrop https://data.cityofnewyork.us/Environment/Leaf-Drop-Off-Locations-in-NYC/8i9k-4gi5
[x] dsny_fooddrop https://data.cityofnewyork.us/Environment/Food-Scrap-Drop-Off-Locations-in-NYC/if26-z6xq
[x] dsny_electronicsdrop https://data.cityofnewyork.us/Environment/Electronics-Drop-Off-Locations-in-NYC/wshr-5vic
[x] dycd_afterschoolprograms https://data.cityofnewyork.us/Education/DYCD-after-school-programs/mbd7-jfnc
[x] fdny_firehouses https://data.cityofnewyork.us/Public-Safety/FDNY-Firehouse-Listing/hc8x-tcnd
[x] hhc_hospitals https://data.cityofnewyork.us/Health/Health-and-Hospitals-Corporation-HHC-Facilities/f7b6-v6v3
[x] hra_jobcenters https://data.cityofnewyork.us/Business/Directory-Of-Job-Centers/9d9t-bmk7
[x] hra_medicaid https://data.cityofnewyork.us/City-Government/Medicaid-Offices/ibs4-k445
[x] hra_snapcenters https://data.cityofnewyork.us/Social-Services/Directory-of-SNAP-Centers/tc6u-8rnp
[x] moeo_socialservicesitelocations https://data.cityofnewyork.us/City-Government/Verified-Locations-for-NYC-City-Funded-Social-Serv/2bvn-ky2h
[x] nycha_communitycenters https://data.cityofnewyork.us/Social-Services/Directory-of-NYCHA-Community-Facilities/crns-fw6u
[x] nycha_policeservice https://data.cityofnewyork.us/Housing-Development/NYCHA-PSA-Police-Service-Areas-/72wx-vdjr
[x] nysdec_solidwaste https://data.ny.gov/Energy-Environment/Solid-Waste-Management-Facilities/2fni-raj8
[x] nysdoh_healthfacilities https://health.data.ny.gov/Health/Health-Facility-General-Information/vn5v-hh5r
[x] nysdoh_nursinghomes https://health.data.ny.gov/Health/Nursing-Home-Weekly-Bed-Census-Last-Submission/izta-vnpq
[x] nysomh_mentalhealth https://data.ny.gov/Human-Services/Local-Mental-Health-Programs/6nvr-tbv8
[x] nysopwdd_providers https://data.ny.gov/Human-Services/Directory-of-Developmental-Disabilities-Service-Pr/ieqx-cqyk
[x] nysparks_historicplaces https://data.ny.gov/Recreation/National-Register-of-Historic-Places/iisn-hnyv
[x] nysparks_parks https://data.ny.gov/Recreation/State-Park-Facility-Points/9uuk-x7vh
[x] qpl_libraries https://data.cityofnewyork.us/Education/Queens-Library-Branches/kh3d-xhq7
[x] sbs_workforce1 https://data.cityofnewyork.us/dataset/Center-Service-Locations/6smc-7mk6
[x] usdot_airports https://hub.arcgis.com/datasets/usdot::airports Head to url >> api >> copy url from geojson
[x] usdot_ports https://hub.arcgis.com/datasets/usdot::ports Head to url >> api >> copy url from geojson
[x] nysdec_lands http://gis.ny.gov/gisdata/inventories/details.cfm?DSID=1114
Manually check data for updates
These don't report date updated as neatly as the open datasets; you have to look at the data itself.
[x] fbop_corrections https://www.bop.gov/locations/list.jsp When searching by state, there should be 5 NY prisons, 3 of which are in NYC (Brooklyn/New York)
[x] nycdoc_corrections https://www1.nyc.gov/site/doc/about/facilities-locations.page Source: NYCDOC locations directory
[x] nycourts_courts http://www.nycourts.gov/courts/nyc/criminal/generalinfo.shtml#BRONX_COUNTY
[x] nysdoccs_corrections https://doccs.ny.gov/find-facility Hand check for 1 facility in queens, 1 facility in Manhattan, 0 in the other 3 boros. Only look at the correctional facility locations, not the offices.
Manual download
[x] doe_lcgms https://data.cityofnewyork.us/Education/LCGMS-DOE-School-Information-Report/3bkj-34v2 This dataset is updated for CEQR
[x] foodbankny_foodbanks http://www.foodbanknyc.org/get-help/ Go to the expanded view of the google map. Click “Download KML” under the options (three dots). Instead of “Entire Map,” select “Food Bank For NYC Open Sites.” Select “Keep data up to date with network link KML (only usable online).” Go to https://mygeodata.cloud/converter/kmz-to-csv to convert the kmz to csv, then use the recipe app to load in the csv.
[x] nysed_activeinstitutions https://eservices.nysed.gov/sedreports/list?id=1 Active Institutions with GIS coordinates and OITS Accuracy Code - Select by County - CSV. Note that the .csv data is automatically downloaded without a comma delimiter. Exporting to csv from Numbers is one way to get around this issue.
[x] nysoasas_programs https://webapps.oasas.ny.gov/providerDirectory/index.cfm?search_type=2 Download all treatment providers Modify download URL to contain today’s date: https://webapps.oasas.ny.gov/providerDirectory/download/Treatment_Providers_OASAS_Directory_Search_13-Nov-20.csv <- this url needs to be updated to programs not providers https://webapps.oasas.ny.gov/providerDirectory/download/Treatment_Programs_OASAS_Directory_Search_13-Dec-22.csv
[x] usnps_parks https://irma.nps.gov/DataStore/Reference/Profile/2225713 NOTE: the final number in the URL (2225713) is not always stable. If the data is missing, search from the home page.
Will receive via email or FTP
Unresolved process
Still waiting to figure out best way to upload these data
Last step