AmandaDoyle opened 1 year ago
Script is breaking for bpl_libraries, will investigate
I contacted DOT for the 5 datasets we receive via email, will update when I hear back from them
dcla_culturalinstitutions hasn't been updated since 20210625
dfta_contracts hasn't been updated since 20210423
sca_enrollment_capacity not updated since 20191230
dsny_textiledrop link no longer works, might need to take this from a kml file. See map here: https://www1.nyc.gov/assets/dsny/site/services/donate-goods/textiles
dycd_afterschoolprograms not updated since 20170916
Need to wait for updated dcp_colp data to appear on bytes before updating
usdot_ports URL no longer works
@td928 Why did you change the link to the doe_lcgms? I thought it was the same link as before
updated dcp_colp today
Roadblock: Waiting on UPK data from DOE
Just commenting for posterity - DOE UPK data had an error essentially the same as #592. While trying to figure out how to find an odd byte in the csv file in VSCode, I came across a post suggesting searching the text with the regex [^\x00-\x7f]. It did the trick. Looking at the line in Excel, it seemed to just be a space before the number in a cell, but clearly something weird was copied in.
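The same regex check can be scripted so it doesn't depend on a VSCode search. A minimal sketch (the function name and inputs are illustrative, not part of the actual pipeline):

```python
import re

# Same pattern as the VSCode search: anything outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")

def find_non_ascii(lines):
    """Yield (line_number, character) for every non-ASCII character found."""
    for lineno, line in enumerate(lines, start=1):
        for match in NON_ASCII.finditer(line):
            yield lineno, match.group()
```

Running this over the csv lines flags oddities that look like plain spaces in Excel - e.g. a non-breaking space (U+00A0) pasted in from a spreadsheet would be reported with its line number.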
Generated data from the dev branch (minor tweaks on the branch, mainly docker config). QAQC looks relatively normal I think - the main thing is a near doubling of colp facilities; do we have any context on whether that should be expected?
On the QAQC branch, select branch 583-data-loading-and-maintenance to see top changes by datasource, facgroup, facsubgroup, and factype.
@fvankrieken that's pretty frustrating with the bytes issue. Do we receive the doe_upk data as an excel file and then convert it to a csv and ingest that csv via data-library? It would be nice if we could handle these issues more robustly
That's currently the workflow, yes
The differences in the categories you highlighted are kind of alarming, specifically when looking at something like emergency services in facgroup or FIREHOUSE in factype. We definitely didn't build twice as many firehouses in the last year or so.
I do have an explanation for the facsubgroup Chemical Dependency changes and the addition of Substance Use Disorder Treatment - this was an update of language, so they are a "like" for "like" change. See this PR for clarification: https://github.com/NYCPlanning/db-facilities/pull/598
So it seems to be a column formatting issue. For colp processing, facdb expects usecode as four-digit strings. That is generally how they are formatted in downloads from bytes, but in the latest one on bytes they are simply ints, meaning where we would see ... ,"0520", ... in the csv we instead have ... ,520, ... So filtering based on usecode only works when the code is 1000 or greater.
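A minimal illustration of the mismatch, using the code from the example above:

```python
# "0520" is the four-digit string facdb expects; the latest bytes
# download stores the same code as the int 520.
padded = "0520"
unpadded = str(520)

assert unpadded != padded     # sub-1000 codes no longer match the expected format
assert str(1234) == "1234"    # codes of 1000 or more look the same either way

# Zero-padding restores the expected format:
assert f"{520:04d}" == padded
```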
I reloaded the latest public colp in data library, using script as the source instead of url:
import pandas as pd
from zipfile import ZipFile
import requests

from . import df_to_tempfile


class Scriptor:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    @property
    def version(self):
        return self.config["dataset"]["version"]

    def ingest(self) -> pd.DataFrame:
        url = f"https://s-media.nyc.gov/agencies/dcp/assets/files/zip/data-tools/bytes/nyc_colp_csv_{self.version}.zip"
        r = requests.get(url, stream=True)
        with open(f"nyc_colp_csv_{self.version}.zip", "wb") as fd:
            for chunk in r.iter_content(chunk_size=128):
                fd.write(chunk)
        with ZipFile(f"nyc_colp_csv_{self.version}.zip", "r") as zf:
            zf.extract(f"colp_{self.version}.csv")
        df = pd.read_csv(f"colp_{self.version}.csv")
        # Restore the four-digit zero-padded strings facdb expects
        df["USECODE"] = df["USECODE"].apply(lambda i: f"{i:04d}")
        return df

    def runner(self) -> str:
        df = self.ingest()
        local_path = df_to_tempfile(df)
        return local_path
Not sure if we want to actually use this in general, as this is more a data issue that needs to be fixed in my mind (and therefore should give us some sort of error), but maybe we should explicitly throw an error in the data library step if this column is misformatted
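A sketch of the kind of loud failure being suggested, assuming the USECODE column name from the script above (the function name and error message are placeholders, not the actual data library step):

```python
import pandas as pd

def validate_usecode(df: pd.DataFrame) -> None:
    """Raise if USECODE is not a column of four-digit zero-padded strings."""
    bad = df["USECODE"][~df["USECODE"].astype(str).str.fullmatch(r"\d{4}")]
    if not bad.empty:
        raise ValueError(
            f"USECODE misformatted in {len(bad)} rows, e.g. {bad.iloc[0]!r}; "
            "expected four-digit zero-padded strings"
        )
```

Called right after ingest, this would have surfaced the int-formatted usecodes immediately instead of silently dropping sub-1000 codes downstream.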
But new issues now:
So that's resolved - there were issues with formatting when reading in files via script rather than simply url in data library (a couple of columns got coerced to float, leading "1" to be saved as "1.0", which still got read in as a string by facdb, causing similar issues).
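The coercion is easy to reproduce: any blank cell makes pandas infer a float column. One way to avoid it is reading everything as strings and only converting the columns that genuinely need numeric handling. A small demonstration (column names are just examples):

```python
import io

import pandas as pd

csv = io.StringIO("BIN,USECODE\n1000000,0520\n,0210\n")

# Default inference: the blank BIN makes the column float, so 1000000
# would round-trip through a csv as "1000000.0".
inferred = pd.read_csv(csv)
assert inferred["BIN"].dtype == "float64"

# dtype=str keeps every value exactly as written, leading zeros included.
csv.seek(0)
as_str = pd.read_csv(csv, dtype=str)
assert as_str["BIN"].iloc[0] == "1000000"
assert as_str["USECODE"].iloc[0] == "0520"
```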
So at this point looking better
A lot of additions in prek, which seems potentially a bit odd - would love others' thoughts here. A huge gain in scrap metal processing seems a bit odd as well.
> Not sure if we want to actually use this in general as this is more a data issue that needs to be fixed in my mind (and therefore should give us some sort of error), but maybe we should explicitly throw an error on data library step if this column is misformatted
I'm all for making things fail loudly upstream. We can build this into data library or whatever would be easiest for you to make build processes easier. Other ideas are to make the filtering more flexible so that it converts the field in COLP to an integer and filters on numeric values. Upstream data changes are a chronic issue.
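The more flexible filtering could look something like this - a sketch under the assumption that we compare codes numerically, not the actual facdb code; names are illustrative:

```python
import pandas as pd

def usecode_matches(usecodes: pd.Series, wanted: set) -> pd.Series:
    """Boolean mask of rows whose use code is in `wanted`, compared numerically.

    Treats "0520", "520", 520, and 520.0 all as the code 520, so upstream
    formatting changes don't silently drop rows.
    """
    numeric = pd.to_numeric(usecodes, errors="coerce")
    return numeric.isin(list(wanted))
```

Usage would be something like `df[usecode_matches(df["USECODE"], {520, 1234})]`. The trade-off versus failing loudly: this tolerates the formatting drift instead of surfacing it, so pairing it with an upstream check is probably still worthwhile.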
Summarizing what we discussed for posterity. Next steps and things to look into.
Scrap metal I think is okay - from spot checking, it seems that many of the new ones were classified as something else in the second-latest dca_operatingbusinesses ("Electronics store", "Secondhand dealer - general", etc., with facnames like "Scrap King"). We're not seeing the corresponding missing rows (as in -50 secondhand dealers) because those get filtered out in the python processing part of the step and aren't included.
EMERGENCY -> EMERGENCY/CRISIS change is from the nysomh data source, column program_category_description:
https://data.ny.gov/Human-Services/Local-Mental-Health-Programs/6nvr-tbv8
I don't see any information there about the change. Should we do anything in this case?
Re prek - the last build used data from 2021, so it's been a little while. This query (run locally against the built facdb) returns new sites based on bins not being in the last facdb (from the doe_universalprek datasource at least):
select a.facname
from facdb a
left join dcp_facilities_with_unmapped b
on a.bin::text = b.bin and b.datasource = 'doe_universalprek'
where a.datasource='doe_universalprek' and b.bin is null;
Glancing through, they all seem real. Of the 2027 rows from doe_universalprek, we have 1985 distinct bins - it seems sometimes there are multiple schools at the same location.
> EMERGENCY -> EMERGENCY/CRISIS change is from the nysomh data source, column program_category_description https://data.ny.gov/Human-Services/Local-Mental-Health-Programs/6nvr-tbv8 I don't see any information there about the change. Should we do anything in this case?
As long as this new facility type is being assigned to a subgroup, there's nothing for us to do in this case besides take note of the new facility types. If a new facility type is not being assigned to a subgroup, we'll need to assign it to one. Given that facility types are meant to reflect how agencies refer to a facility, we don't like to change them and prefer to take whatever is in the source data.
Gotcha. Groups and subgroups are consistent for these from previous builds
outputs currently in review
FacDB Source Data Updates
Like most of our data products, source data must be updated in data library before FacDB is run. As there are many source datasets with varied update processes, this issue template should be opened to track progress towards updating all source data.
All source data listed is to be uploaded as .csv files
Scraped by data library
[x] bpl_libraries Source: Scraped from BPL website Source url: https://www.bklynlibrary.org/locations/json
[x] nypl_libraries Source: Scraped from NYPL website Source url: https://www.nypl.org/locations/list
[x] uscourts_courts Source: Court locator for NY state Source url: http://www.uscourts.gov/court-locator/city/New%20York/state/NY
Source data from OpenData
To see if a dataset needs to be uploaded, check date last updated in open data against version in data library
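For the Socrata-hosted datasets this check can be scripted against the views metadata API (the endpoint pattern and rowsUpdatedAt field are standard Socrata; the helper names here are illustrative):

```python
from datetime import datetime, timezone

def version_from_metadata(metadata: dict) -> str:
    """Turn Socrata's rowsUpdatedAt unix timestamp into a YYYYMMDD version string."""
    ts = datetime.fromtimestamp(metadata["rowsUpdatedAt"], tz=timezone.utc)
    return ts.strftime("%Y%m%d")

def open_data_version(domain: str, dataset_id: str) -> str:
    """Fetch a dataset's last-updated date, e.g. ("data.cityofnewyork.us", "w7w3-xahh")."""
    import requests  # third-party; only needed when actually fetching

    r = requests.get(f"https://{domain}/api/views/{dataset_id}.json", timeout=30)
    r.raise_for_status()
    return version_from_metadata(r.json())
```

Compare the returned string against the version stored in data library to decide whether a dataset needs an update run.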
[x] dca_operatingbusinesses https://data.cityofnewyork.us/Business/Legally-Operating-Businesses/w7w3-xahh
[x] dcp_colp https://www1.nyc.gov/site/planning/data-maps/open-data.page#city_facilities
[x] dcla_culturalinstitutions https://data.cityofnewyork.us/Recreation/DCLA-Cultural-Organizations/u35m-9t32
[x] dfta_contracts https://data.cityofnewyork.us/Social-Services/DFTA-Contracts/6j6t-3ixh
[x] doe_busroutesgarages https://data.cityofnewyork.us/Transportation/Routes/8yac-vygm
[x] sca_enrollment_capacity https://data.cityofnewyork.us/Education/Enrollment-Capacity-And-Utilization-Reports-Target/8b9a-pywy
[x] dohmh_daycare https://data.cityofnewyork.us/Health/DOHMH-Childcare-Center-Inspections/dsg6-ifza
[x] dpr_parksproperties https://nycopendata.socrata.com/Recreation/Parks-Properties/enfh-gkve NOTE: DPR open data table URLs are not consistent. Be sure to double-check before running from the recipes app.
[x] dsny_garages https://data.cityofnewyork.us/Environment/DSNY-Garages/xw3j-2yxf
[x] dsny_specialwastedrop https://data.cityofnewyork.us/Environment/DSNY-Special-Waste-Drop-off-Sites/242c-ru4i
[ ] dsny_textiledrop https://data.cityofnewyork.us/Environment/Textile-Drop-Off-Locations-in-NYC/qnjm-wvu5
[x] dsny_donatenycdirectory https://data.cityofnewyork.us/Environment/DSNY-DonateNYC-Directory/gkgs-za6m
[x] dsny_leafdrop https://data.cityofnewyork.us/Environment/Leaf-Drop-Off-Locations-in-NYC/8i9k-4gi5
[x] dsny_fooddrop https://data.cityofnewyork.us/Environment/Food-Scrap-Drop-Off-Locations-in-NYC/if26-z6xq
[x] dsny_electronicsdrop https://data.cityofnewyork.us/Environment/Electronics-Drop-Off-Locations-in-NYC/wshr-5vic
[x] dycd_afterschoolprograms https://data.cityofnewyork.us/Education/DYCD-after-school-programs/mbd7-jfnc
[x] fdny_firehouses https://data.cityofnewyork.us/Public-Safety/FDNY-Firehouse-Listing/hc8x-tcnd
[x] hhc_hospitals https://data.cityofnewyork.us/Health/Health-and-Hospitals-Corporation-HHC-Facilities/f7b6-v6v3
[x] hra_jobcenters https://data.cityofnewyork.us/Business/Directory-Of-Job-Centers/9d9t-bmk7
[x] hra_medicaid https://data.cityofnewyork.us/City-Government/Medicaid-Offices/ibs4-k445
[x] hra_snapcenters https://data.cityofnewyork.us/Social-Services/Directory-of-SNAP-Centers/tc6u-8rnp
[x] moeo_socialservicesitelocations https://data.cityofnewyork.us/City-Government/Verified-Locations-for-NYC-City-Funded-Social-Serv/2bvn-ky2h
[x] nycha_communitycenters https://data.cityofnewyork.us/Social-Services/Directory-of-NYCHA-Community-Facilities/crns-fw6u
[x] nycha_policeservice https://data.cityofnewyork.us/Housing-Development/NYCHA-PSA-Police-Service-Areas-/72wx-vdjr
[x] nysdec_solidwaste https://data.ny.gov/Energy-Environment/Solid-Waste-Management-Facilities/2fni-raj8
[x] nysdoh_healthfacilities https://health.data.ny.gov/Health/Health-Facility-General-Information/vn5v-hh5r
[x] nysdoh_nursinghomes https://health.data.ny.gov/Health/Nursing-Home-Weekly-Bed-Census-Last-Submission/izta-vnpq
[x] nysomh_mentalhealth https://data.ny.gov/Human-Services/Local-Mental-Health-Programs/6nvr-tbv8
[x] nysopwdd_providers https://data.ny.gov/Human-Services/Directory-of-Developmental-Disabilities-Service-Pr/ieqx-cqyk
[x] nysparks_historicplaces https://data.ny.gov/Recreation/National-Register-of-Historic-Places/iisn-hnyv
[x] nysparks_parks https://data.ny.gov/Recreation/State-Park-Facility-Points/9uuk-x7vh
[x] qpl_libraries https://data.cityofnewyork.us/Education/Queens-Library-Branches/kh3d-xhq7
[x] sbs_workforce1 https://data.cityofnewyork.us/dataset/Center-Service-Locations/6smc-7mk6
[x] usdot_airports https://hub.arcgis.com/datasets/usdot::airports Head to url >> api >> copy url from geojson
[x] usdot_ports https://hub.arcgis.com/datasets/usdot::ports Head to url >> api >> copy url from geojson
[x] nysdec_lands http://gis.ny.gov/gisdata/inventories/details.cfm?DSID=1114
Manually check data for updates
These don't report date updated as neatly as the open datasets; you have to look at the data itself.
[x] fbop_corrections https://www.bop.gov/locations/list.jsp When searching by state, there should be 5 NY prisons, 3 of which are in NYC (Brooklyn/New York)
[x] nycdoc_corrections https://www1.nyc.gov/site/doc/about/facilities-locations.page Source: NYCDOC locations directory
[x] nycourts_courts http://www.nycourts.gov/courts/nyc/criminal/generalinfo.shtml#BRONX_COUNTY
[x] nysdoccs_corrections https://doccs.ny.gov/find-facility Hand check for 1 facility in queens, 1 facility in Manhattan, 0 in the other 3 boros. Only look at the correctional facility locations, not the offices.
Manual download
[x] doe_lcgms https://data.cityofnewyork.us/Education/LCGMS-DOE-School-Information-Report/3bkj-34v2 This dataset is updated for CEQR
[x] foodbankny_foodbanks http://www.foodbanknyc.org/get-help/ Go to the expanded view of the google map. Click “Download KML” under the options (three dots). Instead of “Entire Map,” select “Food Bank For NYC Open Sites.” Select “Keep data up to date with network link KML (only usable online).” Go to https://mygeodata.cloud/converter/kmz-to-csv to convert the kmz to csv, then use the recipe app to load in the csv.
[x] nysed_activeinstitutions https://eservices.nysed.gov/sedreports/list?id=1 Active Institutions with GIS coordinates and OITS Accuracy Code - Select by County - CSV. Note that the .csv data is automatically downloaded without a comma delimiter. Exporting to csv from Numbers is one way to get around this issue.
[x] nysoasas_programs https://webapps.oasas.ny.gov/providerDirectory/index.cfm?search_type=2 Download all treatment providers Modify download URL to contain today’s date: https://webapps.oasas.ny.gov/providerDirectory/download/Treatment_Providers_OASAS_Directory_Search_13-Nov-20.csv <- this url needs to be updated to programs not providers https://webapps.oasas.ny.gov/providerDirectory/download/Treatment_Programs_OASAS_Directory_Search_13-Dec-22.csv
[x] usnps_parks https://irma.nps.gov/DataStore/Reference/Profile/2225713 NOTE: the final number in the URL (2225713) is not always stable. If the data is missing, search from the home page.
Will receive via email or FTP
Unresolved process
Still waiting to figure out best way to upload these data
Last step