NYCPlanning / db-facilities

🏭 🏢 🏬 🏣 🏤 🏥 🏦 🏨 🏪 🏫 🏩
https://nycplanning.github.io/db-facilities
0 stars 0 forks source link

22Q4 Facilities DB Data loading #583

Open AmandaDoyle opened 1 year ago

AmandaDoyle commented 1 year ago

FacDB Source Data Updates

Like most of our data products, source data must be updated in data library before FacDB is run. As there are are many source datasets with varied update processes, this issue template should be opened to track progress towards updating all source data

All source data listed is to be uploaded as .csv files

Scraped by data library

Source data from OpenData

To see if a dataset needs to be uploaded, check date last updated in open data against version in data library

Manually check data for updates

These don't report date updated as neatly as the open datasets, have to look at data itself

Manual download

Will receive via email or FTP

Unresolved process

Still waiting to figure out best way to upload these data

Last step

mbh329 commented 1 year ago

Script is breaking for bpl_libraries, will investigate

mbh329 commented 1 year ago

I contacted DOT for the 5 datasets we receive via email, will update when I hear back from them

mbh329 commented 1 year ago

dcla_culturalinstitutions hasn't been updated since 20210625

dfta_contracts hasn't been updated since 20210423

sca_enrollment_capacity not updated since 20191230

dsny_textiledrop link no longer work, might need to take this from a kml file. See map here https://www1.nyc.gov/assets/dsny/site/services/donate-goods/textiles

dycd_afterschoolprograms not updated since 20170916

mbh329 commented 1 year ago

need to wait for updated dcp_colp data to appear on bytes to update

mbh329 commented 1 year ago

usdot_ports URL no longer works

mbh329 commented 1 year ago

@td928 Why did you change the link to the doe_lcgms? I though it's the same link as before

mbh329 commented 1 year ago

updated dcp_colp today

AmandaDoyle commented 1 year ago

Roadblock: Waiting on UPK data from DOE

fvankrieken commented 1 year ago

Just commenting for posterity - DOE UPK data had an error essentially the same as #592. In trying to figure out how to find an odd byte in VSCode (in the csv file), came across some post suggesting searching text with regex [^\x00-\x7f]. It did the trick. Looking at the line in the excel, it seemed to just be a space before the number in a cell, but clearly something weird was copied in

fvankrieken commented 1 year ago

Generated data from dev branch (minor tweaks on the branch, mainly docker config). QAQC looks relatively normal I think - main thing is a near doubling of colp facilities, do we have any context on if that should be expected?

On QCQA branch, select branch 583-data-loading-and-maintenance to see

fvankrieken commented 1 year ago

datasource image

facgoup top changes image

facsubgroup top changes image

factype image

mbh329 commented 1 year ago

@fvankrieken thats pretty frustrating with the bytes issue. Do we receive the doe_upk data as an excel file and then convert it to a csv and ingest that csv via data-library? It would be nice if we could handle these issues more robustly

fvankrieken commented 1 year ago

@fvankrieken thats pretty frustrating with the bytes issue. Do we receive the doe_upk data as an excel file and then convert it to a csv and ingest that csv via data-library? It would be nice if we could handle these issues more robustly

That's currently the workflow, yes

mbh329 commented 1 year ago

The differences in the the categories you highlighted are kind of alarming specifically when looking at something like emergency services in facgroup or FIREHOUSE in the factype. We definitely didn't build twice as many firehouses in the last year or so.

mbh329 commented 1 year ago

I do have an explanation for facsubgroup Chemical Dependency changes and the addition of Substance Use Disorder Treatment - this was an update of language so they are a "like" for "like" change. See this [PR] (https://github.com/NYCPlanning/db-facilities/pull/598) for clarification

fvankrieken commented 1 year ago

So it seems to be a column formatting issue. For colp processing, facdb expects usecode as four-digit strings. That is generally how they are formatted in downloads from bytes, but the latest one on bytes are simply ints, meaning where we would see ... ,"0520", ... in the csv we instead have ... ,520, ...

So filtering based on usecode is only working when code is > 1000

I reloaded latest public colp in data library, using script as source instead of url

import pandas as pd
from zipfile import ZipFile
import requests

from . import df_to_tempfile

class Scriptor:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    @property
    def version(self):
        return self.config["dataset"]["version"]

    def ingest(self) -> pd.DataFrame:
        url = f"https://s-media.nyc.gov/agencies/dcp/assets/files/zip/data-tools/bytes/nyc_colp_csv_{self.version}.zip"
        r = requests.get(url, stream=True)
        with open(f"nyc_colp_csv_{self.version}.zip", "wb") as fd:
            for chunk in r.iter_content(chunk_size=128):
                fd.write(chunk)
        with ZipFile(f"nyc_colp_csv_{self.version}.zip", "r") as zip:
            zip.extract(f"colp_{self.version}.csv")
            df = pd.read_csv(f"colp_{self.version}.csv")
        df['USECODE'] = df['USECODE'].apply(lambda i: f'{i:04d}')
        return df

    def runner(self) -> str:
        df = self.ingest()
        local_path = df_to_tempfile(df)
        return local_path

Not sure if we want to actually use this in general as this is more a data issue that needs to be fixed in my mind (and therefore should give us some sort of error), but maybe we should explicitly throw an error on data library step if this column is misformatted

fvankrieken commented 1 year ago

But new issues now:

image
fvankrieken commented 1 year ago

So that's resolved - issues with formatting from reading in files via script rather than simply url in data library (a couple columns got coerced to float, leading "1" to be saved as "1.0" which still got read in as a string by facdb causing similar issues.

So at this point looking better image

A lot of additions in prek it seems which seems potentially a bit odd. Would love others' thoughts here. Also, a huge gain in scrap metal processing seems a bit odd as well

image

AmandaDoyle commented 1 year ago

Not sure if we want to actually use this in general as this is more a data issue that needs to be fixed in my mind (and therefore should give us some sort of error), but maybe we should explicitly throw an error on data library step if this column is misformatted

I'm all for making things fail loudly upstream. We can build this into data library or whatever would be easiest for you to make build processes easier. Other ideas are to make the filtering more flexible so that it converts the field in COLP to an integer and filters on numeric values. Upstream data changes are a chronic issue.

Summarizing what we discussed for posterity. Next steps and things to look into.

fvankrieken commented 1 year ago

Scrap metal I think is okay - from spot checking, it seems that many of the new ones were classified as something else in the 2nd latest dca_operatingbusinesses, "Electronics store", "Secondhand dealer - general" (and whose facnames are things like "Scrap King"), etc. We're not seeing missing rows (as in -50 secondhand dealers) because these get filtered out in the python processing part of the step and aren't included

fvankrieken commented 1 year ago

EMERGENCY -> EMERGENCY/CRISIS change is from nysome data source, column program_category_description

https://data.ny.gov/Human-Services/Local-Mental-Health-Programs/6nvr-tbv8

Don't see any information there about the change. Should we do anything in this case?

fvankrieken commented 1 year ago

Re prek - the last build used data from 2021 so it's been a little while. this query (locally with built facdb) returns new sites based on bins not being in the last facdb (from the universal_prek datasource at least)

select a.facname 
from facdb a 
left join dcp_facilities_with_unmapped b 
    on a.bin::text = b.bin and b.datasource = 'doe_universalprek' 
where a.datasource='doe_universalprek' and b.bin is  null;

Glancing through, they all seem real. Of the 2027 rows from doe_universalprek, we have 1985 distinct bins. Seems sometimes there are multiple schools at the same location

image

AmandaDoyle commented 1 year ago

EMERGENCY -> EMERGENCY/CRISIS change is from nysome data source, column program_category_description https://data.ny.gov/Human-Services/Local-Mental-Health-Programs/6nvr-tbv8 Don't see any information there about the change. Should we do anything in this case?

As long as this new facility type is being assigned to a subgroup, there's nothing for us to do in this case besides take note of the new facility types. If a new facility type is not being assigned to a subgroup, we'll need to assign it to one. Given that facility types are meant to reflect how agencies refer to a facility, we don't like to change them and prefer to take whatever is in the source data.

fvankrieken commented 1 year ago

Gotcha. Groups and subgroups are consistent for these from previous builds

damonmcc commented 1 year ago

outputs currently in review