NYCPlanning / db-developments

🏠 🏘️ 🏗️ Developments Database
https://nycplanning.github.io/db-developments
8 stars 2 forks source link

aggregate tract 2020 export failed #601

Open td928 opened 2 years ago

td928 commented 2 years ago

the log for github actions suggests that this failed at which indicates the aggregate table from the sql/tract.sql step has non unique values in the bct2020 column

    aggregate.set_index(idx, inplace=True, verify_integrity=True)
td928 commented 2 years ago

After some research, the issue is narrowed down to the group by query for the tract.sql specifically

GROUP BY
    YEARLY_devdb_{{ decade }}.boro,
    YEARLY_devdb_{{ decade }}.bct{{ decade }},
    YEARLY_devdb_{{ decade }}.centract{{ decade }}

It turns out that for some record in the yearly_devdb_2020 table the centract2020 is sometimes filled out but also sometimes null therefore causing multiple rows for a single bct2020 which supposed to be the index row for dataframes to be joined on.

It is somewhat unexpected because both of those fields are populated together in the _spatial.sql and should always have value as long as the other have value. But the logic to create them is quite complex involving a multiple-way join between a few tables:

DROP TABLE IF EXISTS SPATIAL_devdb;
SELECT
    DRAFT_spatial.uid,
    DRAFT_spatial.geo_bbl,
    DRAFT_spatial.geo_bin,
    DRAFT_spatial.geo_address_numbr,
    DRAFT_spatial.geo_address_street,
    DRAFT_spatial.geo_address,
    DRAFT_spatial.geo_zipcode,
    DRAFT_spatial.geo_boro,
    DRAFT_spatial.geo_csd,
    DRAFT_spatial.geo_cd,
    DRAFT_spatial.geo_council,
    DRAFT_spatial.geo_policeprct,
    DRAFT_spatial.geo_firedivision,
    DRAFT_spatial.geo_firebattalion,
    DRAFT_spatial.geo_firecompany,
    DRAFT_spatial.geo_schoolelmntry,
    DRAFT_spatial.geo_schoolmiddle,
    DRAFT_spatial.geo_schoolsubdist,
    DRAFT_spatial.geo_latitude,
    DRAFT_spatial.geo_longitude,
    DRAFT_spatial.latitude,
    DRAFT_spatial.longitude,
    DRAFT_spatial.geom,
    DRAFT_spatial.geomsource,
    CENSUS_TRACT_BLOCK.fips||DRAFT_spatial._geo_ct2010||DRAFT_spatial._geo_cb2010 as geo_cb2010,
    CENSUS_TRACT_BLOCK.fips||DRAFT_spatial._geo_ct2010 as geo_ct2010,
    CENSUS_TRACT_BLOCK.bctcb2010,
    CENSUS_TRACT_BLOCK.bct2010,
    CENSUS_TRACT_BLOCK.fips||DRAFT_spatial._geo_ct2020||DRAFT_spatial._geo_cb2020 as geo_cb2020,
    CENSUS_TRACT_BLOCK.fips||DRAFT_spatial._geo_ct2020 as geo_ct2020,
    CENSUS_TRACT_BLOCK.bctcb2020,
    CENSUS_TRACT_BLOCK.bct2020,
    dcp_ct2010.ntacode as geo_nta2010,
    dcp_ct2010.ntaname as geo_ntaname2010,
    dcp_ct2020.nta2020 as geo_nta2020,
    dcp_ct2020.ntaname as geo_ntaname2020,
    dcp_ct2020.cdta2020 as geo_cdta2020,
    dcp_ct2020.cdtaname as geo_cdtaname2020
INTO SPATIAL_devdb
FROM DRAFT_spatial
LEFT JOIN CENSUS_TRACT_BLOCK ON DRAFT_spatial.uid = CENSUS_TRACT_BLOCK.uid
LEFT JOIN dcp_ct2010 ON CENSUS_TRACT_BLOCK.bct2010 = boroct2010
LEFT JOIN dcp_ct2020 ON CENSUS_TRACT_BLOCK.bct2020 = boroct2020;