NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

compare CBBR datasets #1005

Closed damonmcc closed 1 week ago

damonmcc commented 1 month ago

background

We plan to make our CBBR data product public. During the 'Making CBBR data public' meeting on 7/8, we learned about the dataset published by OMB: Register of Community Board Budget Requests.

As a first step in determining the best way to make the CBBR data public, DE will compare three datasets:
A. What Planning Support delivers to Data Engineering to create CBBR
B. The final CBBR output created by DE
C. The dataset published by OMB

goals

Write a description of the three datasets, how they're related, and how they're different. CBBR's wiki may be a good place to put a final draft of these details.

HengJiang0206 commented 1 month ago

cbbr_submissions, the dataset delivered to Data Engineering by Planning Support, has 31 columns, with each row identifying a budget request. It contains information on each request's source, contents, geographies, and status. The source columns give the names of the borough and the board. The content columns cover priorities, types, textual descriptions of the request and its reasons, supported entities, project ID, and budget line references. The geography columns include street names, addresses, locations (site names), etc. Locations, street names, and addresses can appear in various formats and may overlap. The status columns include the names and codes of the agencies responding to the request, the content and category of the response, and any additional comments.

CDNeeds_CBBRs_ALL_Archive, the output dataset produced by Data Engineering, has 24 columns, with each row identifying a budget request. It also contains information on the sources, contents, geographies, and statuses of the requests, but a few columns are modified or removed compared with cbbr_submissions. For sources, it drops the boro_and_board column, which is a concatenation of the borough codes and board codes. For geographies, it keeps the location column and drops the other geographic columns: street_name, cross_street_1, cross_street_2, facility_or_park_name, and address. For statuses, agency_name is renamed to agency_acronym, and the agency_reponse_code column is dropped.
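The column-level comparison above could be reproduced with a small helper like the one below. This is only a sketch: `compare_columns` is a hypothetical function, and the column lists contain just the columns named in this comment, not the full 31-column and 24-column schemas.

```python
def compare_columns(source_cols, output_cols, renames=None):
    """Report which source columns were dropped, renamed, or newly added.

    `renames` maps an old column name to its new name (hypothetical helper,
    not part of the CBBR build).
    """
    renames = renames or {}
    output = set(output_cols)
    # Source columns as they would be named after applying the known renames
    source_after_rename = {renames.get(c, c) for c in source_cols}
    return {
        "dropped": sorted(c for c in source_cols if renames.get(c, c) not in output),
        "added": sorted(output - source_after_rename),
        "renamed": {
            old: new
            for old, new in renames.items()
            if old in source_cols and new in output
        },
    }


# Illustrative subset only: the columns mentioned in the comparison above
submissions_cols = [
    "boro_and_board",
    "location",
    "street_name",
    "cross_street_1",
    "cross_street_2",
    "facility_or_park_name",
    "address",
    "agency_name",
    "agency_reponse_code",
]
de_output_cols = ["location", "agency_acronym"]

report = compare_columns(
    submissions_cols,
    de_output_cols,
    renames={"agency_name": "agency_acronym"},
)
print(report)
```

Passing the full column lists of both tables would extend the same report to every column.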

Register of Community Board Budget Requests, published by the Mayor's Office of Management & Budget (OMB) on NYC Open Data, has 26 columns, with each row identifying a budget request. Compared with cbbr_submissions, it adds a Publication column, which records the budget publication date. For sources, it drops the full borough names and keeps only the borough codes (renamed to borough); it keeps the community board number (renamed to community_board) and drops cb_label; it also drops the type column. For contents, it keeps all the columns and renames some of them. For statuses, it keeps the response, the response agency's full name, and the response agency's acronym, and drops the agency code, response code, response category, and category description. For geographies, it drops facility_or_park_name and location, and it includes multiple additional geographic columns, such as Block, Lot, Postcode, Latitude, Longitude, Council District, BIN, BBL, Census Tract, and Neighborhood Tabulation Area (NTA).
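Because the OMB column labels use a different naming style than the snake_case names in cbbr_submissions, one way to line the geographic columns up is to normalize the names before comparing them. The sketch below is an assumption-heavy illustration: `normalize` is a hypothetical helper, "NTA" stands in for the Neighborhood Tabulation Area column, and the OMB labels for the retained street and address columns are guessed rather than taken from the published dataset.

```python
import re


def normalize(name):
    """Lowercase a column label and collapse non-alphanumerics to underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


# Geographic columns in cbbr_submissions, as named in this thread
submissions_geo_cols = [
    "street_name",
    "cross_street_1",
    "cross_street_2",
    "facility_or_park_name",
    "address",
    "location",
]

# OMB geographic columns: the first four labels are ASSUMED spellings for the
# retained columns; the rest are the additional columns listed above
omb_geo_cols = [
    "Street Name",
    "Cross Street 1",
    "Cross Street 2",
    "Address",
    "Block",
    "Lot",
    "Postcode",
    "Latitude",
    "Longitude",
    "Council District",
    "BIN",
    "BBL",
    "Census Tract",
    "NTA",
]

submissions_norm = {normalize(c) for c in submissions_geo_cols}
omb_norm = {normalize(c) for c in omb_geo_cols}

only_in_submissions = sorted(submissions_norm - omb_norm)
only_in_omb = sorted(omb_norm - submissions_norm)

print("Only in cbbr_submissions:", only_in_submissions)
print("Only in OMB dataset:", only_in_omb)
```

Swapping in the real column headers from the Open Data export would show whether the assumed labels hold.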