Closed damonmcc closed 1 year ago
@athursland and @DeaBardhoshi see running list of notes below to guide work for Thursday's meeting. cc @fvankrieken (please feel free to add)
[x] What is the universe of the Checkbook NYC data?
How many rows?
What is a row? Or what does a record represent?
What are the columns?
Number of unique projects? (i.e. number of units project ids)
[x] General summary statistics of the data.
Breakdown by agency (number of projects, $s spent)
Spending over time
[ ] Amount of data that can be mapped as both number of projects and $s and % of projects and % of $s, and any other summary stats (i.e. by agency, by year, or by category if you were able to categorize them into 'Fixed Asset' or?)
Overview of method(s) for adding spatial data for projects
Overview of spatial data by source (i.e. Facilities Database, Parks Properties, DDC or?
I think the next steps are to:
Happy to collaborate however is most helpful!
Next steps from 06/29 meeting 1) Collapse checkbook NYC data to the project level to make it easier to work with (~926,562 records). Do not clean data, as it was indicated that the checks we thought were odd may be valid. 2) Familiarize yourselves with the Capital Projects Database aka CPDB 3) Run Checkbook NYC projects through CPDB categorization process and create report of number of projects in each category and number of $ in each category (1) ITT, Vehicles, and Equipment 2) Lump Sum 3) Fixed Asset or 4) NULL). I'm happy to help walk through the logic. 4) Join Checkbook NYC data onto CPDB. I think that this will best be done / setup with another DE team member to help get it going. 5) Report back results and discuss potential next steps. Future work may focus on specific agencies, or?
Morning! Did some digging to 1) help collapse records and 2) join Checkbook NYC onto CPDB
To join checkbook NYC data onto CPDB it appears we need to remove the last three digits and any trailing white space from the Capital Project
value. For example, if I search 998CAP2024 005
in CPDB here I do not get a result, but if I search 998CAP2024
I do get a match. Same thing with 841HWMBRT5 801
vs 841HWMBRT5
.
Given this, I recommend
1) cleaning the Capital Project
values by removing the last three digits and trailing whitespace and renaming the column to FMS ID
2) Group by the following columns: Agency, Budget Code, FMS ID
(old Capital Project` value),
SUM(Check Amount), Expense Category, Fiscal year, Spending Category. You can drop all other columns.
@athursland and @DeaBardhoshi for tracking and note taking For cleaning the CheckbookNYC data please remove negative checks and checks with a value equal to $99,999,999
Going to try joining on cpdb. Jumping off of our conversation during stand-up, is the Digital Ocean file 23_Q2_build/latest/output/cpdb_projects.csv the right file to grab for now? @fvankrieken @damonmcc
Yes, but with the caveat of no geometries in that file. The shapefiles in that folder have geometries, but obviously need to be converted into something you can use. I think you could use gdal/ogr2ogr for that but there's some overhead in that as well. Or actually I guess you could pull those into geopandas, if you're using geopandas already?
Otherwise if it's easier for you, you could point a query at edm data. You'll need a new connection string to supply to sqlalchemy's create_engine
function, in the format postgresql://{user}:{password}@{edm_data_url}:25060/defaultdb - args for that can be found with the rest of our secrets
Schema cpdb, table cpdb_opendata_projects_pts
or cpdb_opendata_projects_poly
(for points or polygons) I believe
I was planning on using geopandas but I think querying the data would probably be a better exercise! Will reach out if I have any roadblocks
Whenever convenient, can someone provide guidance on where to grab archived CPDB with geometries?
@athursland if you have access to the edm-publishing
S3 bucket, I'm seeing some versions in db-cbdp/main/
. but that only goes back to 2021
I'll keep looking for conveniently-located older versions
Notes on this sprint's work, @DeaBardhoshi feel free to fill in details on the categorization work:
FMS ID
column in Checkbook NYC per Amanda's instructions, and filtered to include only rows where Check Amount
< 99,000,000 and >= 0. Then we grouped the data by FMS ID
, Budget Code
, Agency
, Fiscal year
, Expense Category
and Spending Category
, and aggregated the sum of Check Amount
for those groups. Where there were different possible values for the grouping categories, we arbitrarily chose the first one. Resulting dataset contains 80,932 rows.
group_checkbooknyc.ipynb
Budget Code
column. Explored ways to improve these assignments by adding key words to SQL query and running on Contract Purpose
category as well.
Possible extensions:
Further directions/ideas:
Notes from Tuesday 7/11:
FMS ID
and instead concatenating all non-null values of Budget Code
, Contract Purpose
, and Agency
within those groups. Fiscal year
and all other columns are being dropped. Might return to Fiscal year
for future temporal analysis but outside of the scope of the deliverables we need for Capital Planning. Notes from Wednesday 7/12:
FMS ID
has 16,678 rows, each representing a single project. When the adopted CPDB shape files 2017-2022 and the executive CPDB shape file for 2023 are left joined onto Checkbook NYC, 3,880 projects map to a geometry. final_category
based on only bc_category
and cp_category
in the cases where Checkbook NYC projects did not join to a CPDB geometry, and therefore can't have a cpdb_category
value. Now every row in the final dataset should have a value for final_category
regardless of whether it has a geometry from CPDB. Notes on odd ends of this sprint:
Some notes on further categorization directions with the Checkbook NYC data:
So for norms around PRs:
Note for YAML file for reading in CPDB geoms: - currently uploading only the shapefile to DO, but we want to update this to include the whole CPDB geom subdirectory with all the relevant files @DeaBardhoshi
Notes on results from unit testing:
COLUMN
and col_category
terms with the column of interest, ref here: https://stackoverflow.com/questions/19314342/python-sqlalchemy-pass-parameters-in-connection-executeIdeas for extending testing:
Observations from exploring laying in parks properties geometries:
Notes about potential enhancements (running) :
agency
/budget_code
/contract_purpose
into a ';' delimited string we should absolutely be checking for duplicates and we aren't doing that currently
a place for links and discussions for fellows @DeaBardhoshi and @athursland
general links
Historical Capital Spending