NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Data Engineering fellows notes #23

Closed damonmcc closed 1 year ago

damonmcc commented 1 year ago

a place for links and discussions for fellows @DeaBardhoshi and @athursland

general links

Historical Capital Spending

AmandaDoyle commented 1 year ago

@athursland and @DeaBardhoshi see running list of notes below to guide work for Thursday's meeting. cc @fvankrieken (please feel free to add)

I think the next steps are to:

Happy to collaborate however is most helpful!

AmandaDoyle commented 1 year ago

https://github.com/NYCPlanning/edm-overview/issues/819#issuecomment-1610085125

AmandaDoyle commented 1 year ago

Next steps from 06/29 meeting:

  1. Collapse Checkbook NYC data to the project level to make it easier to work with (~926,562 records). Do not clean the data, as it was indicated that the checks we thought were odd may be valid.
  2. Familiarize yourselves with the Capital Projects Database, aka CPDB.
  3. Run Checkbook NYC projects through the CPDB categorization process and create a report of the number of projects and the dollar amount in each category: 1) ITT, Vehicles, and Equipment, 2) Lump Sum, 3) Fixed Asset, or 4) NULL. I'm happy to help walk through the logic.
  4. Join Checkbook NYC data onto CPDB. I think this will best be set up with another DE team member to help get it going.
  5. Report back results and discuss potential next steps. Future work may focus on specific agencies, or other directions.
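
For the report in step 3, here is a minimal pandas sketch. It assumes the categorization has already been run and its result sits in a column; the column names `fms_id`, `category`, and `check_amount` and the sample rows are made up for illustration and are not the actual CPDB or Checkbook NYC field names.

```python
import pandas as pd


def category_report(projects: pd.DataFrame) -> pd.DataFrame:
    """Count distinct projects and sum dollars per category.

    dropna=False keeps the NULL (uncategorized) group in the report.
    Column names are hypothetical placeholders.
    """
    return (
        projects.groupby("category", dropna=False)
        .agg(n_projects=("fms_id", "nunique"), total_dollars=("check_amount", "sum"))
        .reset_index()
    )


# Made-up sample rows, only to show the shape of the output.
sample = pd.DataFrame(
    {
        "fms_id": ["A", "A", "B", "C", "D"],
        "category": ["Lump Sum", "Lump Sum", "Fixed Asset", None, "ITT, Vehicles, and Equipment"],
        "check_amount": [10.0, 5.0, 20.0, 7.0, 3.0],
    }
)
report = category_report(sample)
print(report)
```

Counting with `nunique` rather than row counts matters if the input still has more than one row per project.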

AmandaDoyle commented 1 year ago

Morning! Did some digging to 1) help collapse records and 2) join Checkbook NYC onto CPDB

To join Checkbook NYC data onto CPDB, it appears we need to remove the last three digits and any trailing whitespace from the Capital Project value. For example, if I search 998CAP2024 005 in CPDB I do not get a result, but if I search 998CAP2024 I do get a match. Same thing with 841HWMBRT5 801 vs 841HWMBRT5.

Given this, I recommend:

  1. Cleaning the Capital Project values by removing the last three digits and trailing whitespace, and renaming the column to FMS ID.
  2. Grouping by the following columns: Agency, Budget Code, FMS ID (the old Capital Project value), Expense Category, Fiscal year, Spending Category, and taking SUM(Check Amount). You can drop all other columns.
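
A rough pandas sketch of the two recommendations, for reference. The sample rows are invented (only the two Capital Project values come from the examples above), and the regex assumes the three-digit suffix is always separated from the ID by whitespace, which should be checked against the real data.

```python
import pandas as pd

# Invented sample rows; column names follow the fields listed above.
checks = pd.DataFrame(
    {
        "Agency": ["DOT", "DOT", "DCAS"],
        "Budget Code": ["HWMB", "HWMB", "CAP2"],
        "Capital Project": ["841HWMBRT5 801", "841HWMBRT5 802", "998CAP2024 005"],
        "Check Amount": [100.0, 250.0, 40.0],
        "Expense Category": ["Construction", "Construction", "Equipment"],
        "Fiscal year": [2020, 2020, 2021],
        "Spending Category": ["Capital Contracts"] * 3,
    }
)

# 1) Strip the whitespace-separated three-digit suffix and store it as FMS ID.
#    Requiring \s+ avoids eating the last digits of IDs that have no suffix.
checks["FMS ID"] = checks["Capital Project"].str.replace(
    r"\s+\d{3}\s*$", "", regex=True
)

# 2) Collapse to one row per group, summing the check amounts.
group_cols = [
    "Agency",
    "Budget Code",
    "FMS ID",
    "Expense Category",
    "Fiscal year",
    "Spending Category",
]
collapsed = checks.groupby(group_cols, as_index=False, dropna=False)[
    "Check Amount"
].sum()
print(collapsed)
```

`dropna=False` keeps rows where one of the grouping columns is missing; by default pandas drops those groups silently.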

AmandaDoyle commented 1 year ago

@athursland and @DeaBardhoshi, for tracking and note taking: when cleaning the Checkbook NYC data, please remove negative checks and checks with a value equal to $99,999,999.
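
One way this filter could look in pandas, as a sketch. The function name and the sample amounts are made up, and keeping zero-amount checks is an assumption, since only negative checks and the $99,999,999 value were called out.

```python
import pandas as pd

FLAGGED_AMOUNT = 99_999_999  # the check value called out above


def drop_flagged_checks(checks: pd.DataFrame, amount_col: str = "Check Amount") -> pd.DataFrame:
    """Remove negative checks and checks equal to the flagged amount."""
    amounts = checks[amount_col]
    keep = (amounts >= 0) & (amounts != FLAGGED_AMOUNT)
    return checks[keep].reset_index(drop=True)


# Made-up amounts to show the behavior.
raw = pd.DataFrame({"Check Amount": [100.0, -50.0, 99_999_999.0, 0.0]})
cleaned = drop_flagged_checks(raw)
print(cleaned)
```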

athursland commented 1 year ago

Going to try joining on cpdb. Jumping off of our conversation during stand-up, is the Digital Ocean file 23_Q2_build/latest/output/cpdb_projects.csv the right file to grab for now? @fvankrieken @damonmcc

fvankrieken commented 1 year ago

Yes, but with the caveat of no geometries in that file. The shapefiles in that folder have geometries, but obviously need to be converted into something you can use. I think you could use gdal/ogr2ogr for that but there's some overhead in that as well. Or actually I guess you could pull those into geopandas, if you're using geopandas already?

fvankrieken commented 1 year ago

Otherwise if it's easier for you, you could point a query at edm data. You'll need a new connection string to supply to sqlalchemy's create_engine function, in the format postgresql://{user}:{password}@{edm_data_url}:25060/defaultdb - args for that can be found with the rest of our secrets

Schema cpdb, table cpdb_opendata_projects_pts or cpdb_opendata_projects_poly (for points or polygons) I believe
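
A sketch of what that could look like. The port, database name, schema, and points table are taken from the comments above (the table name was given with an "I believe", so it should be confirmed); the helper names, user, password, and host are placeholders, and the real values live with the team's secrets.

```python
from urllib.parse import quote_plus


def edm_data_url(user: str, password: str, host: str,
                 port: int = 25060, database: str = "defaultdb") -> str:
    """Build the connection string in the format described above.

    quote_plus escapes characters such as '@' that would otherwise
    break URL parsing.
    """
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"


def read_cpdb_points(url: str, limit: int = 5):
    """Pull a few rows from the CPDB points table (needs network access)."""
    # Imported here so the URL helper works without these packages installed.
    import pandas as pd
    from sqlalchemy import create_engine, text

    engine = create_engine(url)
    query = text("SELECT * FROM cpdb.cpdb_opendata_projects_pts LIMIT :n")
    with engine.connect() as conn:
        return pd.read_sql(query, conn, params={"n": limit})


# Placeholder credentials, not real ones.
url = edm_data_url("some_user", "p@ss", "db.example.com")
print(url)
```

`read_cpdb_points(url)` would return a regular DataFrame with the geometry as a raw column; loading it as geometries (for example with `geopandas.read_postgis`) would need the geometry column name, which is not stated in this thread.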

athursland commented 1 year ago

I was planning on using geopandas but I think querying the data would probably be a better exercise! Will reach out if I have any roadblocks

athursland commented 1 year ago

Whenever convenient, can someone provide guidance on where to grab archived CPDB with geometries?

damonmcc commented 1 year ago

@athursland if you have access to the edm-publishing S3 bucket, I'm seeing some versions in db-cbdp/main/, but those only go back to 2021

I'll keep looking for conveniently-located older versions

athursland commented 1 year ago

Notes on this sprint's work, @DeaBardhoshi feel free to fill in details on the categorization work:

athursland commented 1 year ago

Possible extensions:

DeaBardhoshi commented 1 year ago

Further directions/ideas:

  1. Building on the categorization work:
  2. NLP-related work for Projects pre-2017:
    • Exploring the info in Contract Purpose + incorporating that into adding spatial information
    • Looking into extracting place-name mentions (fuzzy str matching, NER or other methods)
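
To make the fuzzy string matching idea concrete, here is a small sketch using only the standard library's `difflib`. The function name, the example contract purpose, the place-name list, and the 0.85 cutoff are all made up for illustration; a real run would need an actual gazetteer of NYC place names, and NER would be a separate approach.

```python
from difflib import get_close_matches


def find_place_mentions(text, gazetteer, cutoff=0.85):
    """Return gazetteer names that approximately appear in the text.

    Slides word windows over the text and fuzzy-matches each window
    against the lowercased place names.
    """
    if not gazetteer:
        return []
    lookup = {name.lower(): name for name in gazetteer}
    tokens = text.lower().replace(",", " ").split()
    max_words = max(len(name.split()) for name in lookup)
    found = []
    for size in range(1, max_words + 1):
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size])
            for match in get_close_matches(candidate, list(lookup), n=1, cutoff=cutoff):
                if lookup[match] not in found:
                    found.append(lookup[match])
    return found


# Made-up contract purpose with a misspelled park name.
purpose = "RECONSTRUCTION OF PROSPCT PARK PLAYGROUND, BROOKLYN"
places = ["Prospect Park", "Brooklyn", "Flushing Meadows"]
mentions = find_place_mentions(purpose, places)
print(mentions)
```

The cutoff trades off missed misspellings against false matches and would need tuning on real Contract Purpose text.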

athursland commented 1 year ago

Notes from Tuesday 7/11:

athursland commented 1 year ago

Notes from Wednesday 7/12:

athursland commented 1 year ago

Notes on odd ends of this sprint:

DeaBardhoshi commented 1 year ago

Some notes on further categorization directions with the Checkbook NYC data:

fvankrieken commented 1 year ago

So for norms around PRs:

athursland commented 1 year ago

Note on the YAML file for reading in CPDB geoms: we are currently uploading only the shapefile to DO, but we want to update this to include the whole CPDB geom subdirectory with all the relevant files. @DeaBardhoshi

athursland commented 1 year ago

Notes on results from unit testing:

athursland commented 1 year ago

Ideas for extending testing:

athursland commented 1 year ago

Observations from exploring laying in parks properties geometries:

athursland commented 1 year ago

Notes about potential enhancements (running):