NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Data Engineering fellows notes #23

Closed damonmcc closed 9 months ago

damonmcc commented 11 months ago

a place for links and discussions for fellows @DeaBardhoshi and @athursland

general links

Historical Capital Spending

AmandaDoyle commented 11 months ago

@athursland and @DeaBardhoshi see running list of notes below to guide work for Thursday's meeting. cc @fvankrieken (please feel free to add)

I think the next steps are to:

Happy to collaborate however is most helpful!

AmandaDoyle commented 11 months ago

https://github.com/NYCPlanning/edm-overview/issues/819#issuecomment-1610085125

AmandaDoyle commented 11 months ago

Next steps from 06/29 meeting:

1) Collapse Checkbook NYC data to the project level to make it easier to work with (~926,562 records). Do not clean data, as it was indicated that the checks we thought were odd may be valid.
2) Familiarize yourselves with the Capital Projects Database, aka CPDB.
3) Run Checkbook NYC projects through the CPDB categorization process and create a report of the number of projects and number of $ in each category: 1) ITT, Vehicles, and Equipment, 2) Lump Sum, 3) Fixed Asset, or 4) NULL. I'm happy to help walk through the logic.
4) Join Checkbook NYC data onto CPDB. I think that this will best be done / set up with another DE team member to help get it going.
5) Report back results and discuss potential next steps. Future work may focus on specific agencies, or?
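
For the report in step 3, a minimal pandas sketch, assuming the categorization has already been run and its result sits in a column. The column names `project_id`, `category`, and `Check Amount` used as defaults here are placeholders, not the actual CPDB field names, and the categorization logic itself is not shown.

```python
import pandas as pd


def category_report(
    projects: pd.DataFrame,
    category_col: str = "category",      # hypothetical column holding the CPDB category
    id_col: str = "project_id",          # hypothetical project identifier column
    amount_col: str = "Check Amount",
) -> pd.DataFrame:
    """Count distinct projects and sum dollars per category."""
    # Label uncategorized rows explicitly so they show up as their own "NULL" row.
    labelled = projects.assign(**{category_col: projects[category_col].fillna("NULL")})
    return (
        labelled.groupby(category_col)
        .agg(n_projects=(id_col, "nunique"), total_dollars=(amount_col, "sum"))
        .reset_index()
    )
```

Filling missing categories with the string "NULL" before grouping keeps those projects in the report, since a plain groupby would drop them by default.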

AmandaDoyle commented 11 months ago

Morning! Did some digging to 1) help collapse records and 2) join Checkbook NYC onto CPDB

To join Checkbook NYC data onto CPDB, it appears we need to remove the last three digits and any trailing whitespace from the Capital Project value. For example, if I search 998CAP2024 005 in CPDB I do not get a result, but if I search 998CAP2024 I do get a match. Same thing with 841HWMBRT5 801 vs 841HWMBRT5.

Given this, I recommend:

1) Cleaning the Capital Project values by removing the last three digits and trailing whitespace, and renaming the column to FMS ID.
2) Grouping by the following columns: Agency, Budget Code, FMS ID (old Capital Project value), Expense Category, Fiscal year, Spending Category, and aggregating with SUM(Check Amount). You can drop all other columns.
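
A rough pandas sketch of those two steps, using the column names listed above. It assumes the three-digit suffix is always separated from the ID by whitespace, as in the two examples; values without such a suffix are left unchanged. The function name is just for illustration.

```python
import pandas as pd

GROUP_COLS = [
    "Agency",
    "Budget Code",
    "FMS ID",
    "Expense Category",
    "Fiscal year",
    "Spending Category",
]


def collapse_to_project_level(checks: pd.DataFrame) -> pd.DataFrame:
    """Derive FMS ID from Capital Project, then sum Check Amount per group."""
    out = checks.copy()
    # "998CAP2024 005" -> "998CAP2024": strip outer whitespace, then drop a
    # whitespace-separated three-digit suffix if one is present.
    out["FMS ID"] = (
        out["Capital Project"]
        .str.strip()
        .str.replace(r"\s+\d{3}$", "", regex=True)
    )
    # dropna=False keeps rows whose grouping values are missing.
    return out.groupby(GROUP_COLS, as_index=False, dropna=False)["Check Amount"].sum()
```

Requiring whitespace before the suffix is a deliberate choice: a pattern that strips any last three digits would also truncate IDs that are already clean.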

AmandaDoyle commented 11 months ago

@athursland and @DeaBardhoshi for tracking and note taking: for cleaning the Checkbook NYC data, please remove negative checks and checks with a value equal to $99,999,999.
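
One way this filter could look in pandas, assuming the amounts live in the Check Amount column and are numeric or parseable as numbers. The function and constant names are made up for this sketch.

```python
import pandas as pd

PLACEHOLDER_AMOUNT = 99_999_999  # the $99,999,999 value flagged above


def drop_suspect_checks(checks: pd.DataFrame, amount_col: str = "Check Amount") -> pd.DataFrame:
    """Remove negative checks and checks equal to the placeholder amount."""
    amounts = pd.to_numeric(checks[amount_col], errors="coerce")
    suspect = (amounts < 0) | (amounts == PLACEHOLDER_AMOUNT)
    # Rows with missing or unparseable amounts are kept; only the two
    # conditions named above remove a row.
    return checks.loc[~suspect].copy()
```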

athursland commented 11 months ago

Going to try joining on cpdb. Jumping off of our conversation during stand-up, is the Digital Ocean file 23_Q2_build/latest/output/cpdb_projects.csv the right file to grab for now? @fvankrieken @damonmcc
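
For the join itself, a minimal sketch assuming both tables are already loaded as DataFrames (for example with `pd.read_csv` on a downloaded copy of the file). The CPDB ID column name `maprojid` is an assumption here, so check the CSV header before relying on it; `FMS ID` is the cleaned Checkbook column discussed earlier.

```python
import pandas as pd


def join_checkbook_to_cpdb(
    checkbook: pd.DataFrame,
    cpdb: pd.DataFrame,
    checkbook_id: str = "FMS ID",
    cpdb_id: str = "maprojid",  # assumed name of the CPDB project ID column
) -> pd.DataFrame:
    """Left-join Checkbook rows onto CPDB projects, flagging unmatched rows."""
    return checkbook.merge(
        cpdb,
        how="left",
        left_on=checkbook_id,
        right_on=cpdb_id,
        indicator=True,            # adds a "_merge" column: "both" or "left_only"
        validate="many_to_one",    # raises if CPDB IDs are not unique
    )


def match_rate(merged: pd.DataFrame) -> float:
    """Share of Checkbook rows that found a CPDB project."""
    return float((merged["_merge"] == "both").mean())
```

The `_merge` column makes it easy to pull out the `left_only` rows and look at which IDs failed to match.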

fvankrieken commented 11 months ago

Yes, but with the caveat of no geometries in that file. The shapefiles in that folder have geometries, but obviously need to be converted into something you can use. I think you could use gdal/ogr2ogr for that but there's some overhead in that as well. Or actually I guess you could pull those into geopandas, if you're using geopandas already?

fvankrieken commented 11 months ago

Otherwise if it's easier for you, you could point a query at edm data. You'll need a new connection string to supply to sqlalchemy's create_engine function, in the format postgresql://{user}:{password}@{edm_data_url}:25060/defaultdb - args for that can be found with the rest of our secrets

Schema cpdb, table cpdb_opendata_projects_pts or cpdb_opendata_projects_poly (for points or polygons) I believe
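
A sketch of what that could look like, assuming geopandas is used to read the table. The helper names are hypothetical, the geometry column name `geom` is a guess, and the user, password, and host have to come from the team's secrets.

```python
from urllib.parse import quote_plus


def edm_data_url(user: str, password: str, host: str,
                 port: int = 25060, database: str = "defaultdb") -> str:
    """Build the connection string in the format given above."""
    # URL-encode credentials so characters like "@" or ":" don't break the URL.
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"


def read_cpdb_points(url: str):
    """Read the CPDB points table into a GeoDataFrame."""
    # Imported here so the URL helper works without these packages installed.
    import geopandas as gpd
    from sqlalchemy import create_engine

    engine = create_engine(url)
    # "geom" is an assumed geometry column name; adjust to the actual schema.
    return gpd.read_postgis(
        "SELECT * FROM cpdb.cpdb_opendata_projects_pts", engine, geom_col="geom"
    )
```

Swapping the table name for `cpdb_opendata_projects_poly` would give polygons instead of points.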

athursland commented 11 months ago

I was planning on using geopandas but I think querying the data would probably be a better exercise! Will reach out if I have any roadblocks

athursland commented 11 months ago

Whenever convenient, can someone provide guidance on where to grab archived CPDB with geometries?

damonmcc commented 11 months ago

@athursland if you have access to the edm-publishing S3 bucket, I'm seeing some versions in db-cbdp/main/, but that only goes back to 2021.

I'll keep looking for conveniently-located older versions

athursland commented 11 months ago

Notes on this sprint's work, @DeaBardhoshi feel free to fill in details on the categorization work:

athursland commented 11 months ago

Possible extensions:

DeaBardhoshi commented 11 months ago

Further directions/ideas:

  1. Building on the categorization work:
  2. NLP-related work for Projects pre-2017:
    • Exploring the info in Contract Purpose and incorporating it when adding spatial information
    • Looking into extracting place-name mentions (fuzzy string matching, NER, or other methods)
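
To make the fuzzy string matching idea concrete, here is a small standard-library sketch. It assumes a list of known place names is available (the `gazetteer` argument is hypothetical) and covers only fuzzy matching, not NER. The cutoff of 0.85 is an arbitrary starting point.

```python
import difflib


def find_place_mentions(text: str, gazetteer: list[str], cutoff: float = 0.85) -> list[str]:
    """Return gazetteer names that approximately appear in the text."""
    tokens = text.lower().split()
    hits = []
    for name in gazetteer:
        n_words = len(name.split())
        # Compare the name against every run of words of the same length.
        windows = [
            " ".join(tokens[i:i + n_words])
            for i in range(len(tokens) - n_words + 1)
        ]
        if difflib.get_close_matches(name.lower(), windows, n=1, cutoff=cutoff):
            hits.append(name)
    return hits
```

For example, `find_place_mentions("RECONSTRUCTION OF PROSPCT PARK BALLFIELDS", ["Prospect Park", "Central Park"])` matches only the misspelled "Prospect Park". Punctuation is not stripped here, so real Contract Purpose text would likely need some normalization first.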

athursland commented 10 months ago

Notes from Tuesday 7/11:

athursland commented 10 months ago

Notes from Wednesday 7/12:

athursland commented 10 months ago

Notes on odd ends of this sprint:

DeaBardhoshi commented 10 months ago

Some notes on further categorization directions with the Checkbook NYC data:

fvankrieken commented 10 months ago

So for norms around PRs:

athursland commented 10 months ago

Note for the YAML file for reading in CPDB geoms: we're currently uploading only the shapefile to DO, but we want to update this to include the whole CPDB geom subdirectory with all the relevant files. @DeaBardhoshi

athursland commented 10 months ago

Notes on results from unit testing:

athursland commented 10 months ago

Ideas for extending testing:

athursland commented 9 months ago

Observations from exploring laying in parks properties geometries:

athursland commented 9 months ago

Notes about potential enhancements (running):