NYCPlanning / data-engineering-qaqc

streamlit app for data engineering
https://edm-data-engineering.nycplanningdigital.com

geometries visualization #151

Closed td928 closed 2 years ago

td928 commented 2 years ago

86 at least two reviewers required 🏘️ (I am not keeping the emoji completely straight yet, but this is more than 1, I am guessing?)

Overview

The last piece of the current CPDB QAQC iteration: visualizing the two main geometry files to assess whether something is out of whack, like we saw previously after the snap-to-grid fix. It took longer than I anticipated, not because of the visualization itself but because I took on refactoring the data ingestion process.

geometry_visualization_report

I did not end up streaming the files directly from DO; instead I opted to save the extracted geometry files locally and then read them into geopandas. Once this is done, the visualization itself is pretty straightforward with geopandas' plotting functionality. I do wonder whether giving it a basemap would be an improvement, but that requires some thinking about which tool the basemap should come from. My go-to is plotly, but plotly does not like plotting shapefiles, so some conversion of the polygons is required. If my thinking is more convoluted than it needs to be on this point, please let me know.
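
For reviewers, the save-locally-then-read flow could look roughly like the sketch below. The function names `extract_geometries` and `plot_geometries` and the directory layout are hypothetical, not necessarily what this PR uses; `geopandas.read_file` and `GeoDataFrame.plot` are the only geopandas calls involved.

```python
import io
import zipfile
from pathlib import Path
from typing import List


def extract_geometries(zip_bytes: bytes, dest_dir: str) -> List[Path]:
    """Unpack a zipped shapefile bundle into dest_dir and return the .shp paths.

    Hypothetical helper: the real app may name and lay out files differently.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(dest)
    # A shapefile is several sidecar files (.shp, .shx, .dbf, ...); geopandas
    # only needs the .shp path as long as the others sit next to it.
    return sorted(dest.rglob("*.shp"))


def plot_geometries(shp_path: Path):
    """Read one extracted shapefile and draw it with geopandas' default plot."""
    import geopandas as gpd  # imported lazily so extraction works without GDAL

    gdf = gpd.read_file(shp_path)
    return gdf.plot(figsize=(8, 8))  # returns a matplotlib Axes
```

Keeping extraction separate from plotting means the download step can be checked without geopandas installed, which may help while container installs are still being sorted out.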

helpers.py

Helper functions are consolidated; they mostly concern the data ingestion process. I think in the end @abrieff's approach won out: getting zip files and getting csvs now look much more alike, and the functions are dumber in that they take a single url as the way to find their way in s3. On the other hand, the returned object is stored the same way as in my other work and Jingyi's: a dictionary with the table names as keys.
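
As a rough illustration of the "single url in, dictionary of tables out" shape described above, here is a stdlib-only sketch. The names `tables_from_zip` and `get_tables` are hypothetical, and rows are returned as lists of dicts purely to keep the example dependency-free; the actual helpers presumably return DataFrames.

```python
import csv
import io
import zipfile
from pathlib import PurePosixPath
from typing import Dict, List
from urllib.request import urlopen


def tables_from_zip(zip_bytes: bytes) -> Dict[str, List[dict]]:
    """Return {table_name: rows} for every csv found inside a zip archive."""
    tables: Dict[str, List[dict]] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            path = PurePosixPath(member)
            if path.suffix.lower() != ".csv":
                continue  # skip non-tabular members
            text = archive.read(member).decode("utf-8")
            # The file stem (name without extension) becomes the dictionary key.
            tables[path.stem] = list(csv.DictReader(io.StringIO(text)))
    return tables


def get_tables(url: str) -> Dict[str, List[dict]]:
    """Download one zip from a single url and return its tables by name."""
    with urlopen(url) as response:
        return tables_from_zip(response.read())
```

Splitting the download from the parsing keeps the url-taking function "dumb" while letting the parsing be exercised on in-memory bytes.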

Oysters1874 commented 2 years ago

Okay, it seems I am still having problems installing geopandas inside the container. The error message says a GDAL API version must be specified. I will try to investigate it first.

mbh329 commented 2 years ago

I think a basemap would be a nice improvement, but I don't think it needs to be implemented in this PR. As is, it (would) show us if there are any records that are egregiously outside NYC's boundaries. However, after reviewing some of the records in the table that fall outside the NYC boundaries (water included), I think they should be getting captured as being within NYC. I don't know how much time should be spent on this, as it's only ~40 records and they are mapping in the Capital Planning Explorer. I included a screenshot of a record that I looked at.

Screen Shot 2022-08-03 at 5 08 33 PM
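
A quick way to list candidate records like these without a basemap might be a coarse bounding-box test, sketched below. The box values are approximate and chosen for illustration only, the function name `records_outside_bbox` is hypothetical, and the sketch assumes coordinates are longitude/latitude, which may not match the projection of the actual geometry files.

```python
from typing import Dict, List, Tuple

# Deliberately loose lon/lat box around NYC; approximate values chosen for
# illustration, not an official boundary.
NYC_BBOX = (-74.3, 40.4, -73.6, 41.0)  # (min_lon, min_lat, max_lon, max_lat)


def records_outside_bbox(
    points: Dict[str, Tuple[float, float]],
    bbox: Tuple[float, float, float, float] = NYC_BBOX,
) -> List[str]:
    """Return the ids of records whose (lon, lat) falls outside bbox."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return [
        record_id
        for record_id, (lon, lat) in points.items()
        if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat)
    ]
```

A box like this only flags egregious outliers; records sitting in the water just off the shoreline would still pass and would need a real boundary polygon to catch.
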
mbh329 commented 2 years ago

@Oysters1874 I didn't have any issues installing geopandas in the container, let me know if you still have issues

td928 commented 2 years ago

Went with a slightly different implementation for the shapefile conflict issue, but it should be a similar concept. Add a random feature branch for testing and let me know if it makes sense, @abrieff. Thanks!

Oysters1874 commented 2 years ago

> @Oysters1874 I didn't have any issues installing geopandas in the container, let me know if you still have issues

thank you so much! I have figured it out. Now it works.

Oysters1874 commented 2 years ago

looks good on my side as well

Oysters1874 commented 2 years ago

One tiny thing: there are two 'the' at the beginning of the description.