NYCPlanning / data-engineering-qaqc

streamlit app for data engineering
https://edm-data-engineering.nycplanningdigital.com
1 stars 0 forks source link

add QAQC of source data #258

Closed damonmcc closed 1 year ago

damonmcc commented 1 year ago

resolves #256

approach overview

changes

notes

will do in a follow up PR

screenshots

damonmcc commented 1 year ago

todos

fvankrieken commented 1 year ago

A comment around engines and connections - this is from sqlalchemy docs

The Engine is intended to normally be a permanent fixture established up-front and maintained throughout the lifespan of an application. It is not intended to be created and disposed on a per-connection basis; it is instead a registry that maintains both a pool of connections as well as configurational information about the database and DBAPI in use, as well as some degree of internal caching of per-database resources.

So engine is more intended to be constant, and connections should be spun up and disposed per function/transaction/etc rather than the engine itself

damonmcc commented 1 year ago

@mbh329

There is a lot to digest but I generally like the changes. The function names are specific and atomic, the reorg. is nice and reduces duplicate code, and using the ZTL page to spec. out the source data QAQC is great. I think the changes made to add the source data QAQC is well thought out for an initial round of changes and will be helpful to spot problems with input data.

❤️

  • Is there a reason behind having 3 different versions of the same table for the sources report type in the ZTL page?

if you're talking about the 3 things in the DEV DEBUG SECTION, the reason was just to have various views of the "source report" while I was building out the real sections above it. I'll drop it

  • I don't know if I totally understand what the bandit.yml file is doing based on the PR write up and how they interact with the tests

yea this was new to me. bandit is like black but for security vulnerabilities, so it was warning about things like "you're using requests with no timeout" or B101: Test for use of assert and flagging them as highlights in vs code

we aren't running any tests that fail because of this, but since bandit is installed in the devcontainer and I found all the other checks really valuable, it seemed worth silencing the assert_used check in test files that will definitely use assert statements

damonmcc commented 1 year ago

todos

damonmcc commented 1 year ago

@fvankrieken

A comment around engines and connections - this is from sqlalchemy docs

The Engine is intended to normally be a permanent fixture established up-front and maintained throughout the lifespan of an application. It is not intended to be created and disposed on a per-connection basis; it is instead a registry that maintains both a pool of connections as well as configurational information about the database and DBAPI in use, as well as some degree of internal caching of per-database resources.

So engine is more intended to be constant, and connections should be spun up and disposed per function/transaction/etc rather than the engine itself

makes sense. but if you mean we should be using sqlalchemy.Connection and remove any use of sqlalchemy.create_engine, I'm not seeing any use of a connection being created without create_engine in those docs. and for consistency across our repos, I looked and this is the approach used in all of em.

some notes on why I went with creating/disposing of an engine for every transaction:

fvankrieken commented 1 year ago

@fvankrieken

A comment around engines and connections - this is from sqlalchemy docs

The Engine is intended to normally be a permanent fixture established up-front and maintained throughout the lifespan of an application. It is not intended to be created and disposed on a per-connection basis; it is instead a registry that maintains both a pool of connections as well as configurational information about the database and DBAPI in use, as well as some degree of internal caching of per-database resources.

So engine is more intended to be constant, and connections should be spun up and disposed per function/transaction/etc rather than the engine itself

makes sense. but if you mean we should be using sqlalchemy.Connection and remove any use of sqlalchemy.create_engine, I'm not seeing any use of a connection being created without create_engine in those docs. and for consistency across our repos, I looked and this is the approach used in all of em.

some notes on why I went with creating/disposing of an engine for every transaction:

  • this isn't an app or database with tons concurrent users, nor does the app do high frequency or potentially conflicting read/write operations. so optimizing connection resources and pooling didn't seem like a high priority
  • allows for a very functional approach which seems way simpler than having to create and pass around a persistent custom "SQL engine" class for all functions that would need one
  • without Engine.dispose(), the database was hitting limit of distinct connections

Fair points. I don't think we'd necessarily have to pass it around though, from googling it seems something along the lines of this might be the best practice for engines in streamlit apps (though this was less clear than I thought it'd be)

@st.cache(allow_output_mutation=True)
def get_connection():
    return create_engine('...')

Then you just have to reference this elsewhere in code, which will make sure that the engine only gets created once. @st_cache is a streamlit-specific caching property

fvankrieken commented 1 year ago

Ah I was not familiar with it, but looks like you're well familiar with st_cache as you've used it throughout the PR 🙂

damonmcc commented 1 year ago

Fair points. I don't think we'd necessarily have to pass it around though, from googling it seems something along the lines of this might be the best practice for engines in streamlit apps (though this was less clear than I thought it'd be)

@st.cache(allow_output_mutation=True)
def get_connection():
    return create_engine('...')

Then you just have to reference this elsewhere in code, which will make sure that the engine only gets created once. @st_cache is a streamlit-specific caching property

that could work! although from the docs, looks like cache has been deprecated in favor of using cache_data and cache_resources

I did try using cache_data more often in new/existing functions, but was it ended up being more of pain than a benefit during dev (when expecting something to change it wasn't due to the cache). I'll try the get_connection and cache_resources approach in https://github.com/NYCPlanning/data-engineering-qaqc/pull/260

edit: I'll hold off on adding this to #260 to keep it a simple fix