add QAQC of source data

damonmcc commented 1 year ago

resolves #256

approach overview

each data product we build has a list of source datasets
before building a data product, we'd like to compare the staging versions of source data to reference versions
- the reference version is the one used in a chosen build version of the data product
- the staging version is very often the latest version
each source dataset is expected to be in our Digital Ocean S3 bucket edm-recipes as a MySQL dump file (.sql)
to compare source data, we load the reference and staging versions into our persistent postigs database in Digital Ocean (often known as EDM_DATA)
- there's now a new database and schema in it called edm-qaqc.source_data

changes

add a new streamlit component in sources_report.py
generalize and improve structure of streamlit pages (aka components) used for Zoning Tax Lots (aka ztl, db-zoningtaxlots)
use new sources report component in db-zoningtaxlots
add new functions to digital_ocean_utils.py (formerly digital_ocean_client.py
create postgres_utils.py with many new functions
use contants.py to declare common values at a high level across all modules and streamlit components
add unit tests for some of the new python functions (have been running them locally for now)
simplify the primary webpage (index.,py) by improving the importing of data product pages
rename setup script folder from dev/ to bash/ and call scripts in both Dockerfiles in a consistent way
configure how bandit linting flags security vulnerabilities in the repo using bandit.yaml
add .data/ folder to gitignore. it's a local cache of .sql files pulled from data-library to populate QAQC DB

notes

in the interest of building an MVP and code review, this has only been implemented for db-zoningtaxlots. work was started to expand to db-facilities but it doesn't seem worth making this PR any bigger

will do in a follow up PR

expand source data QAQC to db-facilities because it's a good candidate to try next (has neither of the expected version-related output files that db-zoningtaxlots has)
add a new github action to run new python tests in CI. so far the new tests have only been run locally
consider expanding details of source data checks (how many new rows? are row count changes expected? ...)

screenshots

damonmcc commented 1 year ago

todos

[x] improve names of function-like constants in constants.py and util files
[x] add a note with the placeholder example of manually declared source data versions (link) and perhaps in places it's related to
[x] don't use a placeholder envar with BUILD_ENGINE = os.getenv("SQL_ENGINE_EDM_DATA", "")

fvankrieken commented 1 year ago

A comment around engines and connections - this is from sqlalchemy docs

The Engine is intended to normally be a permanent fixture established up-front and maintained throughout the lifespan of an application. It is not intended to be created and disposed on a per-connection basis; it is instead a registry that maintains both a pool of connections as well as configurational information about the database and DBAPI in use, as well as some degree of internal caching of per-database resources.

So engine is more intended to be constant, and connections should be spun up and disposed per function/transaction/etc rather than the engine itself

damonmcc commented 1 year ago

@mbh329

There is a lot to digest but I generally like the changes. The function names are specific and atomic, the reorg. is nice and reduces duplicate code, and using the ZTL page to spec. out the source data QAQC is great. I think the changes made to add the source data QAQC is well thought out for an initial round of changes and will be helpful to spot problems with input data.

❤️

Is there a reason behind having 3 different versions of the same table for the sources report type in the ZTL page?

if you're talking about the 3 things in the DEV DEBUG SECTION, the reason was just to have various views of the "source report" while I was building out the real sections above it. I'll drop it

I don't know if I totally understand what the bandit.yml file is doing based on the PR write up and how they interact with the tests

yea this was new to me. bandit is like black but for security vulnerabilities, so it was warning about things like "you're using requests with no timeout" or B101: Test for use of assert and flagging them as highlights in vs code

we aren't running any tests that fail because of this, but since bandit is installed in the devcontainer and I found all the other checks really valuable, it seemed worth silencing the assert_used check in test files that will definitely use assert statements

damonmcc commented 1 year ago

todos

[x] optimize sql utils

damonmcc commented 1 year ago

@fvankrieken

A comment around engines and connections - this is from sqlalchemy docs

The Engine is intended to normally be a permanent fixture established up-front and maintained throughout the lifespan of an application. It is not intended to be created and disposed on a per-connection basis; it is instead a registry that maintains both a pool of connections as well as configurational information about the database and DBAPI in use, as well as some degree of internal caching of per-database resources.

So engine is more intended to be constant, and connections should be spun up and disposed per function/transaction/etc rather than the engine itself

makes sense. but if you mean we should be using sqlalchemy.Connection and remove any use of sqlalchemy.create_engine, I'm not seeing any use of a connection being created without create_engine in those docs. and for consistency across our repos, I looked and this is the approach used in all of em.

some notes on why I went with creating/disposing of an engine for every transaction:

this isn't an app or database with tons concurrent users, nor does the app do high frequency or potentially conflicting read/write operations. so optimizing connection resources and pooling didn't seem like a high priority
allows for a very functional approach which seems way simpler than having to create and pass around a persistent custom "SQL engine" class for all functions that would need one
without Engine.dispose(), the database was hitting limit of distinct connections

fvankrieken commented 1 year ago

@fvankrieken

A comment around engines and connections - this is from sqlalchemy docs

The Engine is intended to normally be a permanent fixture established up-front and maintained throughout the lifespan of an application. It is not intended to be created and disposed on a per-connection basis; it is instead a registry that maintains both a pool of connections as well as configurational information about the database and DBAPI in use, as well as some degree of internal caching of per-database resources.

So engine is more intended to be constant, and connections should be spun up and disposed per function/transaction/etc rather than the engine itself

makes sense. but if you mean we should be using sqlalchemy.Connection and remove any use of sqlalchemy.create_engine, I'm not seeing any use of a connection being created without create_engine in those docs. and for consistency across our repos, I looked and this is the approach used in all of em.

some notes on why I went with creating/disposing of an engine for every transaction:

this isn't an app or database with tons concurrent users, nor does the app do high frequency or potentially conflicting read/write operations. so optimizing connection resources and pooling didn't seem like a high priority

allows for a very functional approach which seems way simpler than having to create and pass around a persistent custom "SQL engine" class for all functions that would need one

without Engine.dispose(), the database was hitting limit of distinct connections

Fair points. I don't think we'd necessarily have to pass it around though, from googling it seems something along the lines of this might be the best practice for engines in streamlit apps (though this was less clear than I thought it'd be)

@st.cache(allow_output_mutation=True)
def get_connection():
    return create_engine('...')

Then you just have to reference this elsewhere in code, which will make sure that the engine only gets created once. @st_cache is a streamlit-specific caching property

fvankrieken commented 1 year ago

Ah I was not familiar with it, but looks like you're well familiar with st_cache as you've used it throughout the PR 🙂

damonmcc commented 1 year ago

Fair points. I don't think we'd necessarily have to pass it around though, from googling it seems something along the lines of this might be the best practice for engines in streamlit apps (though this was less clear than I thought it'd be)
@st.cache(allow_output_mutation=True)
def get_connection():
    return create_engine('...')
Then you just have to reference this elsewhere in code, which will make sure that the engine only gets created once. @st_cache is a streamlit-specific caching property

that could work! although from the docs, looks like cache has been deprecated in favor of using cache_data and cache_resources

I did try using cache_data more often in new/existing functions, but was it ended up being more of pain than a benefit during dev (when expecting something to change it wasn't due to the cache). I'll try the get_connection and cache_resources approach in https://github.com/NYCPlanning/data-engineering-qaqc/pull/260

edit: I'll hold off on adding this to #260 to keep it a simple fix

NYCPlanning / data-engineering-qaqc