Closed: damonmcc closed this 1 year ago
todos

- `constants.py` and util files: `BUILD_ENGINE = os.getenv("SQL_ENGINE_EDM_DATA", "")`
@mbh329
There is a lot to digest but I generally like the changes. The function names are specific and atomic, the reorg is nice and reduces duplicate code, and using the ZTL page to spec out the source data QAQC is great. I think the changes made to add the source data QAQC are well thought out for an initial round of changes and will be helpful for spotting problems with input data.
❤️
- Is there a reason behind having 3 different versions of the same table for the `sources` report type in the ZTL page?

if you're talking about the 3 things in the DEV DEBUG SECTION, the reason was just to have various views of the "source report" while I was building out the real sections above it. I'll drop it
- I don't know if I totally understand what the bandit.yml file is doing based on the PR write-up, or how it interacts with the tests

yea, this was new to me. `bandit` is like `black` but for security vulnerabilities, so it was warning about things like "you're using requests with no timeout" or B101: Test for use of assert, and flagging them as highlights in VS Code. we aren't running any tests that fail because of this, but since `bandit` is installed in the devcontainer and I found all the other checks really valuable, it seemed worth silencing the `assert_used` check in test files, which will definitely use assert statements
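for reference, the silencing is just a per-check config, something like this (a sketch; the exact glob patterns may differ from what's in the repo):

```yaml
# bandit.yaml (sketch) - configure the B101 assert_used check
# to skip test files, which legitimately use assert statements
assert_used:
  skips: ["*/test_*.py", "*/tests/*.py"]
```

bandit picks this up when run with `bandit -c bandit.yaml -r .`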
@fvankrieken
A comment around engines and connections - this is from the sqlalchemy docs:

> The Engine is intended to normally be a permanent fixture established up-front and maintained throughout the lifespan of an application. It is not intended to be created and disposed on a per-connection basis; it is instead a registry that maintains both a pool of connections as well as configurational information about the database and DBAPI in use, as well as some degree of internal caching of per-database resources.

So the engine is intended to be more of a constant, and connections should be spun up and disposed of per function/transaction/etc. rather than the engine itself
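i.e. something like this (a rough sketch, not code from this repo; `table_row_count` is just a made-up example):

```python
import os

from sqlalchemy import create_engine, text

# one engine for the app's lifetime; it owns the connection pool
ENGINE = create_engine(os.getenv("SQL_ENGINE_EDM_DATA", ""))

def table_row_count(table_name: str) -> int:
    # connections are short-lived: opened per call, returned to the pool on exit
    with ENGINE.connect() as conn:
        return conn.execute(text(f"SELECT COUNT(*) FROM {table_name}")).scalar()
```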
makes sense. but if you mean we should be using `sqlalchemy.Connection` and remove any use of `sqlalchemy.create_engine`, I'm not seeing any use of a connection being created without `create_engine` in those docs. and for consistency across our repos, I looked and this is the approach used in all of 'em.

some notes on why I went with creating/disposing of an engine for every transaction:

- this isn't an app or database with tons of concurrent users, nor does the app do high-frequency or potentially conflicting read/write operations. so optimizing connection resources and pooling didn't seem like a high priority
- it allows for a very functional approach, which seems way simpler than having to create and pass around a persistent custom "SQL engine" class to all functions that would need one
- without `Engine.dispose()`, the database was hitting its limit of distinct connections
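so the pattern in the PR looks roughly like this (illustrative sketch, not the exact code; `execute_query` stands in for the various helpers):

```python
import os

from sqlalchemy import create_engine, text

def execute_query(query: str) -> list:
    # a fresh engine per call: no shared state to pass around between functions
    engine = create_engine(os.getenv("SQL_ENGINE_EDM_DATA", ""))
    try:
        with engine.connect() as conn:
            return conn.execute(text(query)).fetchall()
    finally:
        # without this, the database hits its limit of distinct connections
        engine.dispose()
```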
Fair points. I don't think we'd necessarily have to pass it around though; from googling, it seems something along the lines of this might be the best practice for engines in streamlit apps (though this was less clear than I thought it'd be):

```python
@st.cache(allow_output_mutation=True)
def get_connection():
    return create_engine('...')
```

Then you just have to reference this elsewhere in code, which will make sure that the engine only gets created once. `@st.cache` is a streamlit-specific caching decorator
Ah I was not familiar with it, but looks like you're well familiar with st_cache as you've used it throughout the PR 🙂
that could work! although from the docs, it looks like `cache` has been deprecated in favor of `cache_data` and `cache_resource`. I did try using `cache_data` more often in new/existing functions, but it ended up being more of a pain than a benefit during dev (when expecting something to change, it wasn't, due to the cache). I'll try the `get_connection` and `cache_resource` approach in https://github.com/NYCPlanning/data-engineering-qaqc/pull/260
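roughly this (a sketch of what I have in mind, untested):

```python
import os

import streamlit as st
from sqlalchemy import create_engine

@st.cache_resource  # the non-deprecated way to cache things like engines
def get_engine():
    # created once and reused across script reruns, instead of per transaction
    return create_engine(os.getenv("SQL_ENGINE_EDM_DATA", ""))
```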
edit: I'll hold off on adding this to #260 to keep it a simple fix
resolves #256

approach overview

- source data is pulled from `edm-recipes` as a SQL dump file (`.sql`)
- the dump is loaded into the QAQC database (`EDM_DATA`)
- source tables live in the schema `edm-qaqc.source_data`
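for illustration, the load step amounts to something like this (hypothetical helper; the PR's actual function names and paths may differ):

```python
import os
import subprocess

def load_sql_dump(dump_path: str) -> None:
    # run a plain-SQL dump file against the QAQC database with psql;
    # the dump's statements create and populate the source data tables
    database_url = os.getenv("SQL_ENGINE_EDM_DATA", "")
    subprocess.run(["psql", database_url, "--file", dump_path], check=True)
```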
changes

- new `sources_report.py` for the source data report, starting with the ZTL page (`ztl`, i.e. `db-zoningtaxlots`)
- `digital_ocean_utils.py` (formerly `digital_ocean_client.py`)
- `postgres_utils.py` with many new functions
- `constants.py` to declare common values at a high level across all modules and streamlit components
- simplified the entrypoint (`index.py`) by improving the importing of data product pages
- renamed `dev/` to `bash/` and call scripts in both Dockerfiles in a consistent way
- `bandit` linting flags security vulnerabilities in the repo using `bandit.yaml`
- added the `data/` folder to gitignore. it's a local cache of `.sql` files pulled from data-library to populate the QAQC DB

notes
- only `db-zoningtaxlots` for now. work was started to expand to `db-facilities`, but it doesn't seem worth making this PR any bigger; will do in a follow-up PR
- `db-facilities` because it's a good candidate to try next (it has neither of the expected version-related output files that `db-zoningtaxlots` has)

screenshots