maxachis commented 5 months ago

SQLAlchemy is an Object-Relational Mapping (ORM) tool that represents SQL database actions in Python code.

Why use SQLAlchemy?

- Support for raw SQL: SQLAlchemy can execute raw SQL statements where necessary. It isn't an either-or choice against SQL: it can complement it.
- Reduced risk of SQL injection: SQL injection is considerably more difficult with SQLAlchemy than with raw SQL, because values are passed as bound parameters and the interface discourages building queries from strings. The risk increases only when raw SQL is used.
- Reduced need to understand SQL: useful especially for developers less familiar with SQL. This enables backend developers to work more consistently within Python.
- Database-agnostic: outside of raw SQL queries, SQLAlchemy works with different SQL dialects, including SQLite, PostgreSQL, and Oracle. Thus, if we want to change databases, this can be done with minimal changes to our code.
- Tables and views as Python classes: this enhances the in-repository documentation. Rather than having to switch between the database and the Python code to see how tables are represented, we can have in-repository models that express the tables and their relationships with each other and that can be more easily viewed from an IDE.
- Less code: representing SQL in SQLAlchemy can reduce the amount of code, especially for simpler, more repetitive queries. It does this by abstracting boilerplate code and enabling leaner Python syntax in place of wordier SQL syntax. That can make maintenance easier.
- Integrates with Flask via Flask-SQLAlchemy.
- Strong community support and extensive documentation.
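
As a rough sketch of several of these points together (a model class, an ORM query, and raw SQL with bound parameters), something like the following should work. The `Agency` model, its columns, and the sample row are hypothetical and are not taken from this repository; in-memory SQLite stands in for the real database.

```python
from sqlalchemy import Column, Integer, String, create_engine, select, text
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Agency(Base):
    """Hypothetical table, used only for illustration."""

    __tablename__ = "agencies"

    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    state = Column(String(2))


# In-memory SQLite stands in for the real database (assumption).
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Agency(name="Example Police Department", state="PA"))
    session.commit()

    # ORM query: the value travels as a bound parameter, so a hostile
    # string is compared as plain data instead of being run as SQL.
    hostile_input = "PA'; DROP TABLE agencies; --"
    orm_rows = session.execute(
        select(Agency).where(Agency.state == hostile_input)
    ).scalars().all()

    # Raw SQL is still available; text() with named parameters keeps
    # the value bound rather than formatted into the string.
    raw_names = session.execute(
        text("SELECT name FROM agencies WHERE state = :state"),
        {"state": "PA"},
    ).scalars().all()
```

The hostile string matches no rows and the table is left intact, while the raw query still returns the one sample row.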

Why not use SQLAlchemy?

- Overhead for learning SQLAlchemy logic: SQLAlchemy can be thought of as a kind of transitional interface between Python and SQL, which will itself have to be learned.
- Model maintenance: models represented in code will need to be updated as their corresponding tables are updated. Tools such as Alembic can help with this and are worth considering later on, though I have little experience with them.
- Performance overhead: SQLAlchemy adds a layer of abstraction, which means that queries can be slower, especially more complex ones.
- Implicit behavior: SQLAlchemy, by its nature, hides the SQL queries being executed. These queries can be viewed for debugging purposes, but generally speaking, using SQLAlchemy requires a degree of trust that it's executing the intended SQL.
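
On the point about viewing queries for debugging, here is a minimal sketch of one way to inspect what SQLAlchemy would send. The `Agency` model is hypothetical; `str()` on a statement renders it in a generic dialect with placeholders in place of the bound values.

```python
from sqlalchemy import Column, Integer, String, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Agency(Base):
    """Hypothetical table, used only for illustration."""

    __tablename__ = "agencies"

    id = Column(Integer, primary_key=True)
    name = Column(String)


stmt = select(Agency.name).where(Agency.id == 5)

# Render the statement as SQL text; bound values appear as placeholders.
generated_sql = str(stmt)
print(generated_sql)
```

Passing `echo=True` to `create_engine` is another option: it logs each statement as it is executed.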

My thoughts

SQLAlchemy's advantages outweigh its downsides, and SQLAlchemy is widely used for precisely that reason. It reduces language switching, reduces the risk of SQL injection, enhances self-documentation, and generally reduces the total number of lines of code. If we were working in an environment where performance on the order of milliseconds counted, it'd be a different story. Since we aren't, the performance overhead is acceptable and well compensated by how much easier it will make things for developers.
Converting from raw SQL to SQLAlchemy will take time, but it can be done piecemeal, gradually replacing components with SQLAlchemy.

Resources:

SQLAlchemy: What is it? What's it for?

maxachis commented 3 months ago

@EvilDrPurple When we get further along, I'd like to look into SQLAlchemy's connection pooling functionality. #376 occurred in part because we only had one active connection being managed at a time, so having multiple connections in a connection pool might be a way to resolve the problem in a more thorough way than the one I offered for that issue.

This is not a remotely high priority at the moment, but something I wanted to keep on the radar for eventual exploration.
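
For reference, a minimal sketch of what a connection pool looks like in plain SQLAlchemy. The pool sizes are arbitrary, and in-memory SQLite with an explicitly chosen `QueuePool` stands in for the real PostgreSQL database, so this is an illustration and not a proposed configuration.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import QueuePool

# Assumption: SQLite stands in for PostgreSQL; sizes are illustrative.
engine = create_engine(
    "sqlite://",
    poolclass=QueuePool,
    pool_size=2,      # connections kept open in the pool
    max_overflow=0,   # no extra connections beyond pool_size
)

# Two connections can be checked out at the same time.
conn_a = engine.connect()
conn_b = engine.connect()
in_use = engine.pool.checkedout()

result = conn_a.execute(text("SELECT 1")).scalar()

# close() returns a connection to the pool; it does not discard it.
conn_a.close()
conn_b.close()
in_use_after_close = engine.pool.checkedout()

# dispose() closes the pooled connections, e.g. at test teardown.
engine.dispose()
```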

maxachis commented 3 months ago

So after attempting to integrate the changes @EvilDrPurple made with the ones I'm making in #318, I identified a few issues that weren't immediately apparent when just looking at the SQLAlchemy changes in isolation:

- To get everything working with the SQLAlchemy conversion, a lot of code outside of the DatabaseClient is also impacted. app.py, as well as the pytest tests and fixtures, has to be changed, sometimes substantially. This poses a problem when other already-substantial changes, such as #318, are occurring at the same time. It's a lot of logic that all needs to be resolved simultaneously.
- Currently, the SQLAlchemy code does not play well with the Psycopg2 code. My guess is that this partly has to do with different connection strings being maintained at the same time, which have to be kept in sync with each other.
- Flask-SQLAlchemy introduces a global variable, which poses some additional issues: it makes it harder to isolate tests, for example. It's possible that this would be more avoidable with SQLAlchemy alone.
- There's a bug where multiple connections are being opened but not closed as more tests are run. Again, this seems partly related to the global variable issue mentioned above, with unclosed connections accumulating as more tests are run.

The end result is that even a piecemeal implementation of SQLAlchemy becomes considerably more difficult. Too much logic is tightly coupled in the code base at the moment to allow an easy shift to a new way of interacting with the database, and SQLAlchemy (or at least Flask-SQLAlchemy) in particular encourages the use of global variables that make reconciling changes more challenging.

The way I see it, we have a few options here, taking into account the current limitations and the fact that other important changes, like #318, are also in the pipeline:

- Try to reconcile the existing #318 code with SQLAlchemy. That's what I was working on this morning, but after a few hours my progress was quite slow, and spaghetti code was being produced.
- Roll back the code and reconfigure the SQLAlchemy changes to minimally affect the rest of the code. A lot of the most important SQLAlchemy logic can probably be preserved, but new work will be required. Auth code will be pushed forward, and the SQLAlchemy changes will need to account for that.
- Roll back the code and table SQLAlchemy until a later point when the code base is generally less coupled, making it easier to implement SQLAlchemy logic.
- Roll back the code and table SQLAlchemy, possibly indefinitely.

josh-chamberlain commented 3 months ago

@maxachis your instinct about priority is right—the auth work must continue so that API changes may continue. SQLAlchemy is a good idea, but...but. It does make sense that it wouldn't play nice with Psycopg2, and it seems like Psycopg2 would be made unnecessary with SQLAlchemy?

maxachis commented 3 months ago

Exactly. We only need Psycopg2 or SQLAlchemy, and the only reason for having both is that we're transitioning from one to the other. But we want to minimize the friction in that transition and upset as little as possible.

I do want to get @EvilDrPurple's feedback on this, but my sense is that we will need to roll back. It's one thing if we only have to change some functions in DatabaseClient and maybe a few other limited areas, but it's another thing if changes to DatabaseClient necessitate considerable changes to app.py, the tests, and the fixtures.

On a long enough timescale, the time spent overhauling everything to SQLAlchemy right now would eventually, I believe, pay for itself. But currently our timescale is short because we want to get v2 out the door!

josh-chamberlain commented 3 months ago

ok! As the two back end devs, y'all have my permission to make the transition abrupt/friction-ful, because we're protected by the fork and dev environment. Thanks for thinking this through.

maxachis commented 3 months ago

Alright! For the moment, we are reverted, and @EvilDrPurple is going to look into a more modular approach while I work on auth!

Current plan is as follows:

- [x] First, simplify the DB client so that it pulls a connection object, and adjust the tests to account for that.
- [ ] Then, look into building the SQLAlchemy integration on top of that single connection object or, if we can't do that, make sure the psycopg2 and SQLAlchemy connections are tightly related.
- [ ] Then, start expanding outwards from there.
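
One possible shape for the second step, sketched under assumptions: `create_engine` accepts a `creator` callable, so an engine can be built on a connection that the existing client already holds. Here `sqlite3` stands in for psycopg2 so the sketch runs anywhere, and the table and row are hypothetical. Whether this fits the actual DatabaseClient is untested.

```python
import sqlite3

from sqlalchemy import create_engine, text
from sqlalchemy.pool import StaticPool

# sqlite3 stands in for psycopg2 here (assumption); with PostgreSQL the
# creator would return the existing psycopg2 connection instead.
raw_conn = sqlite3.connect(":memory:")

# Existing driver-level code keeps using the connection directly.
raw_conn.execute("CREATE TABLE agencies (id INTEGER PRIMARY KEY, name TEXT)")
raw_conn.execute(
    "INSERT INTO agencies (name) VALUES (?)", ("Example Police Department",)
)
raw_conn.commit()

# SQLAlchemy is pointed at the same connection through `creator`.
# StaticPool keeps exactly one connection, so nothing new is opened.
engine = create_engine(
    "sqlite://",
    creator=lambda: raw_conn,
    poolclass=StaticPool,
)

# New code can use SQLAlchemy while old code keeps using raw_conn.
with engine.connect() as conn:
    names = conn.execute(text("SELECT name FROM agencies")).scalars().all()
```

Because both sides share one connection, there is only one connection string to maintain during the transition.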

Police-Data-Accessibility-Project / data-sources-app

Migrate backend code to SQLAlchemy #303
