Closed kmccurley closed 1 year ago
It occurs to me that I've created a bit of a mess in the way I have persisted data on disk, and I think the solution is to remove SQLAlchemy.
I started out using pydantic to define objects. This gives an easy way to serialize as JSON, and I was writing those to disk. Later I discovered in issue #12 that writing JSON to disk was not a good way to achieve persistence, so I switched to a database in which I stored the serialization of the pydantic Compilation object. This was an easy transition, but it now means we have a dependency on both pydantic and sqlalchemy. As an alternative we could have used the ORM model of SQLAlchemy to define the entire hierarchy of the Compilation object, but that's a gigantic commitment to SQLAlchemy.
My feeling at this time is that SQLAlchemy is a cause of problems.
One reason to use SQLAlchemy is to use a pool of database connections, but my feeling is that the traffic is so low that we will never experience a problem with creating a new connection for each request (this is how eprint was written).
At this point I lean toward ripping out SQLAlchemy. The original use was for flask-login to have a User class, but that part is easy to code up without SQLAlchemy.
After spending a couple of days reading about SQLAlchemy, I came to the conclusion was that I used the flask-sqlalchemy
package instead of just SQLAlchemy. One of the consequences is that models are defined in nonstandard syntax (by sqlalchemy standards anyway), and flask-sqlalchemy
depends on version 1.4 of SQLAlchemy. The most recent version of SQLAlchemy is 2.0, and this is a fairly major change in the APIs.
The only thing that we used from flask-sqlalchemy
was cleanup of sessions after each request, but it turns out that there is an easy alternative in SQLAlchemy itself, namely scoped_session
. This is built using connection pools, and the only thing we have to do is make sure to release the session at the end of each request.
For now I think the solution is to remove flask-sqlalchemy
as suggested here and upgrade to sqlalchemy 2.0. This preserves our ability to write simple ORM queries. The foreign key cascade problem seems to have a different solution, namely to use ondelete='CASCADE'
in ForeignKey
definitions, and use cursor.execute('PRAGMA foreign_keys=ON')
in the rare case when we are deleting a PaperStatus
.
One problem that this solves for us over the approach used in IACR/eprint
is that tests can make use of sqlite, even if the production system eventually shifts to postgres or mysql or mariadb.
SQLAlchemy has been updated to 2.0. I'm still not convinced that it's worth keeping because it feels easier to just write SQL. There are two ways to use SQLAlchemy, namely as the core layer and the ORM layer. We are mostly using the ORM layer, which allows us to manipulate objects instead of raw Result
sets. The two paradigms can be mixed. By keeping SQLAlchemy it makes it easy for us to write tests using sqlite, but use mariadb in production.
We are now using a database to track a bunch of things:
PaperStatus
of a paper. There is a unique key of paperid. This is used to track the paper through the workflow.CompileRecord
stores information about a compilation (including metadata). The primary key is (paperid, version) with paperid as a foreign key toPaperStatus
. There will be severalCompileRecord
for a paper.Discussion
contains copyedit discussion items. These reference the paperid as foreign keys.Each of these tables corresponds to an object in db_model.py, and we use the Object-relational feature of SQLAlchemy to track these in the database. The foreign key relationships should be maintained, but unfortunately sqlite3 has only limited support for foreign keys and it requires special handling in the table creation and connection to maintain the "on delete cascade" constraint. We need to make sure that our definitions of objects guarantee that the foreign key relationships are obeyed correctly. This may require us to switch to mysql, mariadb, or pymysql.
This is complicated by the fact that I used the flask-sqlalchemy package API to define our objects.