IACR / latex-submit

Web server to receive uploaded LaTeX and execute it in a docker container.
GNU Affero General Public License v3.0

Database architecture currently uses SQLAlchemy on sqlite #29

Closed: kmccurley closed this 10 months ago

kmccurley commented 1 year ago

We are now using a database to track a bunch of things:

  1. PaperStatus of a paper. Its unique key is paperid, and it is used to track the paper through the workflow.
  2. CompileRecord stores information about a compilation (including metadata). The primary key is (paperid, version), with paperid as a foreign key to PaperStatus. There will be several CompileRecords for a paper (see the sketch after this list).
  3. Discussion contains copyedit discussion items. These reference paperid as a foreign key.
  4. logevents. These reference paperid as a foreign key.
  5. users on the site. There is no reference to papers, so this table is independent.
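For reference, the definitions are written in the flask-sqlalchemy style, roughly like the following sketch; everything beyond the paperid and version keys is an illustrative guess, not the actual contents of db_model.py:

```python
# Rough sketch of two of the db_model.py objects in the flask-sqlalchemy style.
# Column names other than paperid and version are guesses.
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///submissions.db'  # placeholder
db = SQLAlchemy(app)

class PaperStatus(db.Model):
    paperid = db.Column(db.String(32), primary_key=True)
    status = db.Column(db.String(32))

class CompileRecord(db.Model):
    paperid = db.Column(db.String(32),
                        db.ForeignKey('paper_status.paperid', ondelete='CASCADE'),
                        primary_key=True)
    version = db.Column(db.Integer, primary_key=True)
    result = db.Column(db.Text)   # serialized Compilation JSON
```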

Each of these tables corresponds to an object in db_model.py, and we use the object-relational feature of SQLAlchemy to track these in the database. The foreign key relationships should be maintained, but unfortunately sqlite3 leaves foreign key enforcement off by default, so it requires special handling in the table creation and on each connection to maintain the "on delete cascade" constraint. We need to make sure that our definitions of objects guarantee that the foreign key relationships are obeyed correctly. This may require us to switch to mysql or mariadb (e.g., via pymysql).
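The per-connection handling amounts to turning foreign key enforcement on, since sqlite ships with it disabled. A minimal sketch of the usual SQLAlchemy pattern, assuming a sqlite URL:

```python
# Minimal sketch: enable sqlite foreign key enforcement on every new
# connection; sqlite leaves it off by default.
from sqlalchemy import create_engine, event

engine = create_engine('sqlite:///submissions.db')  # URL is a placeholder

@event.listens_for(engine, 'connect')
def _set_sqlite_pragma(dbapi_connection, connection_record):
    cursor = dbapi_connection.cursor()
    cursor.execute('PRAGMA foreign_keys=ON')
    cursor.close()
```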

This is complicated by the fact that I used the flask-sqlalchemy package API to define our objects.

kmccurley commented 1 year ago

It occurs to me that I've created a bit of a mess in the way I have persisted data on disk, and I think the solution is to remove SQLAlchemy.

I started out using pydantic to define objects. This gives an easy way to serialize as JSON, and I was writing those to disk. Later I discovered in issue #12 that writing JSON to disk was not a good way to achieve persistence, so I switched to a database in which I stored the serialization of the pydantic Compilation object. This was an easy transition, but it now means we have a dependency on both pydantic and sqlalchemy. As an alternative we could have used the ORM model of SQLAlchemy to define the entire hierarchy of the Compilation object, but that's a gigantic commitment to SQLAlchemy.
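In practice that means the Compilation column is just a JSON blob that pydantic produces and consumes. A rough sketch, assuming pydantic v2 and a text column named result (both of which are guesses):

```python
# Rough sketch: persist a pydantic Compilation as JSON text in a column
# and rehydrate it on the way back out. Field names are illustrative.
from pydantic import BaseModel

class Compilation(BaseModel):      # stand-in for the real Compilation model
    paperid: str
    version: int
    log: str = ''

def save_compilation(session, record, compilation: Compilation) -> None:
    record.result = compilation.model_dump_json()   # serialize to JSON text
    session.add(record)
    session.commit()

def load_compilation(record) -> Compilation:
    return Compilation.model_validate_json(record.result)
```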

My feeling at this time is that SQLAlchemy is a cause of problems.

  1. The documentation of SQLAlchemy is a mess. There are tiny fragments of code, but there are five ways to write any given thing and it's never clear which is best.
  2. The "impedance mismatch" of ORM is starting to rear its ugly head. There are times when you want a join, like fetching both the PaperStatus and the CompileRecord for a version, but you shouldn't always have to fetch the PaperStatus along with the CompileRecord (see the sketch after this list). You still end up writing SQL, but now it's SQL expressed through some object-oriented method that obscures the actual SQL being run.
  3. The change to SQLAlchemy 2.0 is apparently not recognized by flask-sqlalchemy, and they deprecated the original way of writing queries. By contrast, SQL itself has been stable for decades.
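To illustrate the second point, the join described above looks roughly like this in the 2.0 ORM style (using the model names from the first comment; the exact attributes are guesses):

```python
# Rough sketch: fetch a PaperStatus together with one CompileRecord version.
# This is ordinary SQL dressed up in select()/join() calls.
from sqlalchemy import select
from sqlalchemy.orm import Session

def get_status_and_compilation(session: Session, paperid: str, version: int):
    stmt = (
        select(PaperStatus, CompileRecord)
        .join(CompileRecord, CompileRecord.paperid == PaperStatus.paperid)
        .where(PaperStatus.paperid == paperid,
               CompileRecord.version == version)
    )
    return session.execute(stmt).first()   # (PaperStatus, CompileRecord) or None
```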

One reason to use SQLAlchemy is to use a pool of database connections, but my feeling is that the traffic is so low that we will never experience a problem with creating a new connection for each request (this is how eprint was written).
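For what it's worth, the connection-per-request pattern from eprint is only a few lines with Flask's application context; a sketch, with the database path as a placeholder:

```python
# Sketch of connection-per-request (no pooling) using Flask's app context.
import sqlite3
from flask import Flask, g

app = Flask(__name__)
DATABASE = 'submissions.db'   # placeholder path

def get_db():
    if 'db' not in g:
        g.db = sqlite3.connect(DATABASE)
        g.db.execute('PRAGMA foreign_keys=ON')
    return g.db

@app.teardown_appcontext
def close_db(exc):
    db = g.pop('db', None)
    if db is not None:
        db.close()
```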

At this point I lean toward ripping out SQLAlchemy. The original use was for flask-login to have a User class, but that part is easy to code up without SQLAlchemy.
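For the flask-login part, the User class is indeed short without SQLAlchemy; a sketch, where the lookup function and field names are placeholders for whatever store replaces the users table:

```python
# Sketch: a flask-login User that does not depend on SQLAlchemy.
from flask_login import UserMixin

class User(UserMixin):
    def __init__(self, email: str, role: str = 'author'):
        self.email = email
        self.role = role

    def get_id(self) -> str:        # flask-login stores this in the session
        return self.email

# registered with LoginManager.user_loader elsewhere
def load_user(email: str):
    row = lookup_user_row(email)    # hypothetical query against the users table
    return User(row['email'], row['role']) if row else None
```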

kmccurley commented 1 year ago

After spending a couple of days reading about SQLAlchemy, I came to the conclusion that the root of the problem is that I used the flask-sqlalchemy package instead of plain SQLAlchemy. One of the consequences is that the models are defined in nonstandard syntax (nonstandard by SQLAlchemy standards, anyway), and flask-sqlalchemy depends on version 1.4 of SQLAlchemy. The most recent version of SQLAlchemy is 2.0, which is a fairly major change to the APIs.
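For comparison, the same models written in standard SQLAlchemy 2.0 declarative syntax look roughly like this (again an illustration, not the actual db_model.py):

```python
# Rough sketch of the models in standard SQLAlchemy 2.0 declarative style.
from typing import List
from sqlalchemy import ForeignKey, Integer, String, Text
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship

class Base(DeclarativeBase):
    pass

class PaperStatus(Base):
    __tablename__ = 'paper_status'
    paperid: Mapped[str] = mapped_column(String(32), primary_key=True)
    status: Mapped[str] = mapped_column(String(32))
    compilations: Mapped[List['CompileRecord']] = relationship(
        back_populates='paper', cascade='all, delete-orphan')

class CompileRecord(Base):
    __tablename__ = 'compile_record'
    paperid: Mapped[str] = mapped_column(
        ForeignKey('paper_status.paperid', ondelete='CASCADE'), primary_key=True)
    version: Mapped[int] = mapped_column(Integer, primary_key=True)
    result: Mapped[str] = mapped_column(Text)     # serialized Compilation JSON
    paper: Mapped['PaperStatus'] = relationship(back_populates='compilations')
```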

The only thing that we used from flask-sqlalchemy was the cleanup of sessions after each request, but it turns out that there is an easy alternative in SQLAlchemy itself, namely scoped_session. It sits on top of the engine's connection pool, and the only thing we have to do is make sure the session is released at the end of each request.
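A sketch of that replacement, assuming a module-level engine (the URL is a placeholder):

```python
# Sketch: scoped_session plus a Flask teardown hook replaces the per-request
# session cleanup that flask-sqlalchemy was providing.
from flask import Flask
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine('sqlite:///submissions.db')   # placeholder URL
Session = scoped_session(sessionmaker(bind=engine))

app = Flask(__name__)

@app.teardown_appcontext
def remove_session(exc=None):
    Session.remove()   # release the session; its connection returns to the pool
```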

For now I think the solution is to remove flask-sqlalchemy as suggested here and upgrade to sqlalchemy 2.0. This preserves our ability to write simple ORM queries. The foreign key cascade problem seems to have a different solution, namely to use ondelete='CASCADE' in ForeignKey definitions, and use cursor.execute('PRAGMA foreign_keys=ON') in the rare case when we are deleting a PaperStatus.
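A sketch of that rare deletion path, assuming the ForeignKey definitions carry ondelete='CASCADE' as above:

```python
# Sketch: enable sqlite foreign key enforcement on the connection doing the
# delete so the database-level CASCADE actually fires.
from sqlalchemy import text

def delete_paper(session, paperid: str) -> None:
    session.execute(text('PRAGMA foreign_keys=ON'))  # per-connection setting
    paper = session.get(PaperStatus, paperid)
    if paper is not None:
        session.delete(paper)   # rows referencing paperid cascade in the database
    session.commit()
```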

One advantage this gives us over the approach used in IACR/eprint is that tests can make use of sqlite, even if the production system eventually shifts to postgres, mysql, or mariadb.
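Concretely, only the engine URL should need to change between the two environments; a sketch, where the URLs and the TESTING switch are placeholders:

```python
# Sketch: the same models run against sqlite in tests and mariadb/mysql in
# production simply by swapping the engine URL.
import os
from sqlalchemy import create_engine

if os.environ.get('TESTING'):
    engine = create_engine('sqlite:///:memory:')          # fast, throwaway
else:
    # placeholder URL; real credentials would come from configuration
    engine = create_engine('mysql+pymysql://user:pass@localhost/latexsubmit')
```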

kmccurley commented 10 months ago

SQLAlchemy has been updated to 2.0. I'm still not convinced that it's worth keeping, because it often feels easier to just write SQL. There are two ways to use SQLAlchemy, namely the Core layer and the ORM layer. We are mostly using the ORM layer, which lets us manipulate objects instead of raw Result sets, and the two paradigms can be mixed (see the sketch below). Keeping SQLAlchemy makes it easy for us to write tests against sqlite but use mariadb in production.
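For the record, mixing the two layers just means executing a Core-style statement through an ORM Session; a sketch, using the model names from above:

```python
# Sketch: an ORM lookup and a Core-style aggregate executed through the
# same Session, against the same models.
from sqlalchemy import func, select
from sqlalchemy.orm import Session

def paper_summary(session: Session, paperid: str):
    paper = session.get(PaperStatus, paperid)            # ORM layer: an object
    count = session.execute(                             # Core-style statement
        select(func.count())
        .select_from(CompileRecord)
        .where(CompileRecord.paperid == paperid)
    ).scalar_one()
    return paper, count
```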