dvarrazzo / www.varrazzo.com

My personal website
https://www.varrazzo.com

blog/2020/07/25/surviving-django/ #12

Open utterances-bot opened 4 years ago

utterances-bot commented 4 years ago

Surviving Django (if you care about databases) — Daniele Varrazzo

https://www.varrazzo.com/blog/2020/07/25/surviving-django/

pauloxnet commented 4 years ago

Hi Daniele, very interesting article, especially because your point of view is different from mine. You know much better than me how to "use all the features" of PostgreSQL.

Speaking of database abstraction, I'm sure you "didn't need it", but I have to say that in my work experience a database change during an application's lifecycle is not that unusual (e.g. completing a SQLite-based Django prototype, or migrating a mature Django project based on MariaDB).

I think migrations are a good compromise to align database tables with Django models. I remember how complicated migrations were in the pre-South era.

In the same way that you don't like to write or edit a Django automatically generated migration (though I'm sure you are able to), e.g. to avoid data loss after a simple field rename, other people either don't want to or are not able to write the SQL code to perform a basic database migration. I think in both cases it's necessary to know Django or PostgreSQL well to complete the task quickly and effectively.

In conclusion, I agree with you that it's difficult to use all the PostgreSQL features relying only on the Django ORM or migrations, but thanks to the work of all the Django core contributors, release after release, new features are added to the PostgreSQL module of Django.

Anyway, thanks also to articles like yours, Django developers can become more aware of the limitations of this framework that "you love", and can then try to make it better.

Paolo

dvarrazzo commented 4 years ago

Thank you for your input @pauloxnet!

I'm very pleased that Django, at every release, allows more and more of the Postgres features to be used: it's a great progression, and it was good to grow out of a model where "the database is just a storage of objects". That model starts falling apart when things become less trivial (because of a large db, because of particular access needs, because the database is also used/designed outside the Django app...).

I'm also pleased that the workflows around a Django system are flexible enough to allow mixing and matching different approaches, according to the needs/expertise in a project. One of the things I care to point out is that you can use Django even on databases not explicitly designed for it or created by it; you can use Django even if you want to use the craziest index Postgres will come up with in the next release. The framework is flexible enough not to force you into a limited usage of the db, and I think that's a great strength of the project.

andrewgodwin commented 4 years ago

I agree that at some point in a project/company's life, if it's big enough, SQL migrations are the way to go (I built South, and then Django Migrations, to supplement several systems like that which I'd previously built). Migrations in the out-of-the-box state are mostly there to supplement rapid prototyping, and then like a lot of Django, can be removed/ignored progressively if and when you outgrow the single set of design constraints we had to choose for them.

That said, have you explored using the Django migrations engine to run SQL migrations? There's a built-in RunSQL operation that lets you just dump a big SQL file into a pretty simple migration skeleton, and it'll then take care of the state tracking, application before tests, cross-app dependency ordering (if you want it), and other related features that you'd have to write yourself if you author a patch_db.py script.
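The state tracking mentioned here is easy to underestimate. As an illustration of what a hand-rolled patch_db.py-style script ends up reimplementing, here is a minimal stdlib-only sketch (table and patch names are invented for the example; this is not Django's engine, which also handles dependency ordering and model state):

```python
import sqlite3

def apply_patches(conn, patches):
    """patches: list of (name, sql) pairs, applied in order, once each."""
    # Bookkeeping table: which patches have already been applied.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_patches (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM applied_patches")}
    for name, sql in patches:
        if name in applied:
            continue  # idempotency: never run a patch twice
        conn.executescript(sql)
        conn.execute("INSERT INTO applied_patches (name) VALUES (?)", (name,))
    conn.commit()

patches = [
    ("0001_create_foo", "CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT);"),
    ("0002_add_bar", "ALTER TABLE foo ADD COLUMN bar TEXT;"),
]
conn = sqlite3.connect(":memory:")
apply_patches(conn, patches)
apply_patches(conn, patches)  # safe to re-run: applied patches are skipped
```

Django's engine layers cross-app dependencies, reverse operations, and test-database setup on top of this same basic idea.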

Django migrations were never built for big companies to run fully automatically - my first advice on joining a place is generally to calm down on the use of makemigrations - but there's still an underlying execution engine that can handle SQL just as well as model operations.

dvarrazzo commented 4 years ago

Thank you @andrewgodwin: I like the idea of separating makemigrations from the underlying engine, I'll take a look at it.

Migrating models still seems, at best, an incomplete idea. I just discovered today that get_model() will return an object which can query the database, but it has no class attributes, and I assume no methods either. This is not exactly documented.

dvarrazzo commented 4 years ago

The article is receiving many comments on Hacker News; I'm pleased about the discussion. Many readers had a different experience than mine and have reasons to praise the Django migration system: good to hear the other side of the story too.

pauloxnet commented 4 years ago

I was interested in different points of view about your article, so I shared it on Hacker News and on the Django developers mailing list. I hope you don't mind.

dvarrazzo commented 4 years ago

> I hope you do not mind.

Not at all! The different points of view are interesting.

The most interesting use case is people developing a Django app to redistribute: for them a standard way to migrate the app schema is definitely a positive thing. I still think that those who have just a live system to develop and maintain tend to stick to a single database at a time (switching from one vendor to another, not supporting many at the same time), and that using SQLite as a baby database for testing doesn't really make for representative tests.

In Docker times it's not even that hard to have a self-contained Postgres or MySQL for the development of separate projects...

federicobond commented 4 years ago

> Django migrations were never built for big companies to run fully automatically - my first advice on joining a place is generally to calm down on the use of makemigrations

That's great advice @andrewgodwin! I would add that careful editing of previous migrations, when the underlying database fields would not be altered anyway, is important for good migration hygiene. Useless migrations can increase test startup time significantly.

> That's not useful at all for the database, but Django will create it for you and if you remove it, it will add it back.

Similarly, changing a choices list or a display label results in migrations with no database operation, only a Python state operation, and practically no SQL purpose.

This is perhaps out of an abundance of caution from Django over which arguments can trigger field modifications in the db. In these cases, I usually find the latest mention of the field within the app migration files and just edit it manually. I never had any problems with that approach, when done carefully.

I did exactly that with the new JSONField introduced in Django 3.1. After checking that the implementation was the same as the contrib.postgres one, I edited the migrations to make it look like the fields were defined as the new fields all along.

> It allows access to the state of the model at times intermediate between migrations, using get_model(appname, modelname) and with some Python code in the returned model. But if that code also happens to use any code inside your application, importing models with a normal Python import, things will crash because of a mismatch between model definitions and schema in the database.

I never had this problem using get_model, but the managers and querysets returned by get_model are of course limited to the built-in manager and queryset methods. Note that it would never be a good idea to directly import your current models in a migration. It's better to just copy&paste any code that is needed to run that particular migration.

> No change needed to the database, but Django insisted on creating migration consisting pretty much of [...] So goodbye to your data! But who needs a bar anyway. Although I think I could've used a bar, and a few drinks, if that migration had hit production.

I think this is another case that can be (in most cases) easily resolved by just editing a previous migration to make it seem like it was the new definition all along.

polyrand commented 4 years ago

Hi Daniele,

I have to say I really enjoyed your article. I'm still improving my SQL, but my current thinking about data schemas, tables and migrations is exactly what you have described here. I use FastAPI and have never used Django, so I can't give my opinion regarding that.

One approach I have seen recently that I want to explore is what Samuel Colvin has done here:

https://github.com/samuelcolvin/nosht/

There's a models.sql file with some parts separated by comments:

https://github.com/samuelcolvin/nosht/blob/c442a2895664ede086901cacd0c9b2d6a1e77610/py/shared/sql/models.sql#L219

And then there are a few functions to build the database:

https://github.com/samuelcolvin/nosht/blob/c442a2895664ede086901cacd0c9b2d6a1e77610/py/shared/db.py#L59

Or apply a patch defined in the models.sql above:

https://github.com/samuelcolvin/nosht/blob/c442a2895664ede086901cacd0c9b2d6a1e77610/py/shared/db.py#L163

https://github.com/samuelcolvin/nosht/blob/c442a2895664ede086901cacd0c9b2d6a1e77610/py/shared/db.py#L492

The workflow I want to tinker with is using raw SQL + data validation, for example with Pydantic. I have a gist with an example; it uses SQLAlchemy to build the query, but just for convenience. It still needs a lot of work:

https://gist.github.com/polyrand/b654a15f7986bcbcab53039e7eff1a78
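A minimal sketch of that raw-SQL-plus-validation idea, with a stdlib dataclass standing in for Pydantic and sqlite3 standing in for the real database (table, class, and field names are all invented for the example):

```python
import sqlite3
from dataclasses import dataclass

# Rows come back from plain SQL; a typed model validates them at the
# boundary. A dataclass stands in for Pydantic to stay stdlib-only.
@dataclass
class User:
    id: int
    email: str

    def __post_init__(self):
        # Crude stand-in for Pydantic's field validation.
        if "@" not in self.email:
            raise ValueError(f"invalid email: {self.email!r}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('alice@example.com')")

# Raw SQL query, validated into typed objects at the edge.
users = [User(*row) for row in conn.execute("SELECT id, email FROM users")]
```

With Pydantic the dataclass would become a BaseModel and the validation would be declarative, but the shape of the workflow is the same.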

igorgue commented 4 years ago

I use a tool called dbmate for this: it creates migrations in SQL, and even though the tool itself is database agnostic, each migration targets a single database. It just manages what's after the -- migrate:up and -- migrate:down comments and puts them in a timestamped file; that's more foolproof than diffing.
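dbmate itself is a Go binary, but the splitting it performs on those marker comments is simple enough to sketch (this is a toy illustration, not dbmate's actual code):

```python
import sqlite3

def split_migration(text):
    """Split a dbmate-style file into (up_sql, down_sql).

    Everything after a "-- migrate:up" marker belongs to the up script,
    everything after "-- migrate:down" to the down script.
    """
    up, down, target = [], [], None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "-- migrate:up":
            target = up
        elif stripped == "-- migrate:down":
            target = down
        elif target is not None:
            target.append(line)
    return "\n".join(up), "\n".join(down)

# A hypothetical timestamped migration file's contents:
migration = """\
-- migrate:up
CREATE TABLE widgets (id INTEGER PRIMARY KEY, name TEXT);
-- migrate:down
DROP TABLE widgets;
"""

up_sql, down_sql = split_migration(migration)
conn = sqlite3.connect(":memory:")
conn.executescript(up_sql)  # apply the up script
```

The appeal is that the migration file is just SQL: what runs in production is exactly what you wrote, with no diffing or schema introspection in between.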

dvarrazzo commented 4 years ago

Vsevolod Solovyov shared a similar experience and suggested Nomad, another migration tool proposing a less-is-more approach.

marcelofern commented 4 years ago

Hi there,

Thanks for writing this up, though I'm not convinced that this proposal is beneficial. Or am I too blind to see the benefits?

It is unclear to me, after reading your post, when someone should transition from Django migrations to a raw SQL approach. I.e., I don't want to maintain a migration apparatus of my own for a small project if I can just plug-and-play with the conventional approach. I am also not too sure, for a big project (however big that might be), whether this would be beneficial over time.

I've worked on projects where the whole db schema was created/updated from .sql files, and the same for stored procedures and such. The overall take-home for me is that those systems were much harder to maintain, and additionally it was complicated to train new people in our home-made intricacies. Ultimately our way of doing things wasn't beneficial, and downtime due to DB operations was common. Dealing with testing was also much more complicated, though I'm not going to hold a strong opinion on that.

I understand giving up on the traditional way if you're using a feature from a specific db that breaks django migrations (postgres schemas might, from memory), but I'm not sure I see the advantage otherwise.

Thanks,

tomage commented 4 years ago

Thanks for a thought-provoking article! Long-time Django (0.96, IIRC) user here, mostly been in the PSQL world the last few years. Not many people write against Django's core systems, so this is definitely a healthy read IMHO.

You mention the problem of adding a new field and how you think it can be done much more simply than what Django suggests. I'd like to suggest there's something else at work that can make Django's approach of separating the operations much more useful, and that your example doesn't really do justice to the difference between doing things the Django way vs. your way. You do mention this, but your actual SQL example doesn't really match the one from Django's documentation, which makes it much simpler than it should be.

So anyway: If one doesn't care about some downtime/errors during release, one can easily add a not-nullable field, and backfill it, in a single Django migration file. No problemo.

But, if you want zero-downtime deployments, you cannot add a NOT NULL field straight up. Theoretically, if you could deploy the code and update the schema at the exact same moment, sure. But there's always some time between code deployment and schema update. Let's take an example:

Let's say you have old code running that doesn't know about the myapp_mymodel.uuid field. Then you add the field, using the SQL transaction from above. The moment it executes on the DB, you've broken insertions of new records into the myapp_mymodel table. Then you push out the code changes, which fix the problem.

Similar problem ensues if you migrate after deploying the code. Only this time, the new code hits the servers before the column is there. Now pretty much any interaction with myapp_mymodel is broken, until the schema change has been executed.

Django's suggestion to break adding in the field into multiple migrations makes it easier to add a new field in a bulletproof manner. I actually like to lay it out in several steps:

  1. Run the "before" migration (adds in the field, but nullable)
  2. Backfill current records
  3. Deploy the new code. As records might have been added via the old code since step #2, the new code cannot be 100% dependent on the field being NOT NULL.
  4. Harden the field - by making it NOT NULL.
  5. If there were provisions in the new code to still allow for nullable values, one can now remove those provisions, and expect there never to be a NULL value in the column/field.

Following Django's migration documentation might make it easier to execute on a plan like the one above, if zero-downtime is your game.
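The schema-touching parts of the plan above can be sketched in plain SQL. The snippet below runs them against SQLite for convenience (SQLite cannot add NOT NULL to an existing column, so checking that no NULLs remain stands in for the hardening step; in Postgres you would run ALTER ... SET NOT NULL instead). Table and column names are invented for the example:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mymodel (id INTEGER PRIMARY KEY)")
for _ in range(3):  # some pre-existing rows
    conn.execute("INSERT INTO mymodel DEFAULT VALUES")

# Step 1: add the field nullable, so old code can keep inserting.
conn.execute("ALTER TABLE mymodel ADD COLUMN uuid TEXT")

# Step 2: backfill existing rows (in production you would batch this
# to avoid holding locks for the whole table at once).
rows = conn.execute("SELECT id FROM mymodel WHERE uuid IS NULL").fetchall()
for (pk,) in rows:
    conn.execute("UPDATE mymodel SET uuid = ? WHERE id = ?",
                 (str(uuid.uuid4()), pk))

# Step 4 stand-in: verify nothing is NULL before hardening the field
# (in Postgres: ALTER TABLE mymodel ALTER COLUMN uuid SET NOT NULL).
nulls = conn.execute(
    "SELECT count(*) FROM mymodel WHERE uuid IS NULL").fetchone()[0]
```

Step 3 (deploying tolerant code) and step 5 (removing the tolerance) are code releases, not schema changes, which is exactly why they interleave with the migrations rather than living inside them.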

dvarrazzo commented 4 years ago

Hello @marcelofern,

my impression of Django migrations is that they automate two simple use cases (ALTER TABLE ... ADD COLUMN and DROP COLUMN), leaving the users to sort out all the rest. I guess the benefit vs. the shortcomings of this approach depends on what "all the rest" means in your project.

I think there are two approaches to the development of a software system. One is "Django-centric": following the framework's path and using the database the way Django suggests; in this approach you may not feel a strong need to push the envelope of what you can do with a migration. A different approach, "db-centric", is to use all the features the database offers, as long as Django can coexist with them. A simple example: a table might be so big, and with such an access pattern, that you may want to partition it. For Django it would be just a simple table: partitions are effectively transparent. Django might not have (yet) a "partitioned table" feature, but that doesn't actively stop you from using it, if you introduce it in a way that doesn't disturb Django's model. So, two groups of people: the ones who say "Django doesn't have that feature, you cannot use it", and the ones who say "I know my database has that feature, I want to use it, because Django is an ORM + middleware + web framework and it shouldn't be bothered".

The two things that break the most, once you start using your database in a way Django doesn't expect, are data definition and change detection. Note that the ORM doesn't break: the rule of the game is "do whatever is legal as long as the ORM doesn't break", because past that point the advantages of Django start disappearing (no 3rd party libraries, no admin...) and you may as well use flask+jinja. I am not for writing all or most of the queries manually: the ORM is a godsend piece of software. So this article pretty much wants to say: "if you like Django's features you shouldn't be put off by whatever seems impossible to use within its model, and these are some of the things I routinely do to allow myself to tap into some extra feature I know my database has".

If you happen not to have a beloved database feature that seems out of reach by reading the Django docs, then of course it's hard to see much benefit in writing manually what Django can do automatically, and for your use case the migrations are probably good enough.

Another point I gather from the description of your system is that it sounds like it used stored procedures heavily, keeping a lot of logic in the database. That is effectively hard to maintain, or e.g. to have different code versions coexist: it is pretty much global code. So maybe some of the difficulty of that system wasn't writing a migration doing an ALTER TABLE ... ADD COLUMN, but trying to test and evolve a system that is pretty much a single code snapshot. That's not a thing I would advocate.

Thank you for sharing your experience!

dvarrazzo commented 4 years ago

Hi @tomage ,

I skimmed over that example because it seemed trivial to figure out what to do differently if someone wanted a gradual migration to a new non-null field. The article is already 4500 words long; it's a hefty read as it is...

Of course if you want an incremental migration you can do it in successive steps as well, whether you use Django migrations or any other way. For instance it might be:

  1. add the field, nullable
  2. deploy code which can coexist with the new field in null state but shouldn't produce any
  3. backfill the data
  4. introduce the constraint.

This scenario is indeed suggested by the Django docs, but then their tool creates a different migration, so it's one of those cases where you have the tool write some boilerplate but then you have to do some manual work, understanding what you are doing, to achieve the result you want. I made an example with three statements in a single patch: I thought a reader would figure out by themselves that they could write the same statements in two or three patches and do something else in between.

My example is simplistic in many other ways: if you really do BEGIN; ALTER TABLE ...; UPDATE ... you are taking an exclusive lock, so your application cannot write until all the records are updated; you definitely don't want that. On the other hand, Postgres can create constraints as NOT VALID and VALIDATE them later: a migration strategy might make use of that, but again, if you want to do it within the Django framework, you have to write your own RunSQL instead of changing the Python model that represents some of the details.
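The NOT VALID route could look like the following sequence of statements, written out here only as data to show the ordering (table, column, and constraint names are hypothetical; gen_random_uuid() needs Postgres 13+ or the pgcrypto extension; each statement should run in its own short transaction rather than one big BEGIN...COMMIT):

```python
# Lock-friendly Postgres sequence for introducing a non-null column:
# the CHECK constraint is added NOT VALID (no full-table scan under an
# exclusive lock), then validated afterwards under a weaker lock.
# All names are hypothetical; backfill should really happen in batches.
steps = [
    "ALTER TABLE myapp_mymodel ADD COLUMN uuid uuid",
    "UPDATE myapp_mymodel SET uuid = gen_random_uuid() WHERE uuid IS NULL",
    "ALTER TABLE myapp_mymodel ADD CONSTRAINT mymodel_uuid_not_null"
    " CHECK (uuid IS NOT NULL) NOT VALID",
    "ALTER TABLE myapp_mymodel VALIDATE CONSTRAINT mymodel_uuid_not_null",
]
```

Within Django, each of those statements would become the sql argument of a RunSQL operation in its own migration.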

karthicraghupathi commented 4 years ago

Agnostic is another database-agnostic migration tool which might be of interest to readers.