etianen / django-reversion

django-reversion is an extension to the Django web framework that provides version control for model instances.
https://django-reversion.readthedocs.io
BSD 3-Clause "New" or "Revised" License
3.03k stars 489 forks source link

[enhancement] Storing serialized_data in a JSONField instead of a TextField? #938

Closed scur-iolus closed 1 year ago

scur-iolus commented 1 year ago

Since version 3.1. (released in August 2020, almost 3 years ago), Django includes models.JSONField (and forms.JSONField) that can be used on all supported database backends. The model field supports the introspection, lookups, and transforms. Previously, this feature was only available to PostgreSQL users. This JSONField could allow us to store semi-structured data in the database.

For now, django-reversion uses a models.TextField to store serialized data. Wouldn't it be nice to replace that field by a JSONField when the data is stored as JSON, that is to say when version.format=="json"? Storing JSON encoded data instead of raw strings would enable us to benefit from JSON querying in Django ORM, which could pave the way for plenty of new use cases. For instance: filtering very efficiently a QuerySet of Version instances by the content of the underlying serialized data. That would be all more consistent that the vast majority of django-reversion users probably already stores serialized data as JSON (since this is the default serialization format).

Assuming the pull request #928 will be merged soon, this could come under a new setting in the DEFAULT dictionary, which could be named USE_JSONFIELD (or anything else more explicit). Of course, it would be False by default, for backward compatibility with previous versions (😅) of the extension.

I'm aware that such a feature would imply significant changes in django-reversion / could even lead to breaking changes / memory overhead, depending on the implementation. That would obviously not be an easy step. But in the long run, don't you think that the idea might be relevant?

etianen commented 1 year ago

I'm not sure how this could be implemented. A database field that is one type or another based on a setting is tricky, since the migrations would have to vary depending on that setting too! And if a user changed the setting after running the migrations, a new migration would have to be made, and where would Django store it?

I'm not sure Django really supports the idea of a database field that is dependent on a setting. I fear this would cause a lot of weird behavior for people.

filtering very efficiently a QuerySet of Version instances

Unfortunately, this would still likely involve a sequential table scan in postgres. It's likely that for large tables this would still be prohibitive.

scur-iolus commented 1 year ago

Thank you for your feedback (and for your work here).

I had the feeling that such a change could quickly become a burden in terms of development and maintenance cost. But I wanted to hear your perspectives on this idea. You're right that this is probably not feasible without deeply rethinking the extension / without significant breaking changes, which is not an option...

Unfortunately, this would still likely involve a sequential table scan in postgres. It's likely that for large tables this would still be prohibitive.

That being said, I'm not an expert on the subject but isn't it exactly the contrary? According to Django documentation:

PostgreSQL users PostgreSQL has two native JSON based data types: json and jsonb. The main difference between them is how they are stored and how they can be queried. PostgreSQL’s json field is stored as the original string representation of the JSON and must be decoded on the fly when queried based on keys. The jsonb field is stored based on the actual structure of the JSON which allows indexing. The trade-off is a small additional cost on writing to the jsonb field. JSONField uses jsonb.

As I understand it, that means PostgreSQL users (the vast majority of Django users) should be able to add indexes to JSONFields. Admittedly, this seems to be a recent & not commonly used feature in Django. So from a pure theoretical point of view, should you rewrite the extension from scratch for PostgreSQL users only & for users who would never use anything else than JSON to serialize data, don't you think the idea could make sense? (Yeah, that makes quite a few conditions)

etianen commented 1 year ago

Rewriting this project to only support PostgreSQL and JSON serialization would allow indexing on the versioned data, yes.

The loss of general usefulness to support a single use-case seems like a bad tradeoff to me.

If efficient introspection of the versioned data is required, I'd suggest going one step further and having a dedicated version table for each model. That would avoid any JSON overhead whatsoever.