OpenSlides / openslides-media-service

Media service for OpenSlides 3+

Deduplication of media files #46

Open jsangmeister opened 2 years ago

jsangmeister commented 2 years ago

To reduce the database size, we could save the hash of each file together with its content. When a new file is uploaded, its hash is calculated first and we check whether it is already present in the database. If it is, the existing content is linked to the newly uploaded id instead of being stored a second time. This requires a new 1:n table which links ids to their content, and that in turn requires a migration which I'm not sure is possible with the current setup...
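For illustration, a minimal sketch of how the hash lookup and the 1:n link table could fit together; all names here (mediafile, mediafile_content, content_hash) are made up for the sketch, and a plain psycopg2 connection is assumed rather than whatever the service actually uses:

```python
import hashlib

# Hypothetical schema: content is stored once per hash, ids only reference it.
SCHEMA = """
CREATE TABLE IF NOT EXISTS mediafile_content (
    hash TEXT PRIMARY KEY,  -- e.g. hex-encoded SHA-256 of the blob
    data BYTEA NOT NULL
);
CREATE TABLE IF NOT EXISTS mediafile (
    id INTEGER PRIMARY KEY,
    content_hash TEXT NOT NULL REFERENCES mediafile_content (hash)
);
"""

def save_file(conn, file_id: int, blob: bytes) -> None:
    """Store a file, reusing existing content if its hash is already known.

    conn is assumed to be a psycopg2 connection.
    """
    digest = hashlib.sha256(blob).hexdigest()
    with conn.cursor() as cur:
        # Insert the content only if no row with this hash exists yet.
        cur.execute(
            "INSERT INTO mediafile_content (hash, data) VALUES (%s, %s) "
            "ON CONFLICT (hash) DO NOTHING",
            (digest, blob),
        )
        # Link the uploaded id to the (possibly pre-existing) content row.
        cur.execute(
            "INSERT INTO mediafile (id, content_hash) VALUES (%s, %s)",
            (file_id, digest),
        )
    conn.commit()
```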

jsangmeister commented 2 years ago

Regarding the migrations: if we implement this, we should make it at least a little future-proof. The simplest solution would be to just append statements to the schema.sql, which is fine for schema migrations, but data migrations are not possible that way, and we need a data migration to initially create and populate the "metadata"/"link" table. There are probably frameworks out there which do exactly this, but I'm no expert on the topic. An initial short search of course yielded results like SQLAlchemy and other full-blown SQL frameworks, which might be overkill but could maybe be used for only this functionality. As for migration-only projects, the only thing I found was yoyo, and I'm not sure we should use it, since it seems small and not very regularly maintained. The other alternative would of course be to write the migration framework from scratch, which is probably feasible since we do not need many features, but would probably still be overkill, especially since we currently only need it for this feature, which does not have high priority.
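For reference, if I understand yoyo correctly, a migration there is just a Python file exporting a list of steps (and a step can also take a Python callable, which is what a data migration would need); a sketch with an illustrative table name:

```python
# migrations/0001_add_content_table.py -- illustrative yoyo migration
from yoyo import step

steps = [
    step(
        # apply
        "CREATE TABLE mediafile_content ("
        "hash TEXT PRIMARY KEY, data BYTEA NOT NULL)",
        # rollback
        "DROP TABLE mediafile_content",
    ),
]
```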

These are my thoughts so far. Maybe someone else has more experience regarding SQL/Postgres/Python migrations (@peb-adr @gsiv @r-peschke)?

gsiv commented 2 years ago

> The other alternative would of course be to write the migration framework from scratch, which is probably feasible since we do not need many features, but would probably still be overkill, especially since we currently only need it for this feature, which does not have high priority.

I agree that we shouldn't need a third-party framework for this, but I also don't think that writing a simple mechanism ourselves would be overkill, even if we aim to support (data) migrations from any given previous version, as we should. I expect that we'll eventually need this for the other services as well.

To achieve this, wouldn't it suffice to attach a version to the schema and apply the migrations, i.e., numbered SQL scripts, as necessary during the container's start-up routine?
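Something like this minimal start-up routine, say; the schema_version table and the file-naming scheme are invented for the sketch, and psycopg2 is again assumed:

```python
import os
import re

def apply_migrations(conn, directory: str = "migrations") -> None:
    """Apply all numbered SQL scripts (e.g. 0002_add_hashes.sql) whose
    number is higher than the version currently stored in the database."""
    with conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)"
        )
        cur.execute("SELECT COALESCE(MAX(version), 0) FROM schema_version")
        current = cur.fetchone()[0]
        for name in sorted(os.listdir(directory)):
            match = re.match(r"(\d+)_.*\.sql$", name)
            if not match or int(match.group(1)) <= current:
                continue  # not a migration file, or already applied
            with open(os.path.join(directory, name)) as f:
                cur.execute(f.read())  # psycopg2 accepts multiple statements
            cur.execute(
                "INSERT INTO schema_version (version) VALUES (%s)",
                (int(match.group(1)),),
            )
    conn.commit()
```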

The only snag is going to be, as always, the coordination between scaled services. Unlike with the backend service, we should plan for a locking mechanism from the start this time but that, too, should be attainable. Remember that it needs to work with pgbouncer's transaction pooling mode though.
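One option that should survive transaction pooling is a transaction-level advisory lock: unlike a session-level pg_advisory_lock, pg_advisory_xact_lock is released automatically when the transaction ends, so it never depends on session state that pgbouncer might hand to another client. A sketch (lock key and helper name made up):

```python
MIGRATION_LOCK_ID = 4711  # arbitrary application-defined advisory lock key

def migrate_exclusively(conn) -> None:
    """Serialize migrations across scaled instances. The transaction-level
    advisory lock lives on one server connection for exactly one
    transaction, so it is safe under pgbouncer's transaction pooling."""
    with conn:  # one transaction: take the lock, migrate, commit
        with conn.cursor() as cur:
            # Blocks until no other instance holds the lock; auto-released
            # when the transaction commits or rolls back.
            cur.execute("SELECT pg_advisory_xact_lock(%s)", (MIGRATION_LOCK_ID,))
            run_pending_migrations(cur)  # hypothetical helper doing the work
```

Each instance would run this at start-up: whichever one gets the lock first migrates, the others block on the lock and then find nothing left to do, provided the version check makes the migrations idempotent.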