galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org
Other
1.4k stars 1.01k forks source link

Allow purging users, jobs completely from the DB #13738

Open bgruening opened 2 years ago

bgruening commented 2 years ago

Currently, it is not possible to purge users with all their associated jobs, HDAs etc from the database. To be GDPR compliant we overwrite the entries with nonsense, but they are still in the DB. Recently we have also accumulated huge amounts of entries in Galaxy with projects like COVID-19. Those bot accounts have 20+ more million jobs and we would like to remove this in a few years as well. We will soon see the first Galaxy servers with a PG database > 1TB. In the long run, this is not sustainable and it's completely fine to remove users/data that are older than X (10?) years etc.

Another aspect is that with scalable and reproducible history export we can export and import histories very easily and users should use long-term archives like Zenodo/Invenio to store histories and their metadata. We could imagine exporting a history from a user account and completely purging all associated entries, if requested, or if an admin wants to.

This feature can live outside Galaxy if needed and an admin cleanup script.

Drawbacks:

hexylena commented 2 years ago

This will interfere with a few statistics that we collect over time.

for the main metadata, you'll have "snapshots" (maybe these should be somewhere more durable), users at date X or so (and you'll be able to check things like select max(id) from job to figure out some of the counts.

And for stats like job runtimes, maybe better they expire!

jdavcs commented 2 years ago

Removing a user record from the database is difficult. There are 53 tables that have foreign keys referencing the galaxy_user table. Deleting a row from that table would require handling all those potentially non-empty relationships. Given that the primary goal is to recover space, we would like to delete related rows, not set their user_id to null. This can be done with automated cascading deletes or by having a script walk (a subset of) the relationship graph.

We can set cascading in the model on the foreign key constraint + the relationship attribute. Assuming we want children to be deleted, what will happen is when a parent is deleted, the session will issue delete statements to the database for all children (that are loaded into the current session), then the parent itself, after which the database will take over and delete all children records. However:

  1. We have 297 foreign keys and over 400 relationships specified in the galaxy model. Cascading is setup on 3 foreign keys and on 9 relationships. "More" work needs to be done.

  2. Not all should be deletes. We have a dense relationship graph; figuring out the right on-delete setting may be difficult.

  3. Not all related records can be deleted. Here's a trivial example: user creates history, history gets tagged with a new tag, creating a tag and a history_tag_association record. We delete the user, cascade-deleting the associated history, which cascade-deletes the history_tag_association. The newly created tag remains.

  4. ON DELETE CASCADE may be dangerous: it's too easy to delete the wrong thing with all of its related records (and their related records, etc.). I suspect that with a sufficiently old/active user account, it would be next to impossible to determine what got deleted.

  5. Setting to null instead of deleting. One potential issue may be different behavior of inner joins: unlike an outer join, joining on a record that is null will not return a row. (I don't have an example from our model and I don't know if we even have such cases, but I think we'll need to check for such things).

I may be overthinking this. However, if this indeed is difficult, one option is to treat compliance and freeing resources as separate issues. Then, I think, overwriting a user record with nonsence is sufficient if that keeps us in compliance with regulation. As for freeing resources, maybe we don't need to delete the user record at all, but instead delete the space-hogging related records (starting with job and friends), which, also, don't have 53 foreign keys referencing them.

And to do that, we could address this step-by-step, setting up proper and correct cascade settings, with careful and comprehensive testing. Thus, we'd be auto-deleting stuff in a more deliberate manner and on a much smaller scale, compared to deleting a user and "their extended family". Selectively deleting would be an option too.

bernt-matthias commented 1 year ago

Xref https://github.com/galaxyproject/galaxy/issues/930