natefoo opened 6 years ago
Please allow me to add three more points (see also: https://docs.google.com/presentation/d/1l4DD0IaJjuvk1zAT1Sjv26bLyrSOg3VUm7rD-TQl_Zs which is currently linked from https://galaxyproject.github.io/training-material/topics/admin/tutorials/database-schema/tutorial.html ):
Jobs in the 'running' state: On very rare occasions, we have jobs which are not running anymore (the process has been killed) but are still labelled as "running" on the "Manage jobs" page in the admin section. Using the "Stop Jobs" option does not work to remove such "runaway jobs". => change the job's "state" to 'error'
Table galaxy_user (case 1): After switching from external authentication originally handled by Apache to the (new) Galaxy built-in support for connecting to LDAP, you need to set external = 'f' for all existing users.
Table galaxy_user (case 2): When using external authentication, the provided e-mail address of a user might change, and the user can end up with two accounts. Try to catch this issue as early as possible: delete the new user's row in the table and update the 'email' of the old account to the new address.
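For case 1, a hedged sketch of the corresponding query (wrap it in a transaction and verify the affected row count before committing; `galaxy_user.external` is a boolean column, so `'f'` means false):

```sql
BEGIN;
-- mark all existing accounts as no longer externally authenticated
UPDATE galaxy_user SET external = 'f';
-- verify the reported row count matches your user count, then:
COMMIT;
```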
Thanks for these additions @hrhotz!
I think we can probably be a bit more aggressive about the jobs in new state -- in fact jobs tracked in-memory (but not jobs tracked in the db) already do the following:
I think killing those jobs would be a benefit for users wondering why their jobs are not running (xref #6022) and it would be less burden on admins as well.
I've published https://github.com/usegalaxy-eu/gxadmin to help with this.
We have a bunch of SQL queries we had been running and we collected them in a single script for convenience, and finally got around to sharing it! @natefoo has added a couple queries as well. If anyone else has some queries they run regularly, please include them.
Some of the parts are a bit EU specific but we're working to decrease that.
I specifically did not include the query for fixing "jobs in new state" because that should be fixed in galaxy now, thanks @mvdbeek!
Edit: much of this is now in gxadmin
Original Post: There are a few manual maintenance steps I occasionally take with the usegalaxy.org database that are extremely quick to do but that would take a lot more time to implement properly in the code. So I figured it would be good to document them and to collect any such practices from other Galaxy admins. The information in this issue applies to PostgreSQL only.
Beware
Make sure your Galaxy DB is backed up before making any changes. You should be doing this anyway. usegalaxy.org does so with Continuous Archiving and Point-in-Time Recovery (PITR) using scripts in the galaxyproject.postgresql Ansible role. The usage of this role is in the usegalaxy-playbook (specifically, here are the vars).
Always do things in transactions. This protects you from accidentally executing `UPDATE`s and `DELETE`s without conditions (assuming you notice that you did it and `ROLLBACK` the transaction). To actually commit the changes in the transaction, you would use `COMMIT` instead of `ROLLBACK`. This is done with:
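A minimal sketch of the pattern (the `UPDATE` here is only a placeholder to illustrate the flow):

```sql
BEGIN;
-- ...the change you actually intend, e.g.:
UPDATE job SET state = 'error' WHERE id = 12345;
-- inspect the reported row count, then:
ROLLBACK;  -- or COMMIT; to keep the changes
```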
Jobs in the 'new' state
Over time, you will collect jobs in the `new` state that will never run - there are multiple normal things that can cause this, some of which may have changed by now, but these have included: deleting a history with a running workflow, not resuming a paused workflow, handler problems, jobs waiting on failed metadata to be fixed, users never returning to Galaxy, etc.

Eventually these jobs will pile up and, because the job handlers query the `new` state to see what jobs are available to run, this can slow down that query. As a result, I occasionally change the state of old jobs manually with a query such as:
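The exact query wasn't preserved here; a hedged sketch of the idea, using the `new_manually_dropped` state mentioned below and an illustrative three-month cutoff:

```sql
BEGIN;
UPDATE job
   SET state = 'new_manually_dropped'
 WHERE state = 'new'
   AND update_time < now() - interval '3 months';
-- sanity-check the row count, then COMMIT; (or ROLLBACK;)
```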
After making changes to a lot of rows in a table, use PostgreSQL's `ANALYZE` to re-evaluate the table contents for the query planner (this analysis affects whether PostgreSQL uses an index or a table scan to complete a query):
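For the `job` table, for example:

```sql
ANALYZE job;
```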
You can add `VERBOSE` after `ANALYZE` for verbose output.

However, the best solution to this (and one that is PostgreSQL-specific) is to create a partial index on `state = 'new'`:
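A sketch of such an index (the index name is arbitrary; `job.state` is the column the job handlers query):

```sql
CREATE INDEX job_state_new_idx ON job (state) WHERE state = 'new';
```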
Note that `CREATE INDEX` locks the table for updates, so it will cause your Galaxy server to hang on any job-related requests/actions. It's best to stop the job handlers while creating this index; your users will not be able to submit jobs during this period either. You can also create indexes concurrently, but there are caveats to be aware of if doing this.
There is probably a ratio of `new` state to non-`new` state rows above which even the partial index will fail to perform well (or the planner will choose to perform a table scan), so occasionally changing `new` state jobs to e.g. `new_manually_dropped` is still worthwhile.

pgcleanup.py
If you use pgcleanup.py, it logs these events to `cleanup_event_*` tables, which you may occasionally want to clean. You can do so with a `DELETE FROM` query similar to the `UPDATE` above, e.g.:
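A hedged sketch; the one-year cutoff is illustrative, and the exact `cleanup_event_*` table names (and whether each has a `create_time` column) should be checked against your schema first:

```sql
BEGIN;
-- drop old cleanup log entries
DELETE FROM cleanup_event WHERE create_time < now() - interval '1 year';
-- repeat for the remaining cleanup_event_* tables, then:
COMMIT;
```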
Note that disk space is not recovered in this process. For that you must perform a `VACUUM FULL`, explained in the next section.

Deleting data
Generally, it's not possible to delete data from the Galaxy database due to the complex relationships that are maintained between users, jobs, datasets, etc., and the sharing of datasets. However, there are a few cases where it is possible. For example, we did so with the Tool Shed's database to remove old rows from `galaxy_session` because, previously, sessions were being created for every request that did not have an existing session cookie (search engine spiders, Galaxy's previously-on-by-default "check for updates" on startup). However, you should not do this without a very strong understanding of the Galaxy data model and of what errors you will encounter by deleting rows.

That said, if you have deleted rows in the database and need to reclaim the space from them, you need to perform a full vacuum on the affected tables. You should also do an `ANALYZE` at this time, for the reasons explained above. You can perform these steps with, e.g.:
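For example, using `galaxy_session` from the example above as the table:

```sql
VACUUM FULL ANALYZE galaxy_session;
```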
You can again add `VERBOSE` after `FULL` for verbose output.

Other things
I'll add more as I think of them.