
[Admin Info] Large Galaxy instances and database performance #5944

Open natefoo opened 6 years ago

natefoo commented 6 years ago

Edit: much of this is now in gxadmin

Original Post: There are a few manual maintenance steps I occasionally take with the usegalaxy.org database that are extremely quick to do but would take a lot more time to implement properly in the code. So I figured it would be good to document them here and to collect any such practices from other Galaxy admins. The information in this issue applies to PostgreSQL only.

Beware

Make sure your Galaxy DB is backed up before making any changes. You should be doing this anyway. usegalaxy.org does so with Continuous Archiving and Point-in-Time Recovery (PITR) using scripts in the galaxyproject.postgresql Ansible role. The usage of this role can be seen in the usegalaxy-playbook (specifically, in its vars).
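
For context only (this is not taken from the role above), continuous archiving on recent PostgreSQL versions is generally enabled with settings along these lines in postgresql.conf; the archive path here is a placeholder, and the Ansible role manages these details for you:

# sketch only: placeholder archive path, adjust for your environment
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f'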

Always do things in transactions. This protects you from accidentally executing UPDATEs and DELETEs without conditions (assuming you notice that you did it and ROLLBACK the transaction). This is done with:

db=> BEGIN;
BEGIN
db=> UPDATE job SET state = 'bork';
UPDATE 19165533
db=> ROLLBACK;
ROLLBACK
db=> SELECT count(*) FROM job WHERE state = 'bork';
 count 
-------
     0
(1 row)

To actually commit the changes in the transaction, you would use COMMIT instead of ROLLBACK.

Jobs in the 'new' state

Over time, you will collect jobs in the new state that will never run. There are multiple normal causes for this, some of which may have changed by now, but they have included: deleting a history with a running workflow, not resuming a paused workflow, handler problems, jobs waiting on failed metadata to be fixed, users never returning to Galaxy, etc.

Eventually these jobs will pile up and, because the job handlers query for jobs in the new state to see what is available to run, this can slow down that query. As a result, I occasionally change the state of old jobs manually with a query such as:

UPDATE job
SET state = 'new_manually_dropped'
WHERE state = 'new'
  AND update_time < now() - interval '3 months'

After making changes to a lot of rows in a table, use PostgreSQL's ANALYZE to re-evaluate the table statistics for the planner (these statistics affect whether PostgreSQL uses an index or a table scan to complete a query):

VACUUM ANALYZE job

You can add VERBOSE for verbose output (e.g. VACUUM VERBOSE ANALYZE job).

However, the best solution to this (and one that is PostgreSQL-specific) is to create a partial index on state = 'new':

CREATE INDEX ix_partial_job_state_new ON job (state)
WHERE state = 'new'

Note that CREATE INDEX locks the table against writes, so it will cause your Galaxy server to hang on any job-related requests/actions. It's best to stop handlers while creating this index, and your users will not be able to submit jobs during this period either. You can also create indexes concurrently, but there are caveats to be aware of if doing this.
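
For reference, a concurrent variant would look something like the sketch below. CREATE INDEX CONCURRENTLY cannot run inside a transaction block, takes longer than a normal CREATE INDEX, and if it fails it leaves behind an invalid index that must be dropped before retrying:

CREATE INDEX CONCURRENTLY ix_partial_job_state_new ON job (state)
WHERE state = 'new'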

There is probably a ratio of new state to non-new state rows above which even the partial index will fail to perform well (or the planner will choose to perform a table scan), so occasionally changing new state jobs to e.g. new_manually_dropped is still worthwhile.
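
To gauge whether this is needed, a quick look at how many jobs sit in each state (just a sketch) is:

SELECT state, count(*)
FROM job
GROUP BY state
ORDER BY count(*) DESC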

pgcleanup.py

If you use pgcleanup.py, it logs the cleanup actions it performs to the cleanup_event_* tables, which you may occasionally want to clean out themselves.

You can do so with a DELETE FROM query similar to the UPDATE above, e.g.:

DELETE
FROM cleanup_event_dataset_association
WHERE create_time < now() - interval '3 months'

Note that disk space is not recovered in this process. For that you must perform a VACUUM FULL, explained in the next section.
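
To see how much space a table (including its indexes) currently occupies before and after vacuuming, PostgreSQL's built-in size functions can be used, e.g. (a sketch):

SELECT pg_size_pretty(pg_total_relation_size('cleanup_event_dataset_association'))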

Deleting data

Generally, it's not possible to delete data from the Galaxy database due to the complex relationships maintained between users, jobs, datasets, etc., and the sharing of datasets. There are, however, a few cases where it is possible. For example, we did so with the Tool Shed's database to remove old rows from galaxy_session, because previously they were being created for every request that did not have an existing session cookie (search engine spiders, Galaxy's previously-on-by-default "check for updates" on startup). You should not do this without a very strong understanding of the Galaxy data model and of what errors you will encounter by deleting rows.

That said, if you have deleted rows in the database and need to reclaim the space from them, you need to perform a full vacuum on the affected tables. You should also do an ANALYZE at this time, for the reasons explained above. You can perform both steps with, e.g.:

VACUUM FULL ANALYZE galaxy_session

You can again add VERBOSE after FULL for verbose output.

Other things

I'll add more as I think of them.

hrhotz commented 6 years ago

Please allow me to add three more points (see also: https://docs.google.com/presentation/d/1l4DD0IaJjuvk1zAT1Sjv26bLyrSOg3VUm7rD-TQl_Zs which is currently linked from https://galaxyproject.github.io/training-material/topics/admin/tutorials/database-schema/tutorial.html ):

natefoo commented 6 years ago

Thanks for these additions @hrhotz!

mvdbeek commented 6 years ago

I think we can probably be a bit more aggressive about the jobs in new state -- in fact jobs tracked in-memory (but not jobs tracked in the db) already do the following:

I think killing those jobs would be a benefit for users wondering why their jobs are not running (xref #6022), and it would be less of a burden on admins as well.

hexylena commented 5 years ago

I've published https://github.com/usegalaxy-eu/gxadmin to help with this.

We have a bunch of SQL queries we had been running and we collected them in a single script for convenience, and finally got around to sharing it! @natefoo has added a couple queries as well. If anyone else has some queries they run regularly, please include them.

Some of the parts are a bit EU-specific, but we're working to reduce that.

I specifically did not include the query for fixing "jobs in new state" because that should be fixed in galaxy now, thanks @mvdbeek!