galaxyproject / galaxy-helm

Minimal setup required to run Galaxy under Kubernetes
MIT License

Database default space allowance gets filled fairly quickly #401

Open pcm32 opened 1 year ago

pcm32 commented 1 year ago

After only 10 executions or so of our single cell pipeline, the database disk got full:

root@galaxy-galaxy-dev-postgres-0:/home/postgres/pgdata# df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        9.8G  9.7G  129M  99% /home/postgres/pgdata

Is there any process in place that cleans the database with some regularity? Or should we suggest a higher default for the postgres disk space? I have been cleaning up disk space regularly, but I suspect that the database records of jobs are not getting cleaned up as part of this process. Do none of the maintenance jobs take care of this?

pcm32 commented 1 year ago

I have run:

galaxy@galaxy-dev-job-0-6c8f594ff5-cwzd9:/galaxy/server$ bash scripts/maintenance.sh --no-dry-run --days 1

inside the job container, but I get this failure:

galaxy@galaxy-dev-job-0-6c8f594ff5-cwzd9:/galaxy/server$ bash scripts/maintenance.sh
Unsetting $PYTHONPATH
Activating virtualenv at .venv

Dry run: false
Days: 1

Will run following commands and output in maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --delete_userless_histories >> maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --purge_histories >> maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --purge_datasets >> maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --purge_folders >> maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --delete_datasets >> maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --purge_datasets >> maintenance.log
Traceback (most recent call last):
  File "/galaxy/server/scripts/cleanup_datasets/cleanup_datasets.py", line 702, in <module>
    main()
  File "/galaxy/server/scripts/cleanup_datasets/cleanup_datasets.py", line 212, in main
    delete_datasets(app, cutoff_time, args.remove_from_disk, info_only=args.info_only, force_retry=args.force_retry)
  File "/galaxy/server/scripts/cleanup_datasets/cleanup_datasets.py", line 383, in delete_datasets
    (app.model.Dataset.table.c.id, app.model.Dataset.table.c.state),
AttributeError: type object 'Dataset' has no attribute 'table'

it seems to happen on:

python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --delete_datasets >> maintenance.log

It does return exit code 1 though, so I would guess that the maintenance jobs should be detecting this error?
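On the exit-code point: if the wrapper ran each step with `set -e` in effect (or checked `$?` after every command), a broken step like the one above would fail the whole run instead of being silently skipped. A minimal sketch of that pattern (the real `maintenance.sh` may be structured differently; the `true`/`false` commands stand in for actual `cleanup_datasets.py` invocations):

```shell
#!/usr/bin/env bash
# Fail-fast sketch: run each cleanup command and abort the batch on the
# first non-zero exit, so a Kubernetes CronJob would see the pod fail.

run_step() {
    # Append each step's output to the log; the exit code passes through.
    "$@" >> maintenance.log
}

(
    set -e          # stop at the first failing step
    run_step true   # stand-in for a cleanup_datasets.py call that works
    run_step false  # stand-in for the failing --delete_datasets step
    echo "never reached"
)
exit_code=$?
echo "maintenance batch exited with status ${exit_code}"
```

With this shape, the batch stops at the failing step and the non-zero status is available for the job wrapper to report.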

pcm32 commented 1 year ago

However, even after running the steps subsequent to the failing one (plus all those that run before it), the disk usage in the database is more or less the same...

pcm32 commented 1 year ago

Main culprit seems to be:

 table_schema |             table_name              | total_size | data_size | external_size
--------------+-------------------------------------+------------+-----------+---------------
 public       | history_dataset_association_history | 1193 MB    | 936 kB    | 1192 MB
 public       | history_dataset_association         | 394 MB     | 5168 kB   | 389 MB
 public       | galaxy_session                      | 2712 kB    | 1448 kB   | 1264 kB
 public       | job                                 | 1544 kB    | 920 kB    | 624 kB
 public       | tool_shed_repository                | 1432 kB    | 1056 kB   | 376 kB
 public       | job_parameter                       | 1336 kB    | 936 kB    | 400 kB
 public       | job_state_history                   | 944 kB     | 640 kB    | 304 kB
 public       | dataset                             | 760 kB     | 256 kB    | 504 kB
 public       | dataset_collection_element          | 576 kB     | 192 kB    | 384 kB
 public       | job_to_input_dataset                | 480 kB     | 216 kB    | 264 kB
(10 rows)

Interestingly, the data size is very small; maybe there is some postgres purge or something that is not happening?

OK, apparently in postgres speak, the external size means the size taken up by external indices, references, etc.
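Worth noting: if that external size is mostly dead index and TOAST pages left behind by deletes, deleting rows alone won't shrink the files on disk; Postgres only returns space to the OS after a table rewrite. Something along these lines should reclaim it, at the cost of an ACCESS EXCLUSIVE lock while each table is rewritten (the pod name is taken from the df output above; the database name `galaxy` is an assumption):

```shell
# Rewrite the two bloated tables and their indexes. VACUUM FULL locks the
# table exclusively, so run this while Galaxy is quiet or scaled down.
kubectl exec -it galaxy-galaxy-dev-postgres-0 -- \
  psql -U postgres -d galaxy \
  -c 'VACUUM (FULL, VERBOSE) history_dataset_association_history;' \
  -c 'VACUUM (FULL, VERBOSE) history_dataset_association;'
```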

nuwang commented 1 year ago

Try setting: .Values.postgresql.persistence.size. If I remember right, the operator will attempt to resize the disk. If that doesn't happen, you might have to resize manually.
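For reference, a resize attempt via that value would look something like this (the release name `galaxy-dev` and the chart reference are assumptions for this deployment; pick a size with headroom):

```shell
# Grow the postgres PVC through the chart value, keeping all other
# values as deployed. Release and chart names are assumptions.
helm upgrade galaxy-dev galaxyproject/galaxy \
  --reuse-values \
  --set postgresql.persistence.size=30Gi
```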

Regarding the maintenance failure: the maintenance cron job doesn't run that maintenance script, it only runs a job for cleaning up the tmpdir. But I think we should include this script as well. I also think that the silent job failure should be reported on Galaxy; it's probable that it's affecting a lot of people.

pcm32 commented 1 year ago

But I think we should include this script as well.

Agreed, will look to where it should go.

I think that the silent job failure should be reported on Galaxy, it's probable that it's affecting a lot of people.

Sure, will do!

pcm32 commented 1 year ago

So, for the record here: changing .Values.postgresql.persistence.size meant that the operator attempted a live resize of the disk, which corrupted it :-( and meant that I had to redo the deployment (no issue since this is a development setup for testing). It is probably better to do that manually (scaling the setup down first to make sure the resize is not done "hot"). Maybe we should have a small section in the readme about re-sizing disks.
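For anyone hitting the same thing, a manual resize would look roughly like this. All resource names here are guesses derived from the pod name above, the storage class must have allowVolumeExpansion enabled, and if the operator manages the statefulset it may need to be paused first:

```shell
# Scale postgres down so the volume isn't resized while in use ("hot"),
# then grow the PVC and bring the database back up. Names are assumptions.
kubectl scale statefulset galaxy-galaxy-dev-postgres --replicas=0
kubectl patch pvc pgdata-galaxy-galaxy-dev-postgres-0 \
  --patch '{"spec": {"resources": {"requests": {"storage": "30Gi"}}}}'
kubectl scale statefulset galaxy-galaxy-dev-postgres --replicas=1
```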

pcm32 commented 1 year ago

Also, I was told (by Nicola, I think) that scripts/maintenance.sh --no-dry-run doesn't actually attempt to delete anything in the database, so there must be another mechanism.