pcm32 opened this issue 1 year ago
I have run:
galaxy@galaxy-dev-job-0-6c8f594ff5-cwzd9:/galaxy/server$ bash scripts/maintenance.sh --no-dry-run --days 1
inside the job container, but I get this failure:
galaxy@galaxy-dev-job-0-6c8f594ff5-cwzd9:/galaxy/server$ bash scripts/maintenance.sh
Unsetting $PYTHONPATH
Activating virtualenv at .venv
Dry run: false
Days: 1
Will run following commands and output in maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --delete_userless_histories >> maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --purge_histories >> maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --purge_datasets >> maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --purge_folders >> maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --delete_datasets >> maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --purge_datasets >> maintenance.log
Traceback (most recent call last):
File "/galaxy/server/scripts/cleanup_datasets/cleanup_datasets.py", line 702, in <module>
main()
File "/galaxy/server/scripts/cleanup_datasets/cleanup_datasets.py", line 212, in main
delete_datasets(app, cutoff_time, args.remove_from_disk, info_only=args.info_only, force_retry=args.force_retry)
File "/galaxy/server/scripts/cleanup_datasets/cleanup_datasets.py", line 383, in delete_datasets
(app.model.Dataset.table.c.id, app.model.Dataset.table.c.state),
AttributeError: type object 'Dataset' has no attribute 'table'
it seems to happen on:
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --delete_datasets >> maintenance.log
It does return exit code 1 though, so I would guess that the maintenance jobs should be detecting this error?
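For anyone else debugging this, a minimal sketch of how to re-run just the failing step outside the wrapper, using the same paths as in the output above:

    cd /galaxy/server
    . .venv/bin/activate    # same virtualenv the wrapper activates
    python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml \
        -d 1 -r --delete_datasets
    echo "exit status: $?"  # prints 1 when the step fails as above

Note that the wrapper only redirects stdout (>> maintenance.log), so the traceback goes to stderr and never ends up in the log, which makes the failure easy to miss when the script runs unattended. The error itself looks like the script still expects the old classical-mapping attribute (Dataset.table), whereas SQLAlchemy declarative models expose the underlying Table as Dataset.__table__, so presumably the script just needs updating upstream.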
However, even after running the step subsequent to the failing one (plus all the others that run before it), the disk usage in the database is more or less the same...
Main culprit seems to be:
 table_schema | table_name                          | total_size | data_size | external_size
--------------+-------------------------------------+------------+-----------+---------------
 public       | history_dataset_association_history | 1193 MB    | 936 kB    | 1192 MB
 public       | history_dataset_association         | 394 MB     | 5168 kB   | 389 MB
 public       | galaxy_session                      | 2712 kB    | 1448 kB   | 1264 kB
 public       | job                                 | 1544 kB    | 920 kB    | 624 kB
 public       | tool_shed_repository                | 1432 kB    | 1056 kB   | 376 kB
 public       | job_parameter                       | 1336 kB    | 936 kB    | 400 kB
 public       | job_state_history                   | 944 kB     | 640 kB    | 304 kB
 public       | dataset                             | 760 kB     | 256 kB    | 504 kB
 public       | dataset_collection_element          | 576 kB     | 192 kB    | 384 kB
 public       | job_to_input_dataset                | 480 kB     | 216 kB    | 264 kB
(10 rows)
Interestingly, the data size is very small; maybe there is some Postgres purge or something that is not happening?
OK, apparently in Postgres speak, "external size" means the size of everything outside the table's main data: indexes, TOAST storage, etc.
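For reference, the numbers above can be reproduced with a query along these lines (a sketch only; the database name galaxy and the connection details are assumptions), which also makes explicit that external_size is simply total minus main table data, i.e. indexes plus TOAST:

    psql -d galaxy -c "
      SELECT n.nspname AS table_schema,
             c.relname AS table_name,
             pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
             pg_size_pretty(pg_relation_size(c.oid)) AS data_size,
             pg_size_pretty(pg_total_relation_size(c.oid) - pg_relation_size(c.oid)) AS external_size
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
      WHERE n.nspname = 'public' AND c.relkind = 'r'
      ORDER BY pg_total_relation_size(c.oid) DESC
      LIMIT 10;"

Also worth noting: even when rows do get purged, Postgres only hands the space back to the filesystem after a VACUUM FULL (or an equivalent table rewrite), so a successful cleanup run will not necessarily shrink these numbers straight away.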
Try setting .Values.postgresql.persistence.size. If I remember right, the operator will attempt to resize the disk. If that doesn't happen, you might have to resize manually.
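Something like this, if it helps (a sketch only; "galaxy" and <chart-reference> below are placeholders for the actual release and chart names, and 30Gi is just an example value):

    # Placeholder release name ("galaxy") and chart reference; only the value path
    # postgresql.persistence.size comes from the discussion above.
    helm upgrade galaxy <chart-reference> \
        --reuse-values \
        --set postgresql.persistence.size=30Gi

Whether the PVC actually grows afterwards depends on the storage class having allowVolumeExpansion enabled; otherwise you end up doing the manual resize mentioned above.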
Regarding the maintenance failure: the maintenance cron job doesn't run that maintenance script, it only runs a job for cleaning up the tmpdir. But I think we should include this script as well. I think that the silent job failure should be reported on Galaxy; it's probable that it's affecting a lot of people.
But I think we should include this script as well.
Agreed, I will look into where it should go.
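Just to sketch the shape of what a scheduled run could look like (illustrative only; a 7-day window is an arbitrary example, and the real change would presumably live in the chart's cron setup rather than a plain crontab):

    # Illustrative command for a scheduled maintenance run; note 2>&1 so that
    # failures like the traceback above end up in the log as well.
    cd /galaxy/server && bash scripts/maintenance.sh --no-dry-run --days 7 \
        >> /galaxy/server/maintenance-cron.log 2>&1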
I think that the silent job failure should be reported on Galaxy; it's probable that it's affecting a lot of people.
Sure, will do!
So, for the record here: changing .Values.postgresql.persistence.size meant that the operator attempted a live resize of the disk, which corrupted it :-( and meant that I had to redo the deployment (no big deal, since this is a development setup for testing). It is probably better to do that resize manually, scaling the setup down first so that it is not done "hot". Maybe we should have a small section in the README about resizing disks.
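For anyone needing to do the manual, "cold" resize, roughly this, assuming the storage class allows volume expansion (the statefulset and PVC names below are placeholders):

    # Scale Postgres down first so the resize is not attempted on a hot volume,
    # then grow the PVC, then scale back up. Resource names are placeholders.
    kubectl scale statefulset <postgres-statefulset> --replicas=0
    kubectl patch pvc <postgres-pvc> \
        -p '{"spec":{"resources":{"requests":{"storage":"30Gi"}}}}'
    kubectl scale statefulset <postgres-statefulset> --replicas=1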
Also, I was told (by Nicola, I think) that scripts/maintenance.sh --no-dry-run doesn't actually attempt to delete anything in the database, so there must be another mechanism.
After only 10 executions or so of our single cell pipeline, the database disk got full:
Is there any process in place that cleans up the database with some regularity? Or should we suggest a higher default for the Postgres disk size? I have been cleaning up disk space regularly, but I suspect that the logs of jobs stored in the database are not getting cleaned up as part of this process. Do none of the maintenance jobs take care of this?
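In the meantime, a cheap way to keep an eye on how fast the database grows between pipeline runs (the database name is an assumption again):

    # Overall database size, plus the table that dominated the listing above.
    psql -d galaxy -c "SELECT pg_size_pretty(pg_database_size(current_database()));"
    psql -d galaxy -c "SELECT pg_size_pretty(pg_total_relation_size('history_dataset_association_history'));"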