Sage-Bionetworks-Challenges / Multi-seq-Data-Analysis-DREAM-Challenge-Infra

Apache License 2.0

--defaultDisk seems not to be working? #4

Closed rrchai closed 2 years ago

rrchai commented 2 years ago

Currently, run_docker.cwl passes two outputs to the other steps:

  1. the input files (multiple downsampled data files in CSV format), with a total size of ~800 MB;
  2. prediction.tar.gz (a compressed archive containing multiple predicted data files in CSV format), with a total size of ~50 MB.

*The input files are used not only by the submitted model, but also by both the validation and scoring steps.

To achieve this, my current implementation: 1) copies the input files from the submission's Docker container to a folder in the working directory:

https://github.com/Sage-Bionetworks-Challenges/Single-cell-RNA-seq-and-ATAC-seq-Data-Analysis-DREAM-Challenge-Infra/blob/0b68f8595d066682afe5726315ee1d8213c6c81f/run_docker.py#L154-L156

2) outputs them as an array of files:

https://github.com/Sage-Bionetworks-Challenges/Single-cell-RNA-seq-and-ATAC-seq-Data-Analysis-DREAM-Challenge-Infra/blob/0b68f8595d066682afe5726315ee1d8213c6c81f/run_docker.cwl#L65-L69
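Step 1 can be sketched roughly as below. This is a minimal illustration, not the actual run_docker.py code; the function name `collect_inputs` and the directory layout are hypothetical.

```python
# Hypothetical sketch of copying the downsampled CSV inputs out of the
# submission container's output directory into the working directory.
import glob
import shutil
from pathlib import Path

def collect_inputs(container_output_dir: str, workdir: str) -> list:
    """Copy each downsampled CSV into <workdir>/inputs and return the copies."""
    dest = Path(workdir) / "inputs"
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for csv in glob.glob(f"{container_output_dir}/*.csv"):
        # Note: each copy duplicates the file on disk, which is exactly
        # what inflates the job's disk usage described below.
        copied.append(shutil.copy(csv, dest))
    return copied
```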

However, I think this also doubles the size of the files, which eventually exceeds the maximum disk size of 1G, and I get the error below:

STDERR:   2022-03-23T14:26:24.839184622Z | Got exception '[Errno 28] No space left on   device:   '/var/lib/docker/volumes/workflow_orchestrator_shared/_data/78d9db5a-acc3-4bc6-bb68-6c40ab3f9c21/node-2ed4926a-612e-4a10-9bc3-144c45840618-1cc402dd0e11d5ae18db04a6de87223d/tmp7k239do2/8a212801-d026-4f5f-867e-2ae1851614de/tiat73dyz/tmp-outqudh1fui/predictions.tar.gz'   ->   '/var/lib/docker/volumes/workflow_orchestrator_shared/_data/78d9db5a-acc3-4bc6-bb68-6c40ab3f9c21/tmpqzhme832/files/for-job/kind-file_var_lib_docker_volumes_workflow_orchestrator_shared__data_78d9db5a-acc3-4bc6-bb68-6c40ab3f9c21_Single-cell-RNA-seq-and-ATAC-seq-Data-Analysis-DREAM-Challenge-Infra-develop_run_docker.cwl/instance-j85j0m8b/file-0j1iwqxk/predictions.tar.gz.5449db72-5ac2-4985-b7c6-6861896e5b7a.tmp.gz''   while copying 'file:///var/lib/docker/volumes/workflow_orchestrator_shared/_data/78d9db5a-acc3-4bc6-bb68-6c40ab3f9c21/node-2ed4926a-612e-4a10-9bc3-144c45840618-1cc402dd0e11d5ae18db04a6de87223d/tmp7k239do2/8a212801-d026-4f5f-867e-2ae1851614de/tiat73dyz/tmp-outqudh1fui/predictions.tar.gz'
-- | --
STDERR:   2022-03-23T14:26:24.839189632Z | ERROR:cwltool:Got exception '[Errno 28] No space left on device:   '/var/lib/docker/volumes/workflow_orchestrator_shared/_data/78d9db5a-acc3-4bc6-bb68-6c40ab3f9c21/node-2ed4926a-612e-4a10-9bc3-144c45840618-1cc402dd0e11d5ae18db04a6de87223d/tmp7k239do2/8a212801-d026-4f5f-867e-2ae1851614de/tiat73dyz/tmp-outqudh1fui/predictions.tar.gz'   ->   '/var/lib/docker/volumes/workflow_orchestrator_shared/_data/78d9db5a-acc3-4bc6-bb68-6c40ab3f9c21/tmpqzhme832/files/for-job/kind-file_var_lib_docker_volumes_workflow_orchestrator_shared__data_78d9db5a-acc3-4bc6-bb68-6c40ab3f9c21_Single-cell-RNA-seq-and-ATAC-seq-Data-Analysis-DREAM-Challenge-Infra-develop_run_docker.cwl/instance-j85j0m8b/file-0j1iwqxk/predictions.tar.gz.5449db72-5ac2-4985-b7c6-6861896e5b7a.tmp.gz''   while copying   'file:///var/lib/docker/volumes/workflow_orchestrator_shared/_data/78d9db5a-acc3-4bc6-bb68-6c40ab3f9c21/node-2ed4926a-612e-4a10-9bc3-144c45840618-1cc402dd0e11d5ae18db04a6de87223d/tmp7k239do2/8a212801-d026-4f5f-867e-2ae1851614de/tiat73dyz/tmp-outqudh1fui/predictions.tar.gz'
STDERR:   2022-03-23T14:26:24.839194662Z | WARNING:toil.fileStores.abstractFileStore:LOG-TO-MASTER: Job used more   disk than requested. Consider modifying the user script to avoid the chance   of failure due to incorrectly requested resources. Job   files/for-job/kind-CWLWorkflow/instance-nvxv8_q6/cleanup/file-gyl1_48w/stream   used 170.14% (1.7 GB [1828532224B] used, 1.0 GB [1074741824B] requested) at   the end of its run.
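The numbers in the warning line up with the doubling hypothesis: ~800 MB of inputs copied once plus the ~50 MB predictions archive (and its staged copy) easily overshoot a 1 GiB request. A quick check of the reported figures:

```python
# Sanity check of the figures in the Toil warning above.
used_bytes = 1828532224       # "1.7 GB [1828532224B] used"
requested_bytes = 1074741824  # "1.0 GB [1074741824B] requested" (the 1G default)
ratio = used_bytes / requested_bytes
print(f"{ratio:.2%}")  # → 170.14%, matching the percentage Toil reports
```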

My initial thought was to increase the requested size by changing --defaultDisk in TOIL_CLI_OPTIONS. However, it does not work; if I set --defaultDisk 10G:

STDERR: 2022-03-23T15:36:12.105495147Z     raise   InsufficientSystemResources('disk', disk, self.maxDisk, 
STDERR:   2022-03-23T15:36:12.105497768Z   toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job   CWLWorkflow is requesting 2147483648 bytes of disk, more than the maximum of   796889088 bytes of disk that SingleMachineBatchSystem was configured with.   Scale is set to 1.
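The error is a unit mismatch in scale, not in kind: size strings use binary suffixes, so even a modest request dwarfs the ~760 MiB maximum the batch system detected. A minimal sketch of this parsing (Toil's own parser may differ in detail):

```python
# Minimal sketch of binary-suffix size parsing; illustrates why the
# 2147483648-byte request in the error exceeds the 796889088-byte
# (~760 MiB) maximum that SingleMachineBatchSystem was configured with.
def to_bytes(size: str) -> int:
    units = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    suffix = size[-1].upper()
    if suffix in units:
        return int(float(size[:-1]) * units[suffix])
    return int(size)

requested = to_bytes("2G")  # 2147483648 bytes, as in the error message
max_disk = 796889088        # detected maximum from the error message
print(requested > max_disk)  # → True: the request cannot be satisfied
```

This suggests the batch system's maximum is capped by the machine's actual free disk, which is why raising --defaultDisk alone cannot help until the instance volume itself is larger.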

I also tried adding --maxDisk 10G, but the error and the reported disk maximum did not change. How do I properly increase the requested size?


rrchai commented 2 years ago

@vpchung Is there anything else I should change to increase the disk size?

rrchai commented 2 years ago

Try increasing the instance volume first (e.g., to 50G).

rrchai commented 2 years ago

Setting --defaultDisk now works, but setting --maxDisk still does not. Closing, since a larger --defaultDisk solves the issue for this challenge.