TheJacksonLaboratory / splicing-pipelines-nf

Repository for the Anczukow-Lab splicing pipeline

Failed to publish file: java.io.IOException: No space left on device #3 #183

Closed · cgpu closed this issue 4 years ago

cgpu commented 4 years ago

Problem

This job, https://cloudos.lifebit.ai/app/jobs/5f248d56a79ea301123a7bc7 in the jax-anczukow-lab CloudOS workspace, failed solely because it ran out of disk space on the device.

Solution

For the testing, temporarily increase the disk space set in the conf/google.config file.

Implementation

How will we implement the solution?

1. Diagnose. The error occurs while the result files from the working directories are being published to the folder named results. As a conservative proxy for how much disk space you need, you can inspect the storage size accumulated in the working directory.

NOTE: The working directory is defined with -w in the nextflow run command. The working directory is retained and accessible ONLY when the job has been launched as Resumable.
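For reference, a resumable launch might look like the following; the script name, profile, and bucket path are placeholders for illustration, not the actual values from this job:

# hypothetical launch command, showing where -w appears
nextflow run main.nf -profile google -w gs://my-bucket/work -resume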

In CloudOS, you can find the nextflow run command in the first line of the Nextflow log file, which can be accessed by clicking on view log.
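If you have saved that log locally, one quick way to pull the work path out of it (a sketch; the local file name nextflow.log is an assumption):

# print the -w argument from the first line of the log
head -n 1 nextflow.log | grep -o -e '-w gs://[^ ]*'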

Now, to diagnose how much storage space the workdir occupies, grab the gs:// work path from the first line of the log (it is at the end of the line, following -w) and use gsutil to summarize the storage that all files within the work folder occupy:

# gsutil docs: https://cloud.google.com/storage/docs/gsutil/commands/du
# -s, summarize
# -h, human readable
# -sh, summarized and human readable
gsutil du -sh gs://...
# try a small subdirectory first, it takes a few seconds
gsutil du -sh gs://cloudosinputdata/deploit/teams/5ec2c818d663c5c2cd3bd991/users/5ec3dbf417954701035a2b7f/projects/5f248c84a79ea301123a76ae/jobs/5f248d56a79ea301123a7bc7/work/ff/b62ce6e6ce00f4ac843aa3598a02a0/logs/
# now try the real thing! (might take > 10 min because it's several terabytes)
gsutil du -sh gs://cloudosinputdata/deploit/teams/5ec2c818d663c5c2cd3bd991/users/5ec3dbf417954701035a2b7f/projects/5f248c84a79ea301123a76ae/jobs/5f248d56a79ea301123a7bc7/work
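If the total is large, a per-directory breakdown can show which task directories dominate. gsutil accepts wildcards, so a sketch along these lines should work (plain byte counts with sort -n, since gsutil's human-readable sizes don't sort cleanly):

# one summary line per task hash prefix, smallest to largest
gsutil du -s gs://cloudosinputdata/deploit/teams/5ec2c818d663c5c2cd3bd991/users/5ec3dbf417954701035a2b7f/projects/5f248c84a79ea301123a76ae/jobs/5f248d56a79ea301123a7bc7/work/* | sort -n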

2. Attempt fix

Update this line in conf/executors/google.config https://github.com/lifebit-ai/splicing-pipelines-nf/blob/2866b361b5d8cd5f54bbf6c3846aa667607b77cb/conf/executors/google.config#L2

- lifeSciences.bootDiskSize = 4000.GB 
+ lifeSciences.bootDiskSize = _000.GB 

Depending on how many terabytes the command in step 1 reports, adjust the lifeSciences.bootDiskSize variable value accordingly (the _ above is a placeholder for the new number).
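For instance, with hypothetical numbers (not measured values from this job): if step 1 reported roughly 1.5 TB in the work directory, rounding up for some slack might give:

lifeSciences.bootDiskSize = 2000.GB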

cgpu commented 4 years ago

@lmurba @angarb fyi, we might need help executing some commands like du to collect info on the average space per process type for the Sumner runs, but we will explicitly ask when needed and provide implementation details.
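As a sketch of what that could look like on Sumner (the work path is a placeholder, and GNU coreutils are assumed):

# summarize each hash-prefix directory under the Nextflow work dir, largest first
du -sh /path/to/work/*/ | sort -rh | head -n 20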

cgpu commented 4 years ago

@lmurba has provided info on this, pasting below:

# Laura's data
4.4M    ./00/
1.8G    ./01/
4.3M    ./03/
523M    ./04/
45G     ./05/
23G     ./07/
26G     ./08/
28G     ./0c/
21G     ./10/
30M     ./11/
528M    ./13/
1.3G    ./15/
526M    ./16/
3.1G    ./17/
4.3M    ./18/
530M    ./19/
4.3M    ./1b/
24G     ./1d/
19G     ./20/
531M    ./23/
4.3M    ./26/
18G     ./27/
4.4M    ./29/
1.3G    ./2d/
4.3M    ./2f/
511M    ./30/
1.2G    ./34/
40G     ./37/
1.1G    ./38/
31G     ./3b/
4.3M    ./3c/
19G     ./3f/
530M    ./40/
23M     ./42/
26G     ./45/
1.1G    ./46/
4.3M    ./4a/
32G     ./4c/
27G     ./52/
528M    ./56/
54G     ./58/
21G     ./59/
531M    ./5d/
4.3M    ./5e/
19G     ./5f/
4.3M    ./62/
4.3M    ./63/
1.4G    ./65/
54G     ./66/
21G     ./67/
4.4M    ./6e/
4.3M    ./6f/
27G     ./79/
4.3M    ./7b/
20G     ./7e/
13M     ./7f/
23G     ./80/
4.3M    ./83/
1.1G    ./84/
30G     ./85/
20G     ./88/
517M    ./8b/
24G     ./8d/
29G     ./8e/
4.4M    ./91/
523M    ./97/
530M    ./9c/
1.1G    ./9d/
28G     ./9e/
4.4M    ./a3/
4.3M    ./a7/
23G     ./a8/
21G     ./aa/
4.4M    ./ab/
4.3M    ./ae/
537M    ./b0/
518M    ./b1/
526M    ./b3/
4.3M    ./b5/
22G     ./b7/
532M    ./ba/
21G     ./bc/
4.3M    ./bf/
1.2G    ./c0/
29G     ./c5/
19G     ./c8/
4.3M    ./cb/
4.3M    ./cc/
526M    ./ce/
4.3M    ./d1/
33G     ./d2/
30G     ./d7/
33G     ./e1/
18G     ./e6/
28G     ./e7/
4.3M    ./e8/
25G     ./e9/
27G     ./ea/
19G     ./ed/
20G     ./ef/
8.6M    ./f6/
1.4G    ./f8/
4.3M    ./f9/
22G     ./fa/
26G     ./fd/
1.1G    ./fe/

# TCGA data

1.4G    ./0d/
8.5G    ./15/
656M    ./17/
4.3M    ./2b/
10M     ./2f/
43G     ./34/
17G     ./36/
4.3M    ./52/
390M    ./56/
4.3M    ./5a/
13G     ./69/
29G     ./7d/
24G     ./8a/
8.5G    ./91/
16G     ./95/
474M    ./a5/
369M    ./d0/
4.3M    ./d4/
26G     ./d8/
4.3M    ./da/
5.9G    ./df/
6.6M    ./e1/
422M    ./f5/
4.3M    ./f6/
9.3G    ./fa/
12G     ./fb/

Attachments: list_of_all_files_3_tcga_samples.txt, list_of_all_files_24_laura_samples.txt
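To turn listings like these into a single total, a quick sketch (assuming GNU du, so the sizes come out in exact bytes rather than the human-readable form shown above):

# sum every task directory and print the total in GB
du -s --block-size=1 ./*/ | awk '{ sum += $1 } END { printf "%.1f GB total\n", sum / 1e9 }'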

cgpu commented 4 years ago

So, for allocating disk size, this is a bit tricky, as there are 3 places to set it:

Both lifeSciences.bootDiskSize and a per-process setting such as:

process {
    withName: 'star' {
        disk = "350 GB"
    }
}

will be used NOT for the master node but for the Google Life Sciences machines.

So the master node's disk must be set via the CloudOS GUI, and everything in the config addresses only the worker nodes.

Example: say star normally takes up 5 GB, for the sake of the example. To accommodate that plus some extra slack, do:

google {
    // boot disk attached to each worker VM
    lifeSciences.bootDiskSize = 1.GB
}

+

process {
    withName: 'star' {
        // additional disk requested for the task running the star process
        disk = "4 GB"
    }
}

Here the 1 GB boot disk plus the 4 GB process disk together add up to roughly the 5 GB the example assumes.

@sk-sahu @lmurba fyi, this is what we are doing for this.