DataBiosphere / analysis_pipeline_WDL

Collection of WDL workflows based off the University of Washington TOPMed DCC Best Practices for GWAS. The WDL structure was based upon CWLs written by the Seven Bridges development team.

The Twice-Localized Workaround #2

Open aofarrel opened 3 years ago

aofarrel commented 3 years ago

In brief

Some tasks currently have a workaround wherein some input files are copied over twice. As a result, these tasks need up to twice as much disk space as they otherwise would. The disk size estimates for these tasks already account for this, so users are unlikely to hit an out-of-disk error. It does still slightly increase costs, so it's worth making note of.
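As a rough illustration of what "accounting for this" means, here is a minimal sketch of a doubled disk estimate. The helper name `estimate_disk_gb` and the idea of a fixed buffer are assumptions for the example. The workflows themselves compute their estimates in WDL, not with this function.

```shell
# Hypothetical helper (not from the workflow): estimate disk, in whole GB,
# for a task that localizes its inputs and then copies each one once more.
# Usage: estimate_disk_gb BUFFER_GB FILE...
estimate_disk_gb() {
    buffer_gb=$1
    shift
    total_bytes=0
    for f in "$@"; do
        # wc -c reports the file size in bytes
        size=$(wc -c < "$f")
        total_bytes=$((total_bytes + size))
    done
    gb=1073741824
    # Inputs exist twice on disk, so double them before rounding up to whole GB
    doubled=$((total_bytes * 2))
    echo $(( (doubled + gb - 1) / gb + buffer_gb ))
}
```

The doubling is the only part specific to this workaround. A task that can symlink its inputs would drop the factor of two.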

Example

The first task of the vcf-to-gds workflow generates GDS files in a scattered task. The second task, which is not scattered, takes in those files as inputs to give them unique variant IDs.

When a scattered task passes its outputs to a non-scattered task, each shard's outputs are localized into their own folder. Let's say my scattered task runs on five vcf files, generating five gds files. My second task is passed those gds files like this:

[Screenshot 2021-04-09 at 3:34:00 PM]

That is to say, each gds file now lives in its own folder within /inputs/.

This is problematic with how the R scripts use configuration files. These configuration files expect one line to represent a given pattern for an input file, such as

gds_file '1KG_phase3_subset_chr .gds'

where the space is filled in with expected chromosome numbers by the script itself at runtime.
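To make the pattern concrete, here is a small sketch of the substitution. It is for illustration only: the real substitution happens inside the R scripts, and `fill_chromosome` is a hypothetical name.

```shell
# Illustration only: the real substitution is done by the R scripts.
# fill_chromosome PATTERN CHR replaces the first space in PATTERN with CHR.
# Assumes PATTERN contains a space.
fill_chromosome() {
    pattern=$1
    chr=$2
    prefix=${pattern%% *}   # text before the first space
    suffix=${pattern#* }    # text after the first space
    printf '%s%s%s\n' "$prefix" "$chr" "$suffix"
}

# Print the filename the pattern resolves to for a few chromosomes
for chr in 1 2 3 20 X; do
    fill_chromosome '1KG_phase3_subset_chr .gds' "$chr"
done
```

Because there is only one pattern line, every resolved file has to share the same directory prefix.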

We have two options when referring to files like these in a configuration file: either we pass in the full path, or just the filename. If we pass in the full path, the resulting configuration file will be invalid, because each gds file lives in a separate folder and so no single pattern can match every path. If we pass in just the filename, the configuration file will technically be valid, but the script will fail because the files do not actually exist in the working directory; they are in subfolders of /inputs/.
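One way to check whether the full-path option could ever work is to count how many distinct parent directories the inputs have. The helper below is a hypothetical sketch, not something the workflow runs:

```shell
# Hypothetical check: how many distinct parent directories do the given
# paths have? A single config pattern with a full path can only work if
# the answer is 1.
count_parent_dirs() {
    for f in "$@"; do
        dirname "$f"
    done | sort -u | wc -l | tr -d ' '
}
```

For scattered outputs localized as described above, this prints the number of shards rather than 1, which is why the full-path approach fails.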

However, if we copy or symlink each of those input files into the working directory, we can use the filename method, because now files are actually where the R script expects them.

BASH_FILES=(~{sep=" " gdss})
for BASH_FILE in "${BASH_FILES[@]}"
do
    # link each input into the working directory under its own filename
    ln -s "${BASH_FILE}" .
done

Where gdss is the array of input files from the previous scattered task.

However, this approach is non-functional on Terra -- a permission denied error is thrown. There are three root causes:

  1. Terra does not give root permissions when executing workflows
  2. Cromwell tends to give localized input files rw-r--r-- permissions
  3. The R script in question opens the file for writing, using openfn.gds() with readonly=FALSE
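A quick diagnostic along these lines can show which of these conditions holds in a given environment. This is a sketch with a hypothetical function name, not output from Terra:

```shell
# Hypothetical diagnostic: report the current user id and whether each
# given file can be opened for writing by that user.
report_write_access() {
    echo "running as uid $(id -u)"
    for f in "$@"; do
        # [ -w ] follows symlinks, so a link to a read-only file
        # is reported as not writable too
        if [ -w "$f" ]; then
            echo "writable: $f"
        else
            echo "read-only for this user: $f"
        fi
    done
}
```

Note that root (uid 0) usually passes the writability test regardless of the permission bits, which fits with the problem showing up only where the task does not run as root.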

Neither chmod nor mv is allowed on Terra in this context either, so we need to duplicate the files to get copies for which we have write permissions.

BASH_FILES=(~{sep=" " gdss})
for BASH_FILE in "${BASH_FILES[@]}"
do
    # copy rather than link, so the task owns a writable copy
    cp "${BASH_FILE}" .
done

Other workflows are able to use symlinks as they only open the inputs as read-only.
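The choice between the two loops can be summed up in one helper. This is a hedged sketch: `stage_inputs` and its `link`/`copy` modes are names made up for the example and do not exist in the workflows.

```shell
# Hypothetical helper, not part of the workflows: stage each input into the
# current directory so that a filename-only config pattern can find it.
# MODE is "link" (consumer opens files read-only) or "copy" (consumer
# opens files for writing, so it needs a copy it owns).
stage_inputs() {
    mode=$1
    shift
    for f in "$@"; do
        case "$mode" in
            link) ln -s "$f" . ;;
            copy) cp "$f" . ;;
            *) echo "unknown mode: $mode" >&2; return 1 ;;
        esac
    done
}
```

Inside a WDL command block this would be called as something like `stage_inputs copy ~{sep=" " gdss}`. Only the `copy` mode costs the extra disk space described at the top of this issue.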

aofarrel commented 3 years ago

The LD Pruning workflow's merge_gds task used to use this workaround for the exact same reason as vcf-to-gds: there's an input array of files from a previous scattered task, and the config file only supports them if every file in that array is in the same directory. However, it turns out its R scripts open the files in read-only mode, so symlinks will suffice.

aofarrel commented 3 years ago

All this time I was assuming that the issues I had with softlinks were unavoidable, both because vcf-to-gds did not work with them and because so many people say "softlinks don't exist in Google Cloud Storage." I've since updated this issue and older comments to explain that the problem actually comes down to permissions.

aofarrel commented 3 years ago

local

in:

  ls -lha .

out:

total 28K
drwxr-xrwx 9 topmed topmed 288 Jul 26 14:15 .
drwxr-xrwx 5 topmed topmed 160 Jul 26 14:15 ..
-rw-r--r-- 1 topmed topmed 6.3K Jul 26 14:15 script
-rw-r--r-- 1 topmed topmed 461 Jul 26 14:15 script.background
-rw-r--r-- 1 topmed topmed 499 Jul 26 14:15 script.submit
-rw-r--r-- 1 topmed topmed 12 Jul 26 14:15 stderr
-rw-r--r-- 1 topmed topmed 12 Jul 26 14:15 stderr.background
-rw-r--r-- 1 topmed topmed 0 Jul 26 14:15 stdout
-rw-r--r-- 1 topmed topmed 6 Jul 26 14:15 stdout.background

in:

  ls -lha ../inputs

out:

total 0
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 -1179415630
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 -478351052
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 .
drwxr-xrwx 5 topmed topmed 160 Jul 26 14:15 ..
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 1318600307
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 2019664885
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 222713526

in:

  ls -lha ../inputs/*

out:

../inputs/-1179415630:
total 72K
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 .
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 ..
-rw-r--r-- 2 topmed topmed 71K Jul 26 14:15 1KG_phase3_subset_chr1.gds

../inputs/-478351052:
total 72K
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 .
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 ..
-rw-r--r-- 2 topmed topmed 72K Jul 26 14:14 1KG_phase3_subset_chr3.gds

../inputs/1318600307:
total 72K
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 .
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 ..
-rw-r--r-- 2 topmed topmed 72K Jul 26 14:15 1KG_phase3_subset_chr2.gds

../inputs/2019664885:
total 76K
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 .
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 ..
-rw-r--r-- 2 topmed topmed 73K Jul 26 14:15 1KG_phase3_subset_chr20.gds

../inputs/222713526:
total 60K
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 .
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 ..
-rw-r--r-- 2 topmed topmed 58K Jul 26 14:14 1KG_phase3_subset_chrX.gds

After the creation of symbolic links, the execution directory has:

total 32K
drwxr-xrwx 14 topmed topmed 448 Jul 26 14:22 .
drwxr-xrwx 5 topmed topmed 160 Jul 26 14:22 ..
lrwxr-xr-x 1 topmed topmed 123 Jul 26 14:22 1KG_phase3_subset_chr1.gds -> /bark-bark/vcftogds/46c1a6d5-73da-4a34-befd-6cee3cf19b05/call-unique_variant_id/inputs/872935859/1KG_phase3_subset_chr1.gds
lrwxr-xr-x 1 topmed topmed 124 Jul 26 14:22 1KG_phase3_subset_chr2.gds -> /bark-bark/vcftogds/46c1a6d5-73da-4a34-befd-6cee3cf19b05/call-unique_variant_id/inputs/-924015500/1KG_phase3_subset_chr2.gds
lrwxr-xr-x 1 topmed topmed 125 Jul 26 14:22 1KG_phase3_subset_chr20.gds -> /bark-bark/vcftogds/46c1a6d5-73da-4a34-befd-6cee3cf19b05/call-unique_variant_id/inputs/-222950922/1KG_phase3_subset_chr20.gds
lrwxr-xr-x 1 topmed topmed 124 Jul 26 14:22 1KG_phase3_subset_chr3.gds -> /bark-bark/vcftogds/46c1a6d5-73da-4a34-befd-6cee3cf19b05/call-unique_variant_id/inputs/1574000437/1KG_phase3_subset_chr3.gds
lrwxr-xr-x 1 topmed topmed 125 Jul 26 14:22 1KG_phase3_subset_chrX.gds -> /bark-bark/vcftogds/46c1a6d5-73da-4a34-befd-6cee3cf19b05/call-unique_variant_id/inputs/-2019902281/1KG_phase3_subset_chrX.gds
-rw-r--r-- 1 topmed topmed 6.3K Jul 26 14:22 script
-rw-r--r-- 1 topmed topmed 461 Jul 26 14:22 script.background
-rw-r--r-- 1 topmed topmed 499 Jul 26 14:22 script.submit
-rw-r--r-- 1 topmed topmed 1.7K Jul 26 14:22 stderr
-rw-r--r-- 1 topmed topmed 12 Jul 26 14:22 stderr.background
-rw-r--r-- 1 topmed topmed 1.9K Jul 26 14:22 stdout
-rw-r--r-- 1 topmed topmed 6 Jul 26 14:22 stdout.background

aofarrel commented 3 years ago

On Google, ls -lha ../* finds:

ls -lha ../bin ../boot ../cromwell_root ../dev ../etc ../google ../home ../lib ../lib64 ../media ../mnt ../opt ../proc ../root ../run ../sbin ../srv ../sys ../tmp ../usr ../var

Within ../cromwell_root:

drwxrwxrwx 5 root   root   4.0K Jul 26 14:53 .
drwxr-xr-x 1 root   root   4.0K Jul 26 14:53 ..
-rw-r--r-- 1 root   root   2.0K Jul 26 14:53 gcs_delocalization.sh
-rw-r--r-- 1 root   root   1.7K Jul 26 14:53 gcs_localization.sh
-rw-r--r-- 1 root   root    14K Jul 26 14:53 gcs_transfer.sh
drwxrwxrwx 2 root   root    16K Jul 26 14:49 lost+found
-rw-r--r-- 1 root   root   4.7K Jul 26 14:53 script
-rw-r--r-- 1 topmed topmed  245 Jul 26 14:53 stderr
-rw-r--r-- 1 topmed topmed  677 Jul 26 14:53 stdout
drwxrwxrwx 2 topmed topmed 4.0K Jul 26 14:53 tmp.78134170
drwxr-xr-x 3 root   root   4.0K Jul 26 14:53 topmed_workflow_testing

aofarrel commented 3 years ago

The Terra errors for this commit reference ln even though I'm not using ln in the task.

ln: failed to access '/cromwell_root/*.gds': No such file or directory

However, outside of the task but within the script folder...

# hardlink or symlink all the files into the glob directory
( ln -L /cromwell_root/*.gds /cromwell_root/glob-5650d15b9bd471dc83ac35b7daef1c7b 2> /dev/null ) || ( ln /cromwell_root/*.gds /cromwell_root/glob-5650d15b9bd471dc83ac35b7daef1c7b )
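One plausible reading of that error: the generated line first tries `ln -L` with its errors silenced, then falls back to plain `ln`, whose errors are not silenced. If no `.gds` file exists at the top level of `/cromwell_root`, a POSIX shell leaves the unmatched pattern as a literal string, so `ln` is handed the text `/cromwell_root/*.gds` and reports it as missing. The sketch below shows that shell behavior in isolation. The function names are hypothetical and nothing here is Cromwell code.

```shell
# Sketch: what does the shell pass to a command when a glob matches nothing?
# expand_gds DIR prints one line per word that DIR/*.gds expands to.
# With default shell options, an unmatched pattern is left as literal text.
expand_gds() {
    for f in "$1"/*.gds; do
        printf '%s\n' "$f"
    done
}

# has_gds DIR succeeds only if at least one real .gds file exists in DIR,
# which is the kind of guard that avoids passing a literal pattern along.
has_gds() {
    for f in "$1"/*.gds; do
        [ -e "$f" ] && return 0
    done
    return 1
}
```

Under this reading the message does not come from the task's own command at all, only from the output glob having nothing to match in the expected directory.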

aofarrel commented 2 years ago

Going through old branches, I found terra-permissions-workaround from about ten months ago. This is essentially all it changed; it only touched vcf-to-gds. I highly doubt the information in it is all true, since that workaround isn't implemented in what's on main, but it may be worth recording...

[Screen Shot 2022-07-12 at 11:56:56 AM]