Open aofarrel opened 3 years ago
The LD Pruning workflow's merge_gds task used to use this workaround for the exact same reason as vcf-to-gds: there's an input array of files from a previous scattered task, and the config file will only support them if every file in that array is in the same directory. However, it turns out its R scripts open the files in read-only mode, so symlinks will suffice.
All this time I was assuming that the issues I had with softlinks were unavoidable, both because vcf-to-gds doesn't work with them and because of all these people saying "softlinks don't exist in Google Cloud Storage," but I've since updated this issue and older comments to explain that the issue is actually down to permissions.
in:
ls -lha .
out:
total 28K
drwxr-xrwx 9 topmed topmed 288 Jul 26 14:15 .
drwxr-xrwx 5 topmed topmed 160 Jul 26 14:15 ..
-rw-r--r-- 1 topmed topmed 6.3K Jul 26 14:15 script
-rw-r--r-- 1 topmed topmed 461 Jul 26 14:15 script.background
-rw-r--r-- 1 topmed topmed 499 Jul 26 14:15 script.submit
-rw-r--r-- 1 topmed topmed 12 Jul 26 14:15 stderr
-rw-r--r-- 1 topmed topmed 12 Jul 26 14:15 stderr.background
-rw-r--r-- 1 topmed topmed 0 Jul 26 14:15 stdout
-rw-r--r-- 1 topmed topmed 6 Jul 26 14:15 stdout.background
in:
ls -lha ../inputs
out:
total 0
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 -1179415630
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 -478351052
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 .
drwxr-xrwx 5 topmed topmed 160 Jul 26 14:15 ..
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 1318600307
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 2019664885
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 222713526
in:
ls -lha ../inputs/*
out:
../inputs/-1179415630:
total 72K
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 .
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 ..
-rw-r--r-- 2 topmed topmed 71K Jul 26 14:15 1KG_phase3_subset_chr1.gds

../inputs/-478351052:
total 72K
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 .
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 ..
-rw-r--r-- 2 topmed topmed 72K Jul 26 14:14 1KG_phase3_subset_chr3.gds

../inputs/1318600307:
total 72K
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 .
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 ..
-rw-r--r-- 2 topmed topmed 72K Jul 26 14:15 1KG_phase3_subset_chr2.gds

../inputs/2019664885:
total 76K
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 .
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 ..
-rw-r--r-- 2 topmed topmed 73K Jul 26 14:15 1KG_phase3_subset_chr20.gds

../inputs/222713526:
total 60K
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 .
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 ..
-rw-r--r-- 2 topmed topmed 58K Jul 26 14:14 1KG_phase3_subset_chrX.gds
After the creation of symbolic links, the execution directory has:
total 32K
drwxr-xrwx 14 topmed topmed 448 Jul 26 14:22 .
drwxr-xrwx 5 topmed topmed 160 Jul 26 14:22 ..
lrwxr-xr-x 1 topmed topmed 123 Jul 26 14:22 1KG_phase3_subset_chr1.gds -> /bark-bark/vcftogds/46c1a6d5-73da-4a34-befd-6cee3cf19b05/call-unique_variant_id/inputs/872935859/1KG_phase3_subset_chr1.gds
lrwxr-xr-x 1 topmed topmed 124 Jul 26 14:22 1KG_phase3_subset_chr2.gds -> /bark-bark/vcftogds/46c1a6d5-73da-4a34-befd-6cee3cf19b05/call-unique_variant_id/inputs/-924015500/1KG_phase3_subset_chr2.gds
lrwxr-xr-x 1 topmed topmed 125 Jul 26 14:22 1KG_phase3_subset_chr20.gds -> /bark-bark/vcftogds/46c1a6d5-73da-4a34-befd-6cee3cf19b05/call-unique_variant_id/inputs/-222950922/1KG_phase3_subset_chr20.gds
lrwxr-xr-x 1 topmed topmed 124 Jul 26 14:22 1KG_phase3_subset_chr3.gds -> /bark-bark/vcftogds/46c1a6d5-73da-4a34-befd-6cee3cf19b05/call-unique_variant_id/inputs/1574000437/1KG_phase3_subset_chr3.gds
lrwxr-xr-x 1 topmed topmed 125 Jul 26 14:22 1KG_phase3_subset_chrX.gds -> /bark-bark/vcftogds/46c1a6d5-73da-4a34-befd-6cee3cf19b05/call-unique_variant_id/inputs/-2019902281/1KG_phase3_subset_chrX.gds
-rw-r--r-- 1 topmed topmed 6.3K Jul 26 14:22 script
-rw-r--r-- 1 topmed topmed 461 Jul 26 14:22 script.background
-rw-r--r-- 1 topmed topmed 499 Jul 26 14:22 script.submit
-rw-r--r-- 1 topmed topmed 1.7K Jul 26 14:22 stderr
-rw-r--r-- 1 topmed topmed 12 Jul 26 14:22 stderr.background
-rw-r--r-- 1 topmed topmed 1.9K Jul 26 14:22 stdout
-rw-r--r-- 1 topmed topmed 6 Jul 26 14:22 stdout.background
On Google, ls -lha ../* finds:
ls -lha ../bin ../boot ../cromwell_root ../dev ../etc ../google ../home ../lib ../lib64 ../media ../mnt ../opt ../proc ../root ../run ../sbin ../srv ../sys ../tmp ../usr ../var
Within ../cromwell_root:
drwxrwxrwx 5 root root 4.0K Jul 26 14:53 .
drwxr-xr-x 1 root root 4.0K Jul 26 14:53 ..
-rw-r--r-- 1 root root 2.0K Jul 26 14:53 gcs_delocalization.sh
-rw-r--r-- 1 root root 1.7K Jul 26 14:53 gcs_localization.sh
-rw-r--r-- 1 root root 14K Jul 26 14:53 gcs_transfer.sh
drwxrwxrwx 2 root root 16K Jul 26 14:49 lost+found
-rw-r--r-- 1 root root 4.7K Jul 26 14:53 script
-rw-r--r-- 1 topmed topmed 245 Jul 26 14:53 stderr
-rw-r--r-- 1 topmed topmed 677 Jul 26 14:53 stdout
drwxrwxrwx 2 topmed topmed 4.0K Jul 26 14:53 tmp.78134170
drwxr-xr-x 3 root root 4.0K Jul 26 14:53 topmed_workflow_testing
The Terra errors for this commit reference ln even though I'm not using ln in the task.
ln: failed to access '/cromwell_root/*.gds': No such file or directory
However, outside of the task but within the script folder...
# hardlink or symlink all the files into the glob directory
( ln -L /cromwell_root/*.gds /cromwell_root/glob-5650d15b9bd471dc83ac35b7daef1c7b 2> /dev/null ) || ( ln /cromwell_root/*.gds /cromwell_root/glob-5650d15b9bd471dc83ac35b7daef1c7b )
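The error text is consistent with how a shell treats a glob that matches nothing: by default the pattern is passed through unexpanded, so ln receives the literal string ending in *.gds. A minimal sketch reproducing that in a scratch directory (the temporary directory and the glob-dir name are stand-ins, not the real Cromwell paths):

```shell
# Scratch directory standing in for /cromwell_root (hypothetical).
workdir="$(mktemp -d)"
mkdir "$workdir/glob-dir"

# No .gds files exist here, so the pattern stays unexpanded and ln is
# handed the literal string ending in '*.gds'. Capture its message and status.
ln_error="$(ln "$workdir"/*.gds "$workdir/glob-dir" 2>&1)" && ln_status=0 || ln_status=$?

echo "exit status: $ln_status"
echo "$ln_error"
```

If that is what happened on Terra, it would suggest no .gds files were present in /cromwell_root when the glob step ran, rather than a problem with ln itself.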
While going through old branches, I found terra-permissions-workaround from about ten months ago. This is essentially all it changed; it only touched vcf-to-gds. I highly doubt the information in it is all true, since that workaround isn't implemented in what's on main, but it may be worth recording...
In brief
Some tasks currently have a workaround wherein some input files are copied over twice. This results in these tasks requiring up to twice as much disk space as they would otherwise. Currently, disk size estimate calculations for these tasks account for this, so it is unlikely this will cause users to get an error. But this still results in slightly increased costs, so it's worth making note of.
Example
The first task of the vcf-to-gds workflow generates GDS files in a scattered task. The second task, which is not scattered, takes in those files as inputs to give them unique variant IDs.
In this situation, where a scattered task passes its outputs to a non-scattered task, the output of each instance of the scattered task is placed into its own new folder. Let's say my scattered task runs on five vcf files, generating five gds files. My second task is passed those gds files like this:
That is to say, each gds file now lives in its own folder within /inputs/.
This is problematic with how the R scripts use configuration files. These configuration files expect one line to represent a given pattern for an input file, such as
where the space is filled in with expected chromosome numbers by the script itself at runtime.
We have two options when referring to files like these when making configuration files: Either we pass in the path, or just a filename. If we pass in the full path, the resulting configuration file will be invalid, because every gds file has a different path due to each gds file living in a separate folder. If we pass in a filename, the resulting configuration file will technically be valid, but it will fail because the files strictly speaking do not exist in the working directory, but rather in some subfolder of /inputs/.
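To make the two options concrete, here is a small sketch that counts how many distinct directories and how many distinct filename stems a set of localized inputs has. The paths are hypothetical, modelled on the listings earlier in this thread:

```shell
# Hypothetical localized paths, one folder per input file.
paths="inputs/-1179415630/1KG_phase3_subset_chr1.gds
inputs/1318600307/1KG_phase3_subset_chr2.gds
inputs/-478351052/1KG_phase3_subset_chr3.gds"

# Distinct parent directories: one per file, so no single path pattern fits them all.
n_dirs="$(printf '%s\n' "$paths" | while read -r p; do dirname "$p"; done | sort -u | wc -l | tr -d ' ')"

# Drop the directory and the chromosome label: every file reduces to the same stem.
n_stems="$(printf '%s\n' "$paths" | while read -r p; do basename "$p"; done | sed 's/chr[0-9XY]*\.gds$/chr.gds/' | sort -u | wc -l | tr -d ' ')"

echo "distinct directories: $n_dirs"
echo "distinct filename stems: $n_stems"
```

Three directories but only one stem: a single pattern line can describe the filenames, but not the full paths.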
However, if we copy or symlink each of those input files into the working directory, we can use the filename method, because now files are actually where the R script expects them.
Where gdss is the array of input files from the previous scattered task.
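The original snippet is not preserved in this thread. As a rough sketch of the idea only (not the workflow's actual command block), the loop below links every input into one working directory so that a bare filename resolves. The folder and file names are made up, and the glob stands in for iterating over gdss:

```shell
# Throwaway layout mimicking one-folder-per-input (all names hypothetical).
root="$(mktemp -d)"
mkdir -p "$root/inputs/111" "$root/inputs/222" "$root/execution"
echo "chr1 data" > "$root/inputs/111/subset_chr1.gds"
echo "chr2 data" > "$root/inputs/222/subset_chr2.gds"

# Symlink each input into the working directory under its own filename.
for gds in "$root"/inputs/*/*.gds; do
    ln -s "$gds" "$root/execution/"
done

ls -l "$root/execution"
```

A script running in the execution directory can then open subset_chr1.gds by name, provided it only needs read access to the link's target.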
However, this approach is non-functional on Terra -- a permission denied error is thrown. There are three root causes:
rw-r--r-- permissions
chmod or mv are also not allowed on Terra in this context, so we need to duplicate the files to create a copy for which we have write permissions.
Other workflows are able to use symlinks as they only open the inputs as read-only.
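For comparison, a sketch of the copy-based workaround and its disk cost. The paths and contents are placeholders; the point is only that cp creates a new regular file owned by whichever user runs the task, and that the same bytes then exist twice:

```shell
# Throwaway layout with a single placeholder input (names hypothetical).
root="$(mktemp -d)"
mkdir -p "$root/inputs/111" "$root/execution"
printf 'placeholder gds bytes' > "$root/inputs/111/subset_chr1.gds"

# Copy rather than link: the result is an independent regular file.
cp "$root/inputs/111/subset_chr1.gds" "$root/execution/"

# Compare the input's size with the size of input plus copy.
input_bytes="$(wc -c < "$root/inputs/111/subset_chr1.gds" | tr -d ' ')"
total_bytes="$(cat "$root"/inputs/*/*.gds "$root"/execution/*.gds | wc -c | tr -d ' ')"
echo "input: $input_bytes bytes, input + copy: $total_bytes bytes"
```

This is the doubling that the disk size estimates mentioned under "In brief" have to account for.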