ENCODE-DCC / caper

Cromwell/WDL wrapper for Python
MIT License

Relative paths to data files are not accessible by Singularity containers #184

Open jgoodson opened 1 year ago

jgoodson commented 1 year ago

This error occurs while using the atac-seq-pipeline, but I do not believe it is specific to that pipeline. When using Singularity, relative paths to local input files do not get included in the Singularity bindpath. This causes jobs to fail when the input files are symlinked back to their original location (the default first-priority localization strategy), because the directory containing the original files is not bound into the container. Changing the paths to absolute paths fixes this issue.
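The failure mode described above can be illustrated in plain Python (the directory names here are hypothetical stand-ins, not Caper's actual layout): a symlink inside a bound directory is useless if its target lives outside every bound path.

```python
import os
import tempfile

# Stand-ins for the execution directory (bound via --home) and the
# original input directory (not bound).
workdir = tempfile.mkdtemp()  # e.g. .../call-align/shard-1
datadir = tempfile.mkdtemp()  # e.g. .../encode-atac/input

original = os.path.join(datadir, "sample.fastq.gz")
with open(original, "wb"):
    pass

# Cromwell's symlink localization: link the input into the work dir.
link = os.path.join(workdir, "sample.fastq.gz")
os.symlink(original, link)

# Binding only `workdir` is not enough: the symlink resolves to a file
# in `datadir`, which is outside every bound path.
target = os.path.realpath(link)
bound = [os.path.realpath(workdir)]
accessible = any(target == d or target.startswith(d + os.sep) for d in bound)
print(accessible)  # False: inside the container, open() raises FileNotFoundError
```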

This error will look something like:

Traceback (most recent call last):
  File "/software/atac-seq-pipeline/src/encode_task_trim_adapter.py", line 214, in <module>
    main()
  File "/software/atac-seq-pipeline/src/encode_task_trim_adapter.py", line 157, in main
    args.adapters[i][0] = detect_most_likely_adapter(fastqs[0])
  File "/software/atac-seq-pipeline/src/detect_adapter.py", line 49, in detect_most_likely_adapter
    fname)
  File "/software/atac-seq-pipeline/src/detect_adapter.py", line 26, in detect_adapters_and_cnts
    with open_gz(fname) as fp:
  File "/software/atac-seq-pipeline/src/detect_adapter.py", line 16, in open_gz
    return gzip.open(fname) if fname.endswith('.gz') else open(fname, 'rb')
  File "/usr/lib/python3.6/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/usr/lib/python3.6/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/gpfs/gsfs8/users/goodsonjr/encode-atac/atac/a72c0a26-b2e0-4bdc-8e63-31a52e65a332/call-align/shard-1/attempt-2/inputs/135737268/ENCFF641SFZ.subsampled.400.fastq.gz'

In this example, that file is a symlink to a file in /gpfs/gsfs8/users/goodsonjr/encode-atac/input/. The submitted script invokes singularity with this command:

singularity exec --cleanenv --home=/gpfs/gsfs8/users/goodsonjr/encode-atac/atac/a72c0a26-b2e0-4bdc-8e63-31a52e65a332/call-align/shard-1 --bind=/fdb/encode-atac-seq-pipeline/v3/hg38,/vf/db/encode-atac-seq-pipeline/v3, https://encode-pipeline-singularity-image.s3.us-west-2.amazonaws.com/atac-seq-pipeline_v2.2.0.sif /bin/bash /gpfs/gsfs8/users/goodsonjr/encode-atac/atac/a72c0a26-b2e0-4bdc-8e63-31a52e65a332/call-align/shard-1/attempt-2/execution/script

This generated bindpaths for the atac.genome_tsv file as it was an absolute path, but the relative paths to the FastQ files are discarded. Since the original files aren't in --bind or the --home path, the container cannot read the file.

Details: Looking at the Caper code, it runs caper.singularity.find_bindpath() on the input JSON file to determine which paths to bind-mount. This function constructs an autouri.AbsPath for each path to determine whether it is valid, then uses some logic to decide which parent directories to bind-mount. Since relative paths fed directly into autouri.AbsPath do not produce valid URIs, they are excluded from the bind-path generation logic. Calling this function directly reproduces the problem: this conditional:

https://github.com/ENCODE-DCC/caper/blob/2ff0999285490984e43fa8d0f74f636575f973c8/caper/singularity.py#L43-L45

evaluates to False when fed a relative path. This means the path won't be included in all_dirnames and won't contribute to the bindpath.
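The effect can be reproduced with a plain-Python stand-in for that check (autouri's real validity test is more involved; `os.path.isabs` and the function name here are only an analogy):

```python
import os

def would_contribute_to_bindpath(path: str) -> bool:
    """Hypothetical stand-in for the conditional in
    caper.singularity.find_bindpath(): only strings that parse as
    valid absolute paths reach the bind-path generation logic."""
    return os.path.isabs(path)

# An absolute input path is picked up for the bindpath...
print(would_contribute_to_bindpath("/data/encode-atac/input/reads.fastq.gz"))  # True
# ...but a relative path is silently dropped.
print(would_contribute_to_bindpath("input/reads.fastq.gz"))  # False
```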

This issue does not seem to arise with the plain local backend (without Slurm). I haven't figured out why: with the Slurm backend Caper creates symlinks in the workflow run directory, while with the local backend the inputs get copied or hard-linked, even though the generated backend.conf has the same order for backend.providers.Local.config.filesystem.local.localization.

I looked but was unable to find any documentation concerning absolute vs. relative paths, and the existing descriptions of the input JSON format use either web URIs or relative local paths. I am not sure what to suggest, although converting relative paths to absolute with os.path.abspath() before constructing the autouri.AbsPath might resolve this.
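A minimal sketch of that workaround, applied as a pre-processing step on the input JSON rather than inside Caper itself (the function name and structure are hypothetical, not part of Caper's API):

```python
import json
import os

def absolutize_inputs(input_json_path: str, out_path: str) -> None:
    """Rewrite any string value that looks like an existing relative
    path into an absolute path, so downstream bindpath generation
    only ever sees absolute paths."""
    with open(input_json_path) as fp:
        inputs = json.load(fp)

    def fix(value):
        # Only touch strings that resolve to real local files/dirs;
        # URIs and plain strings pass through unchanged.
        if isinstance(value, str) and not os.path.isabs(value) and os.path.exists(value):
            return os.path.abspath(value)
        if isinstance(value, list):
            return [fix(v) for v in value]
        if isinstance(value, dict):
            return {k: fix(v) for k, v in value.items()}
        return value

    with open(out_path, "w") as fp:
        json.dump(fix(inputs), fp, indent=2)
```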

leepc12 commented 1 year ago

I will add documentation about absolute paths in README.

Caper's localization engine, autouri, cannot distinguish between a relative path and a plain string, so it was not able to add relative paths to Singularity's bindpath. For now, it is recommended to use only absolute paths in the input JSON, particularly for Singularity.

Thanks for reporting, I will look into this and fix it soon. Please use absolute paths until it's fixed.

xk42 commented 1 year ago

This is affecting our environment as well. By default, Cromwell localization tries hard link, then soft link, then file copy. Most institutes have labs and multiple filesystems that share data, and in this case the Singularity container seems to be unable to access data that were localized by Cromwell and turned into relative paths.

sidwekhande commented 11 months ago

We faced this issue as well, and using absolute paths did not work for us. To solve this, I created a custom backend and changed the order of the localization list to:

localization = [
    "hard-link",
    "copy",
    "soft-link"
]

and then passed the custom backend file using --backend-file via cmd line.

leepc12 commented 11 months ago

@sidwekhande Are you using the slurm backend? It's weird that changing the order fixed the problem. I think you can also fix it by not including linked (symlinked or hard-linked) files in the input JSON.

For example, if you have an original file at /home/me/original/genome.tsv and it is symlinked to /home/me/linked/genome.tsv, you should not use /home/me/linked/genome.tsv. Simply using the original paths in the input JSON will fix the problem:

Make sure that all files defined in genome.tsv are not linked (soft or hard) either, e.g. FASTA, genome size files, bowtie2 indices.

I will add this to my ToDo list. I will need to edit caper.singularity.find_bindpath() function to recognize linked files and find original ones (recursively for files in genome TSV too).

{
   "atac.genome_tsv": "/home/me/original/genome.tsv"
}
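A minimal sketch of that planned change, assuming a hypothetical helper (the real find_bindpath() logic in caper/singularity.py is more involved): for each input path, bind both its own parent directory and, when the path is a symlink, the parent directory of the resolved original file.

```python
import os

def dirs_to_bind(paths):
    """Hypothetical sketch: collect parent directories to bind-mount,
    following symlinks so the original files' directories are bound too."""
    dirs = set()
    for p in paths:
        abs_p = os.path.abspath(p)
        dirs.add(os.path.dirname(abs_p))
        # If the path is (or contains) a symlink, also bind the
        # directory of the fully resolved target.
        real_p = os.path.realpath(p)
        if real_p != abs_p:
            dirs.add(os.path.dirname(real_p))
    return sorted(dirs)
```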

Please let me know if you can run it without adding your own --backend-file.