NBISweden / IgDiscover-legacy

Analyze antibody repertoires and discover new V genes from high-throughput sequencing reads
https://www.igdiscover.se
MIT License

Snakemake workflow problem with symlinks and temp() #107

Closed ressy closed 4 years ago

ressy commented 4 years ago

I just tried adding our forward primer to my config (there's no reverse primer in our read layout), and oddly enough it broke igdiscover's snakemake workflow at the rule dont_trim_reverse_primers. After digging into it a bit, it looks like the problem is the use of symbolic links in conjunction with the temp() feature.

The last bit of the output looks like:

[Fri Feb 28 17:14:08 2020]
rule dont_trim_reverse_primers:
    input: reads/3-forward-primer-trimmed.fastq.gz
    output: reads/4-trimmed.fastq.gz
    jobid: 107
    wildcards: ext=fastq
    resources: time=1

Job counts:
        count   jobs
        1       dont_trim_reverse_primers
        1
Removing temporary output file reads/3-forward-primer-trimmed.fastq.gz.
[Fri Feb 28 17:14:09 2020]
Finished job 107.
2 of 93 steps (2%) done
WorkflowError:
File reads/4-trimmed.fastq.gz seems to be a broken symlink.
Total CPU time: 0h 0.96m
ERROR:

Looking at the rules and inputs/outputs that apply:

When Snakemake finishes dont_trim_reverse_primers, it treats the temp file as no longer needed and removes it, which breaks the symlink needed by the last rule. It only seems to notice the broken link when there is a rule that then tries to use that symlink, though.

This seemed like a weird situation so I made a minimal example with plain Snakemake to pin it down:

import os
import os.path

rule use_symlink:
    input: "link_to_a_file.txt"
    output: "link_stats.txt"
    shell: "stat {input} > {output}"

rule symlink_a_file:
    input: "a_file.txt"
    output: "link_to_a_file.txt"
    run:
        # igdiscover's relative symlink
        target = os.path.relpath(os.path.abspath(input[0]), start=os.path.dirname(output[0]))
        os.symlink(target, output[0])

rule make_a_file:
    output: temp("a_file.txt")
    shell: "touch {output}"

And this is what I see when calling snakemake:

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 56
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       make_a_file
    1       symlink_a_file
    1       use_symlink
    3

[Fri Feb 28 17:08:23 2020]
rule make_a_file:
    output: a_file.txt
    jobid: 2

[Fri Feb 28 17:08:23 2020]
Finished job 2.
1 of 3 steps (33%) done

[Fri Feb 28 17:08:23 2020]
rule symlink_a_file:
    input: a_file.txt
    output: link_to_a_file.txt
    jobid: 1

Job counts:
    count   jobs
    1       symlink_a_file
    1
Removing temporary output file a_file.txt.
[Fri Feb 28 17:08:23 2020]
Finished job 1.
2 of 3 steps (67%) done
WorkflowError:
File link_to_a_file.txt seems to be a broken symlink.

This is with snakemake 5.9.1 and igdiscover 0.12.1.
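
The underlying filesystem behavior can be reproduced without Snakemake. The following is a minimal Python sketch, not code from igdiscover or Snakemake: the helper name `check_link_after_target_removal` is hypothetical, the file names are borrowed from the minimal example above, and a POSIX system where unprivileged symlinks are allowed is assumed. It builds the relative symlink the same way as the symlink_a_file rule, deletes the target (as temp() cleanup would), and reports what is left.

```python
import os
import tempfile


def check_link_after_target_removal(workdir):
    """Create a relative symlink, delete its target, and report the link state.

    Returns a tuple (is_link, resolves):
      is_link  -- the link entry itself still exists in the directory
      resolves -- following the link still reaches a file
    """
    target_path = os.path.join(workdir, "a_file.txt")
    link_path = os.path.join(workdir, "link_to_a_file.txt")

    # Stand-in for the make_a_file rule: create an empty file.
    open(target_path, "w").close()

    # Same relative-path computation as in the symlink_a_file rule.
    target = os.path.relpath(
        os.path.abspath(target_path), start=os.path.dirname(link_path)
    )
    os.symlink(target, link_path)

    # Stand-in for temp() cleanup: the symlink's target is deleted.
    os.remove(target_path)

    # islink() inspects the link itself; exists() follows it.
    return os.path.islink(link_path), os.path.exists(link_path)


with tempfile.TemporaryDirectory() as workdir:
    is_link, resolves = check_link_after_target_removal(workdir)
    print("link entry present:", is_link)
    print("link resolves to a file:", resolves)
```

The link entry stays in the directory but no longer resolves, which matches the "seems to be a broken symlink" message in the log.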

marcelm commented 4 years ago

Cool, thanks for the detailed report! I’m taking some time off at the moment, but will look into this (and try to fix it) as soon as I’m back.

ressy commented 4 years ago

Great! No rush from my end. I removed temp(...) on trim_forward_primers's output for now and it finished OK.

marcelm commented 4 years ago

Thanks again for the good investigation! I’ve fixed the issue now by creating a hardlink instead of a symlink when only forward_primers but no reverse_primers are specified. I wanted to keep the symlink in the "normal" case to make it clearer what the pipeline does. It’s more of a workaround than a real fix, but it should be good enough.
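
Why a hardlink sidesteps the problem can be illustrated with a short Python sketch. This is not the actual igdiscover change: the helper name `read_hardlink_after_source_removal` is hypothetical, the file names are taken from the log in the report, the file content is a placeholder, and a filesystem that supports hard links is assumed. A hard link is a second directory entry for the same data, so removing the original name leaves the data reachable through the other one.

```python
import os
import tempfile


def read_hardlink_after_source_removal(workdir):
    """Hard-link a file, delete the original name, and read via the link."""
    source = os.path.join(workdir, "3-forward-primer-trimmed.fastq.gz")
    dest = os.path.join(workdir, "4-trimmed.fastq.gz")

    # Placeholder content standing in for the trimmed reads.
    with open(source, "w") as handle:
        handle.write("placeholder reads\n")

    # Second directory entry pointing at the same underlying data.
    os.link(source, dest)

    # Stand-in for temp() cleanup of the rule's input file.
    os.remove(source)

    # The data is still reachable through the remaining name.
    with open(dest) as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as workdir:
    content = read_hardlink_after_source_removal(workdir)
    print(repr(content))
```

One trade-off of this approach is that hard links only work within a single filesystem, and a directory listing no longer shows that the two names refer to the same data, which fits the comment about keeping the symlink in the "normal" case for clarity.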

ressy commented 4 years ago

Thanks!