NBISweden / aMeta

Ancient microbiome snakemake workflow

Bug in Authentication_Score rule #161

Closed: LeandroRitter closed this issue 4 months ago

LeandroRitter commented 4 months ago

This seems to be a long-standing bug that has popped up here and there. It was quite annoying but did not cause serious problems, which is why we never looked into it and simply restarted aMeta a few times. I believe what happens is that the output of PMDtools (the PMDscores.txt file) is computed independently of the execution of score.R. Please have a look at these two rules:

rule PMD_scores:
    input:
        bam="results/AUTHENTICATION/{sample}/{taxid}/sorted.bam",
    output:
        scores="results/AUTHENTICATION/{sample}/{taxid}/PMDscores.txt",
    message:
        "PMD_scores: COMPUTING PMD SCORES"
    log:
        "logs/PMD_SCORES/{sample}_{taxid}.log",
    threads: 1
    conda:
        "../envs/malt.yaml"
    envmodules:
        *config["envmodules"]["malt"],
    shell:
        "(samtools view -h {input.bam} || true) | pmdtools --printDS > {output.scores}"

rule Authentication_Score:
    input:
        rma6="results/MALT/{sample}.trimmed.rma6",
        maltextractlog="results/AUTHENTICATION/{sample}/{taxid}/MaltExtract_output/log.txt",
        name_list="results/AUTHENTICATION/{sample}/{taxid}/name_list.txt",
    output:
        scores="results/AUTHENTICATION/{sample}/{taxid}/authentication_scores.txt",
    message:
        "Authentication_Score: COMPUTING AUTHENTICATION SCORES"
    params:
        exe=WORKFLOW_DIR / "scripts/score.R",
    log:
        "logs/AUTHENTICATION_SCORE/{sample}_{taxid}.log",
    threads: 1
    conda:
        "../envs/malt.yaml"
    envmodules:
        *config["envmodules"]["malt"],
    shell:
        "Rscript {params.exe} {input.rma6} $(dirname {input.maltextractlog}) {input.name_list} $(dirname {input.name_list}) &> {log};"

However, score.R uses PMDscores.txt, so it is essential that the PMD_scores rule is executed before the Authentication_Score rule. PMDscores.txt is not listed among the inputs of Authentication_Score, so Authentication_Score may currently start before the PMD_scores rule has generated the file. In that case PMDscores.txt is missing at the moment score.R runs, and the unfortunate

Error in if ((dim(df)[1] != 0) & (sum(df$V4 > 3)/dim(df)[1] > 0.1)) { :
  missing value where TRUE/FALSE needed
Execution halted

error occurs. I will try to fix this ASAP in the PR I am working on now.
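
For illustration, one way to make the ordering explicit is to list PMDscores.txt among the inputs of Authentication_Score, so that Snakemake has to run PMD_scores first. This is only a sketch of the idea and not necessarily the change that ends up in the PR. The input name `pmdscores` is made up here, and the sketch assumes score.R looks for PMDscores.txt in the directory passed as its last argument, so the shell command is left as it is.

```python
rule Authentication_Score:
    input:
        rma6="results/MALT/{sample}.trimmed.rma6",
        maltextractlog="results/AUTHENTICATION/{sample}/{taxid}/MaltExtract_output/log.txt",
        name_list="results/AUTHENTICATION/{sample}/{taxid}/name_list.txt",
        # Hypothetical extra input: it is not passed to score.R, it only tells
        # Snakemake that this rule depends on the output of PMD_scores.
        pmdscores="results/AUTHENTICATION/{sample}/{taxid}/PMDscores.txt",
    output:
        scores="results/AUTHENTICATION/{sample}/{taxid}/authentication_scores.txt",
    message:
        "Authentication_Score: COMPUTING AUTHENTICATION SCORES"
    params:
        exe=WORKFLOW_DIR / "scripts/score.R",
    log:
        "logs/AUTHENTICATION_SCORE/{sample}_{taxid}.log",
    threads: 1
    conda:
        "../envs/malt.yaml"
    envmodules:
        *config["envmodules"]["malt"],
    shell:
        "Rscript {params.exe} {input.rma6} $(dirname {input.maltextractlog}) {input.name_list} $(dirname {input.name_list}) &> {log};"
```

With the file declared as an input, Snakemake will not start Authentication_Score for a given sample and taxid until the matching PMDscores.txt exists.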

ZoePochon commented 4 months ago

That would explain the "latency" problem. Thanks for looking into it Nikolay!

LeandroRitter commented 4 months ago

Fixed in the latest PR