BIMSBbioinfo / pigx_rnaseq

Bulk RNA-seq Data Processing, Quality Control, and Downstream Analysis Pipeline
GNU General Public License v3.0

Pandoc error when running using cluster submission #137

Open alexg9010 opened 2 months ago

alexg9010 commented 2 months ago

I was running the test data in a cluster environment.

I had to extend the memory limit for counts_from_SALMON in tests/settings.yaml:

execution:
  submit-to-cluster: yes
  rules:
    counts_from_SALMON:
      threads: 1
      memory: 2000

Then I ran the pipeline via:

export PYTHONPATH=$GUIX_PYTHONPATH
export PIGX_UNINSTALLED="1" ; ./pigx-rnaseq -s tests/settings.yaml tests/sample_sheet.csv

The pipeline failed for the report generating jobs:

[...]
Error in rule report1:
    jobid: 40
    output: /fast/home/a/agosdsc/projects/pigx/pigx_rnaseq/tests/output/report/hisat2/analysis1.deseq.report.html, /fast/home/a/agosdsc/projects/pigx/pigx_rnaseq/tests/output/report/hisat2/analysis1.deseq_results.tsv
    log: /fast/home/a/agosdsc/projects/pigx/pigx_rnaseq/tests/output/logs/hisat2/analysis1.report.log (check log file(s) for error message)
    shell:
        /gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/Rscript --vanilla /fast/home/a/agosdsc/projects/pigx/pigx_rnaseq/scripts/runDeseqReport.R --logo=/fast/home/a/agosdsc/projects/pigx/pigx_rnaseq/images/Logo_PiGx.png --prefix='analysis1' --reportFile=/fast/home/a/agosdsc/projects/pigx/pigx_rnaseq/scripts/deseqReport.Rmd --countDataFile=/fast/home/a/agosdsc/projects/pigx/pigx_rnaseq/tests/output/feature_counts/raw_counts/hisat2/counts.tsv --colDataFile=/fast/home/a/agosdsc/projects/pigx/pigx_rnaseq/tests/output/colData.tsv --gtfFile=/fast/home/a/agosdsc/projects/pigx/pigx_rnaseq/tests/sample_data/sample.gtf --caseSampleGroups='HBR' --controlSampleGroups='UHR' --covariates=''  --workdir=/fast/home/a/agosdsc/projects/pigx/pigx_rnaseq/tests/output/report/hisat2 --organism='' --description='This analysis is part of the pigx-rnaseq build-time tests.' --selfContained='True' >> /fast/home/a/agosdsc/projects/pigx/pigx_rnaseq/tests/output/logs/hisat2/analysis1.report.log 2>&1
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    cluster_jobid: Your job 7042317 ("snakejob.report1.40.sh") has been submitted

Error executing rule report1 on cluster (jobid: 40, external: Your job 7042317 ("snakejob.report1.40.sh") has been submitted, jobscript: /fast/home/a/agosdsc/projects/pigx/pigx_rnaseq/tests/output/.snakemake/tmp.a2y6tlv1/snakejob.report1.40.sh). For error details see the cluster log and the log files of the involved rule(s).
[...]

This is the content of the log:

 $ cat  /fast/home/a/agosdsc/projects/pigx/pigx_rnaseq/tests/output/logs/salmon/analysis1.report.salmon.genes.log

arguments: --logo=/gnu/store/1nwmyp16abzi3yhvk43g0m21plcbgw5g-pigx-rnaseq-0.1.0/share/pigx_rnaseq/Logo_PiGx.png --prefix=D3_VS_WILDTYPE.salmon.transcripts --reportFile=/gnu/store/1nwmyp16abzi3yhvk43g0m21plcbgw5g-pigx-rnaseq-0.1.0/libexec/pigx_rnaseq/scripts/deseqReport.Rmd --countDataFile=/fast/AG_Akalin/agosdsc/projects/testing_swaroop/role_of_pde3a_in_htnb/feature_counts/raw_counts/salmon/counts_from_SALMON.transcripts.tsv --colDataFile=/fast/AG_Akalin/agosdsc/projects/testing_swaroop/role_of_pde3a_in_htnb/colData.tsv --gtfFile=/fast/AG_Klussmann/swaroop/rat_annotation/gtf/Rattus_norvegicus.mRatBN7.2.111.gtf --caseSampleGroups=D3_MUTANT --controlSampleGroups=WILD_TYPE --covariates= --workdir=/fast/AG_Akalin/agosdsc/projects/testing_swaroop/role_of_pde3a_in_htnb/report/salmon --organism= --description=Comparison of D3 mutatants vs wildtype --selfContained=True
setting working directory to  /fast/AG_Akalin/agosdsc/projects/testing_swaroop/role_of_pde3a_in_htnb/report/salmon
Error: pandoc version 1.12.3 or higher is required and was not found (see the help page ?rmarkdown::pandoc_available).
Execution halted

I see this pandoc-related error:

Error: pandoc version 1.12.3 or higher is required and was not found (see the help page ?rmarkdown::pandoc_available).

rekado commented 2 months ago

Yikes. We use pandoc 2. Perhaps something broke in the rmarkdown check for pandoc? I'll take a look.

rekado commented 2 months ago

I just did this and it works fine:

guix shell --container r-minimal r-rmarkdown -- R -e 'rmarkdown::pandoc_available("2.11")'

So, that's not it.

The reason is likely that you're using PiGx from a checkout. I would assume that on the cluster nodes you don't actually have Pandoc. What does the tools section of the settings file look like? Using PIGX_UNINSTALLED is also a red flag.
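One way to test the assumption that Pandoc is missing on the compute nodes is to run a small check from inside a submitted job. The following is only a sketch; the helper name check_tool is hypothetical and not part of PiGx:

```shell
# Hypothetical helper: report whether a tool is reachable via PATH on this node.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: NOT found on PATH"
        return 1
    fi
}

# Do not abort the job if pandoc is missing; we only want the report.
check_tool pandoc || true
```

Placing this in the job script (or running it in an interactive session on a node) shows whether the node's PATH differs from the one on the submit host.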

alexg9010 commented 2 months ago

This is the tools section from the generated config.json; the test settings file does not contain any tool specification:

  "tools": {
        "Rscript": {
            "args": "--vanilla",
            "executable": "/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/Rscript"
        },
        "bamCoverage": {
            "args": "--normalizeUsing BPM --numberOfProcessors 2",
            "executable": "/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/bamCoverage"
        },
        "fastp": {
            "args": "--adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
            "executable": "/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/fastp"
        },
        "gunzip": {
            "args": "",
            "executable": "/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/gunzip"
        },
        "hisat2": {
            "args": "--fast",
            "executable": "/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/hisat2"
        },
        "hisat2-build": {
            "args": "",
            "executable": "/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/hisat2-build"
        },
        "megadepth": {
            "args": "",
            "executable": "/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/megadepth"
        },
        "multiqc": {
            "args": "",
            "executable": "/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/multiqc"
        },
        "salmon_index": {
            "args": "index",
            "executable": "/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/salmon"
        },
        "salmon_quant": {
            "args": "quant",
            "executable": "/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/salmon"
        },
        "samtools": {
            "args": "",
            "executable": "/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/samtools"
        },
        "sed": {
            "args": "",
            "executable": "/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/sed"
        },
        "star_index": {
            "args": "",
            "executable": "/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/STAR"
        },
        "star_map": {
            "args": "",
            "executable": "/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/STAR"
        }
    }
}

alexg9010 commented 2 months ago

It seems the pandoc path is determined via rmarkdown::find_pandoc() (see https://bookdown.org/yihui/rmarkdown-cookbook/install-pandoc.html).

The documentation describes find_pandoc() as follows:

Searches for the pandoc executable in a few places and use the highest version found, unless a specific version is requested. Source: https://pkgs.rstudio.com/rmarkdown/reference/find_pandoc.html

Specifically, it searches the locations given by "RSTUDIO_PANDOC", "PATH" (via rmarkdown:::find_program()) and the folder "~/opt/pandoc":

https://github.com/rstudio/rmarkdown/blob/ee69d59f8011ad7b717a409fcbf8060d6ffc4139/R/pandoc.R#L663C1-L668C34
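As a rough illustration of that search, here is a shell sketch that lists which of the three locations actually contain a pandoc executable. It is an approximation for debugging, not rmarkdown's code: the function name pandoc_candidate_dirs is made up, and it does not compare versions the way find_pandoc() does.

```shell
# Approximation of the search described above; NOT rmarkdown's implementation.
# Prints each candidate directory that contains an executable file named "pandoc".
pandoc_candidate_dirs() {
    # Directory of the pandoc found on PATH, if any.
    path_dir=""
    if path_hit=$(command -v pandoc 2>/dev/null); then
        path_dir=$(dirname "$path_hit")
    fi

    for dir in "${RSTUDIO_PANDOC:-}" "$path_dir" "${HOME:-}/opt/pandoc"; do
        # Each candidate is treated as a directory: we look for "$dir/pandoc".
        if [ -n "$dir" ] && [ -x "$dir/pandoc" ] && [ ! -d "$dir/pandoc" ]; then
            echo "$dir"
        fi
    done
}

pandoc_candidate_dirs
```

Note that every candidate is treated as a directory, so a value that points at the pandoc binary itself yields no match in this sketch.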

There is no "~/opt/pandoc", but exporting "RSTUDIO_PANDOC" via qsub is possible by updating the qsub template:

qsub-template.sh.in:

#!@GNUBASH@
# properties = {properties}

if [ 'yes' = '@capture_environment@' ]; then
    export R_LIBS_SITE="@R_LIBS_SITE@"
    export PYTHONPATH="@PYTHONPATH@"
    export RSTUDIO_PANDOC="@PANDOC@"
fi

env

{exec_job}

I checked which pandoc version is used by adding this line to rule report1:

{RSCRIPT_EXEC} -e 'rmarkdown::find_pandoc()'

We can inspect the job's environment by checking the job log output:

$ less tests/output/snakejob.report1.40.sh.o7043287

[...]
RSTUDIO_PANDOC=/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin/pandoc
[...]
$version
[1] ‘0’

$dir
NULL

So it seems no matching dir was found.

Running rmarkdown::find_pandoc() inside a "guix environment -l guix.scm" shell in the pigx folder works:

> rmarkdown::find_pandoc()
sh: warning: setlocale: LC_ALL: cannot change locale (en_US.utf-8)
$version
[1] '2.19.2'

$dir
[1] "/gnu/store/b0skxv953fpsdg79cs4g9qz78ds6pvlz-profile/bin"

rekado commented 2 months ago

My reading of pandoc.R tells me that RSTUDIO_PANDOC is meant to be a directory. Give it the dirname of @PANDOC@ instead.

alexg9010 commented 2 months ago

Thanks, using dirname of pandoc works.
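For reference, a minimal sketch of the change in the template, assuming @PANDOC@ expands to the full path of the pandoc executable. The path below is a placeholder for illustration, not the real store path:

```shell
# Placeholder path; in qsub-template.sh.in this would be the substituted @PANDOC@.
pandoc_exe="/path/to/profile/bin/pandoc"

# RSTUDIO_PANDOC must name the directory that contains pandoc, not the binary.
RSTUDIO_PANDOC=$(dirname "$pandoc_exe")
export RSTUDIO_PANDOC

echo "RSTUDIO_PANDOC=$RSTUDIO_PANDOC"
```

With the variable set this way, the "env" output of a job should show a directory, and rmarkdown::find_pandoc() should report that directory in $dir.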

I will try to fix this in pigx-common.