EpiDiverse / wgbs

The EpiDiverse Whole Genome Bisulfite Sequencing Pipeline, implemented with Nextflow
MIT License

Instructions for local install? #8

Open kubu4 opened 2 years ago

kubu4 commented 2 years ago

I'm trying to run the pipeline test via the Singularity image on our university's computing cluster, which doesn't have internet access when executing jobs.

I've downloaded all of the input files listed in test.config. I've also downloaded the Singularity image (singularity pull docker://epidiverse/wgbs:1.0) and changed the nextflow.config file to point to the Singularity image location, like so:

// -profile singularity
    singularity {
        includeConfig "${baseDir}/config/base.config"
        singularity.enabled = true
        process.container = '/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/work/singularity/wgbs_1.0.sif'
    }

I expected that to be all that was needed, but when I execute the test command (NXF_VER=20.07.1 /gscratch/srlab/programs/nextflow-21.10.6-all run /gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/wgbs-1.0 -profile test,singularity), it fails with this error:

executor >  local (10)
[c4/79070c] process > INDEX:erne_bs5_indexing        [100%] 1 of 1 ✔
[30/202688] process > INDEX:segemehl_indexing        [100%] 1 of 1 ✔
[07/dc2230] process > WGBS:read_trimming (sampleB)   [100%] 8 of 8, failed: 8...
[-        ] process > WGBS:read_merging              -
[-        ] process > WGBS:fastqc                    -
[-        ] process > WGBS:erne_bs5                  -
[-        ] process > WGBS:segemehl                  -
[-        ] process > WGBS:erne_bs5_processing       -
[-        ] process > WGBS:segemehl_processing       -
[-        ] process > WGBS:bam_merging               -
[-        ] process > WGBS:bam_subsetting            -
[-        ] process > WGBS:bam_filtering             -
[-        ] process > WGBS:bam_statistics            -
[-        ] process > CALL:bam_processing            -
[-        ] process > CALL:Picard_MarkDuplicates     -
[-        ] process > CALL:MethylDackel              -
[-        ] process > CALL:conversion_rate_estima... -

Pipeline execution summary
---------------------------
Name         : infallible_mccarthy
Profile      : test,singularity
Launch dir   : /gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test
Work dir     : /gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/work
Status       : failed
Error report : Error executing process > 'WGBS:read_trimming (sampleA)'

Caused by:
  Process `WGBS:read_trimming (sampleA)` terminated with an error exit status (1)

Command executed:

  mkdir fastq fastq/logs
  cutadapt -j 2 -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
  -q 20 -m 36 -O 3 \
  -o fastq/merge.null \
  -p fastq/merge.g null g \
  > fastq/logs/cutadapt.sampleA.merge.log 2>&1

Command exit status:
  1

Command output:
  (empty)

Work dir:
  /gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/work/12/6ee9cc7a7372a97f34f21a4f79efb3

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

When I look at the Cutadapt log file, this is what is shown:

cat cutadapt.sampleA.merge.log 
This is cutadapt 2.10 with Python 3.6.7
Command line parameters: -j 2 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -q 20 -m 36 -O 3 -o fastq/merge.null -p fastq/merge.g null g
Processing reads on 2 cores in paired-end mode ...
ERROR: Traceback (most recent call last):
  File "/opt/conda/envs/wgbs/lib/python3.6/site-packages/cutadapt/pipeline.py", line 477, in run
    with xopen(self.file, 'rb') as f:
  File "/opt/conda/envs/wgbs/lib/python3.6/site-packages/xopen/__init__.py", line 407, in xopen
    return open(filename, mode)
IsADirectoryError: [Errno 21] Is a directory: 'null'

ERROR: Traceback (most recent call last):
  File "/opt/conda/envs/wgbs/lib/python3.6/site-packages/cutadapt/pipeline.py", line 477, in run
    with xopen(self.file, 'rb') as f:
  File "/opt/conda/envs/wgbs/lib/python3.6/site-packages/xopen/__init__.py", line 407, in xopen
    return open(filename, mode)
IsADirectoryError: [Errno 21] Is a directory: 'null'

ERROR: Traceback (most recent call last):
  File "/opt/conda/envs/wgbs/lib/python3.6/site-packages/cutadapt/pipeline.py", line 540, in run
    raise e
IsADirectoryError: [Errno 21] Is a directory: 'null'

Traceback (most recent call last):
  File "/opt/conda/envs/wgbs/bin/cutadapt", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/envs/wgbs/lib/python3.6/site-packages/cutadapt/__main__.py", line 855, in main
    stats = r.run()
  File "/opt/conda/envs/wgbs/lib/python3.6/site-packages/cutadapt/pipeline.py", line 770, in run
    raise e
IsADirectoryError: [Errno 21] Is a directory: 'null'

Did I miss something that needs to be set up for a local install to run properly?
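
The last frame of each traceback above is a plain `open(filename, mode)` call, so the failure can be reproduced outside the pipeline. A minimal sketch, assuming a Linux system like the cluster nodes and using a throwaway temporary directory; the helper name `try_open` is made up for illustration:

```python
import errno
import os
import tempfile


def try_open(path):
    """Open `path` for binary reading; return the errno if it is a directory, else None."""
    try:
        with open(path, "rb"):
            return None
    except IsADirectoryError as exc:
        return exc.errno


with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for the input called 'null' in the process work dir.
    null_dir = os.path.join(workdir, "null")
    os.mkdir(null_dir)
    result = try_open(null_dir)
    # Compare against the symbolic constant rather than a hard-coded number.
    print(result == errno.EISDIR)
```

This suggests cutadapt itself is behaving normally: it was handed an input named `null` that is a directory rather than a FASTQ file.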

bio15anu commented 2 years ago

Thanks for opening this issue! There seems to be something going on with the "Command executed:" section in the error message. Specifically here:

  -o fastq/merge.null \
  -p fastq/merge.g null g \

where "null" should reflect the reads variable from L48-L49 in wgbs.nf

 -o fastq/${params.merge ? "${readtype}." : ""}${reads[0]} \\
 -p fastq/${params.merge ? "${readtype}." : ""}${reads[1]} ${reads} \\

I suspect the issue here is that we need to create a new test.config file for running the test profile offline. Can you provide some more information as to what you did here, exactly? Did you modify the paths in the existing test.config file?

bio15anu commented 2 years ago

As an aside to this issue, I just wanted to point out that during a typical pipeline run it is not necessary to have an open internet connection. If your intention is to submit to a queuing system, for example, which perhaps sends the job to another node where there is no internet connection, it should be enough to have already pulled the pipeline normally from the login node. You will get a local copy of the pipeline in ~/.nextflow/assets which is the first place nextflow will look for the pipeline whenever you run it.

Is that relevant for your use case at all?
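
To check from the login node whether such a local copy is already in place, something like the following could be used. It is only a sketch: it assumes the default assets location mentioned above and that the project was pulled under the name `epidiverse/wgbs`; the function name `local_pipeline_copy` is hypothetical.

```python
from pathlib import Path


def local_pipeline_copy(project, home=None):
    """Return the expected local clone of `project` under ~/.nextflow/assets, or None if absent."""
    base = Path(home) if home is not None else Path.home()
    candidate = base / ".nextflow" / "assets" / project
    return candidate if candidate.is_dir() else None


copy = local_pipeline_copy("epidiverse/wgbs")
if copy is None:
    print("no local copy found; run `nextflow pull epidiverse/wgbs` on the login node")
else:
    print(f"local copy found at {copy}")
```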

kubu4 commented 2 years ago

Thanks for looking into this. It is much appreciated!

> Did you modify the paths in the existing test.config file?

Gah! Yes! Sorry for not including that!! Here's what the modified test.config file looks like:

/*
 * -------------------------------------------------
 *  Nextflow config file for running tests
 * -------------------------------------------------
 * Defines bundled input files and everything required
 * to run a fast and simple test. Use as follows:
 *   nextflow run epidiverse/wgbs -profile test
 */

params {

    // enable all steps
    input = "test profile"
    merge = true
    INDEX = true
    trim = true
    fastqc = true
    unique = true

    // genome reference
    reference = "/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/genome/genome.fa"

    // set readPaths parameter (only available in test profile)
    readPaths = [
    ['sampleA', 'input', '/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/sampleA_1.fastq.gz','/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/sampleA_2.fastq.gz'],
    ['sampleB', 'input', '/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/sampleB_1.fastq.gz','/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/sampleB_2.fastq.gz']
    ]

    // set mergePaths parameter (only available in test profile)
    mergePaths = [
    ['sampleA', 'merge', '/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/merge/sampleA_1.fastq.gz','/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/merge/sampleA_2.fastq.gz'],
    ['sampleB', 'merge', '/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/merge/sampleB_1.fastq.gz','/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/merge/sampleB_2.fastq.gz']
    ]
}

> As an aside to this issue, I just wanted to point out that during a typical pipeline run it is not necessary to have an open internet connection. If your intention is to submit to a queuing system, for example, which perhaps sends the job to another node where there is no internet connection, it should be enough to have already pulled the pipeline normally from the login node. You will get a local copy of the pipeline in ~/.nextflow/assets which is the first place nextflow will look for the pipeline whenever you run it.
>
> Is that relevant for your use case at all?

Yeah, we'd be running on a high-performance computing cluster (it uses the SLURM job manager). I was just trying to confirm that the install, and using Singularity on the compute nodes, would work properly. I figured troubleshooting would be easier once the test ran successfully.

bio15anu commented 2 years ago

In this new test.config file for running offline, it looks like you've lost the nested tuples in both readPaths and mergePaths.

So for example this:

readPaths = [
    ['sampleA', 'input', '/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/sampleA_1.fastq.gz','/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/sampleA_2.fastq.gz'],
    ['sampleB', 'input', '/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/sampleB_1.fastq.gz','/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/sampleB_2.fastq.gz']
    ]

should be changed to this:

readPaths = [
    ['sampleA', 'input', ['/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/sampleA_1.fastq.gz','/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/sampleA_2.fastq.gz']],
    ['sampleB', 'input', ['/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/sampleB_1.fastq.gz','/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/sampleB_2.fastq.gz']]
    ]
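
One way to see where the `null` and `g` in the failed command could come from: if the pipeline indexes the third element of each entry as a pair of reads (an assumption about wgbs.nf, which is not shown in full here), then with the flat layout that element is a single string, and indexing a string returns single characters. A small Python illustration with shortened, made-up paths (Groovy string indexing behaves the same way in this case):

```python
# Hypothetical, shortened paths standing in for the real FASTQ locations.
flat = ["sampleA", "input", "/gscratch/sampleA_1.fastq.gz", "/gscratch/sampleA_2.fastq.gz"]
nested = ["sampleA", "input", ["/gscratch/sampleA_1.fastq.gz", "/gscratch/sampleA_2.fastq.gz"]]


def read_pair(entry):
    """Take the third element and index it as (read 1, read 2), as the nested layout expects."""
    reads = entry[2]
    return reads[0], reads[1]


print(read_pair(flat))    # ('/', 'g')
print(read_pair(nested))  # ('/gscratch/sampleA_1.fastq.gz', '/gscratch/sampleA_2.fastq.gz')
```

The `g` matches the stray `g` in `-p fastq/merge.g null g`. A path consisting only of `/` has no file name, which would be consistent with the `null` in the command and with cutadapt reporting that `null` is a directory.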

bio15anu commented 2 years ago

By the way, I am very happy to assist you in writing a configuration profile for running your Nextflow pipelines with SLURM. Nextflow integrates very nicely with such resource management software and can, for example, automatically submit each process as a job to your queue system. Please feel free to post a new issue requesting help with this and I will try to tailor it for your system as best as I can!