cbg-ethz / V-pipe

V-pipe is a pipeline designed for analysing NGS data of short viral genomes
https://cbg-ethz.github.io/V-pipe/
Apache License 2.0

Workflow error from V-pipe dry run #160

Open robertsap opened 2 months ago

robertsap commented 2 months ago

**Describe the bug**
I am analyzing data from plant (cherry) samples, hoping to determine the viral quasispecies of Little Cherry Virus. I set up my V-pipe workflow based on the SARS-CoV-2 tutorial. However, when I attempt to run V-pipe, either as a dry run or in full, I get a "workflow error".

My questions: why is there a "missing input file" for the sam2bam and gunzip rules? I was not instructed to provide any files other than the FASTQ files. Is the workflow error a bug, or am I missing something in my input/config files?

**To Reproduce**

1. V-pipe configuration file used

    general:
        virus_base_config: ""

    input:
        datadir: /samples/
        samples_file: samples.tsv
        reference: "{VPIPE_BASEDIR}/resources/LChV-2/reference.fasta"
        genes_gff: "{VPIPE_BASEDIR}/../resources/LChV-2/genomic.gff"
        read_length: 150

    output:
        datadir: /results/
        trim_primers: false
        snv: true
        local: true
        global: true
        visualization: true
        diversity: true
        QA: true
        upload: false
        dehumanized_raw_reads: false

2. Samples TSV file used

    samples
    ├── 22-L147
    │   └── 230309
    │       └── raw_data
    │           ├── 22-L147_S3_R1.fastq
    │           └── 22-L147_S3_R2.fastq
    └── 22-L801
        └── 230309
            └── raw_data
                ├── 22-L801_S14_R1.fastq
                └── 22-L801_S14_R2.fastq

    6 directories, 4 files

    vi samples.tsv

    22-L147	22-L147
    22-L801	22-L801

3. Commands executed

    ./vpipe --dryrun

4. See error

    Building DAG of jobs...
    WorkflowError:
    MissingInputException: Missing input files for rule sam2bam:
        output: /results/22-L147/22-L147/alignments/REF_aln.bam, /results/22-L147/22-L147/alignments/REF_aln.bam.bai
        wildcards: file=/results/22-L147/22-L147/alignments/REF_aln
        affected files:
            /results/22-L147/22-L147/alignments/REF_aln.sam
    WorkflowError:
    WorkflowError:
    MissingInputException: Missing input files for rule gunzip:
        output: /results/22-L147/22-L147/extracted_data/R1.fastq
        wildcards: file=/results/22-L147/22-L147/extracted_data/R1, ext=fastq
        affected files:
            /results/22-L147/22-L147/extracted_data/R1.fastq.gz
    MissingInputException: Missing input files for rule gunzip:
        output: /results/22-L147/22-L147/extracted_data/R1.fastq
        wildcards: file=/results/22-L147/22-L147/extracted_data/R1, ext=fastq
        affected files:
            /results/22-L147/22-L147/extracted_data/R1.fastq.gz
    CyclicGraphException: Cyclic dependency on rule convert_to_ref.



**Expected behavior**
Having followed the setup tutorial and the SARS-CoV-2 tutorial, I expected an output message either confirming that everything was in the right place in my config file or indicating where I would need to make changes.

**Desktop**
 - OS: Linux
 - Version: not sure? Installed using the quick install script from the tutorial on August 13th 2024

DrYak commented 1 month ago

Hi (and sorry for the slow answer, I was on holiday).

I notice that you're giving absolute paths in your configuration file (beginning with a slash /):

input:
   datadir: /samples/

# …
output:
    datadir: /results/

And thus, V-pipe is trying to read and write files in the root directory of your workstation:

WorkflowError:
MissingInputException: Missing input files for rule sam2bam:
    output: /results/22-L147/22-L147/alignments/REF_aln.bam, /results/22-L147/22-L147/alignments/REF_aln.bam.bai
    wildcards: file=/results/22-L147/22-L147/alignments/REF_aln
    affected files:
       /results/22-L147/22-L147/alignments/REF_aln.sam

See the /results/22-L147/22-L147/… directories above.

I presume you should be using paths relative to your current working directory, like the tutorials do, so without a leading /, e.g.:

input:
   datadir: samples/
#           ^- no '/' here
# …
output:
    datadir: results/
#            ^- no '/' here
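
The effect of the leading slash can be illustrated with a short Python sketch. This is not V-pipe code: the working directory `/home/user/work` and the `resolve` helper are made up for the example, and the sketch only shows how absolute and relative paths resolve in general.

```python
from pathlib import PurePosixPath

# Hypothetical working directory, for illustration only.
WORKDIR = PurePosixPath("/home/user/work")


def resolve(datadir: str) -> str:
    """Return where `datadir` ends up when it is used from WORKDIR."""
    # Joining with an absolute path discards WORKDIR entirely,
    # which is why '/samples/' points at the filesystem root.
    return str(WORKDIR / datadir)


print(resolve("/samples/"))  # absolute: directly under the filesystem root
print(resolve("samples/"))   # relative: inside the working directory
```

With the absolute spelling the result is `/samples`, regardless of where the command is run; with the relative spelling it is a `samples` directory inside the working directory.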

DrYak commented 1 month ago

Another problem is that V-pipe currently doesn't provide any resources for Little Cherry Virus (see here for a list of available resources for viruses).

So this part is not going to work:

input:
    # …
    reference: "{VPIPE_BASEDIR}/resources/LChV-2/reference.fasta"
    genes_gff: "{VPIPE_BASEDIR}/../resources/LChV-2/genomic.gff"

You will need to provide your own and change the configuration file accordingly, for example:

# create a resource directory in the current working directory:
mkdir -p resources/LChV-2/

# copy the files in there
cp …somewhere_where_you_have_the_files…/LChV-2/reference.fasta resources/LChV-2/
cp …somewhere_where_you_have_the_files…/LChV-2/genomic.gff resources/LChV-2/

and then edit the configuration file to point to this new resource directory you created:

input:
    # …
    reference: "resources/LChV-2/reference.fasta"
    genes_gff: "resources/LChV-2/genomic.gff"
    #           ^- no leading '/': search in the current working directory.

(Of course, you could also install the files into your local copy of V-pipe, in which case you would have to add a missing '..', because {VPIPE_BASEDIR} refers to the V-pipe/workflow/ directory, due to a limitation of how Snakemake works.)

input:
    # …
    # '..' missing here --------vv
    reference: "{VPIPE_BASEDIR}/../resources/LChV-2/reference.fasta"
    genes_gff: "{VPIPE_BASEDIR}/../resources/LChV-2/genomic.gff"
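
To see why the '..' matters, here is a small Python sketch. The install location `/home/user/V-pipe` and the `expand` helper are hypothetical, and the plain string replacement only mimics the placeholder substitution (the real mechanism may differ). The one fact taken from the comment above is that {VPIPE_BASEDIR} points at the workflow/ subdirectory.

```python
import posixpath

# Hypothetical install location; per the comment above, {VPIPE_BASEDIR}
# refers to the workflow/ subdirectory of the V-pipe checkout.
VPIPE_BASEDIR = "/home/user/V-pipe/workflow"


def expand(template: str) -> str:
    """Substitute the placeholder and collapse any '..' components."""
    path = template.replace("{VPIPE_BASEDIR}", VPIPE_BASEDIR)
    return posixpath.normpath(path)


without_dotdot = expand("{VPIPE_BASEDIR}/resources/LChV-2/reference.fasta")
with_dotdot = expand("{VPIPE_BASEDIR}/../resources/LChV-2/reference.fasta")

print(without_dotdot)  # stays inside workflow/
print(with_dotdot)     # climbs to the top-level resources/ directory
```

Without the '..', the path stays under workflow/resources/; with it, the path lands in the resources/ directory at the top of the checkout.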

(NOTE: if you decide to modify V-pipe to add support for LChV-2, we would be interested in your pull request)

robertsap commented 1 month ago

Thanks so much for your response! I hope you had a pleasant holiday :)

I made the necessary modifications to the directory paths in my config file; however, I am still getting the same error message.

config file:

    general:
        virus_base_config: ""

    input:
        datadir: samples/
        samples_file: samples.tsv
        reference: "{VPIPE_BASEDIR}/../resources/LChV-2/reference.fasta"
        genes_gff: "{VPIPE_BASEDIR}/../resources/LChV-2/genomic.gff"
        read_length: 150

    output:
        datadir: results/
        trim_primers: false
        snv: true
        local: true
        global: true
        visualization: true
        diversity: true
        QA: true
        upload: false
        dehumanized_raw_reads: false

error message:

    WorkflowError:
    MissingInputException: Missing input files for rule sam2bam:
        output: results/22-L147/22-L147/alignments/REF_aln.bam, results/22-L147/22-L147/alignments/REF_aln.bam.bai
        wildcards: file=results/22-L147/22-L147/alignments/REF_aln
        affected files:
            results/22-L147/22-L147/alignments/REF_aln.sam
    WorkflowError:
    WorkflowError:
    MissingInputException: Missing input files for rule gunzip:
        output: results/22-L147/22-L147/extracted_data/R1.fastq
        wildcards: file=results/22-L147/22-L147/extracted_data/R1, ext=fastq
        affected files:
            results/22-L147/22-L147/extracted_data/R1.fastq.gz
    MissingInputException: Missing input files for rule gunzip:
        output: results/22-L147/22-L147/extracted_data/R1.fastq
        wildcards: file=results/22-L147/22-L147/extracted_data/R1, ext=fastq
        affected files:
            results/22-L147/22-L147/extracted_data/R1.fastq.gz
    CyclicGraphException: Cyclic dependency on rule convert_to_ref.

As for the reference and GFF file locations: I did have them in the 'V-pipe/workflow' directory, so the paths should have worked. However, I did as you recommended and moved them into the 'resources' directory, and changed my config file to reflect the new paths (see above).

Thanks in advance for your patience. I'm a novice on the command line, so there may be something basic that I'm missing.
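
When debugging missing-input errors like the ones above, it may help to check that every row of samples.tsv corresponds to a directory that actually exists. The Python sketch below uses a hypothetical helper, `missing_sample_dirs`, which is not part of V-pipe. It assumes the two-level layout from the V-pipe tutorials, where the two tab-separated columns of samples.tsv name the `<datadir>/<column 1>/<column 2>/raw_data` directory. It rebuilds the directory tree from the report in a temporary directory and compares two possible sets of rows.

```python
import csv
import tempfile
from pathlib import Path


def missing_sample_dirs(datadir, samples_file):
    """List raw_data directories named by samples_file that do not exist.

    Assumption: the first two tab-separated columns of each row map to
    <datadir>/<col1>/<col2>/raw_data (the tutorial layout).
    """
    missing = []
    with open(samples_file, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            raw_data = Path(datadir) / row[0] / row[1] / "raw_data"
            if not raw_data.is_dir():
                missing.append(str(raw_data))
    return missing


# Rebuild the directory layout from the report in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for sample in ("22-L147", "22-L801"):
        (root / "samples" / sample / "230309" / "raw_data").mkdir(parents=True)
    tsv = root / "samples.tsv"

    # Rows as shown in the report: the sample name in both columns.
    tsv.write_text("22-L147\t22-L147\n22-L801\t22-L801\n")
    as_reported = missing_sample_dirs(root / "samples", tsv)

    # Alternative rows: the second column names the 230309 directory.
    tsv.write_text("22-L147\t230309\n22-L801\t230309\n")
    with_dates = missing_sample_dirs(root / "samples", tsv)

print(len(as_reported), "directories missing with the reported rows")
print(len(with_dates), "directories missing with 230309 in column 2")
```

If the layout assumption holds, the reported rows point at samples/22-L147/22-L147/raw_data, whereas the tree listing shows samples/22-L147/230309/raw_data; this would be consistent with the results/22-L147/22-L147/ paths in the error message.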