cbg-ethz / V-pipe

V-pipe is a pipeline designed for analysing NGS data of short viral genomes
https://cbg-ethz.github.io/V-pipe/
Apache License 2.0

Getting snake error #120

Open dstcr opened 2 years ago

dstcr commented 2 years ago

    VPIPE_BASEDIR = /opt/V-dock/V-pipe/workflow
    Importing legacy configuration file vpipe.config
    MissingSectionHeaderError in line 107 of /opt/V-dock/V-pipe/workflow/rules/common.smk:
    File contains no section headers.
    file: 'vpipe.config', line: 1
    'general:\n'
      File "/opt/V-dock/V-pipe/workflow/Snakefile", line 12, in
      File "/opt/V-dock/V-pipe/workflow/rules/common.smk", line 259, in
      File "/opt/V-dock/V-pipe/workflow/rules/common.smk", line 170, in process_config
      File "/opt/V-dock/V-pipe/workflow/rules/common.smk", line 107, in load_legacy_ini
      File "/opt/conda/envs/snakemake/lib/python3.10/configparser.py", line 698, in read
      File "/opt/conda/envs/snakemake/lib/python3.10/configparser.py", line 1086, in _read

dstcr commented 2 years ago

Update: this seems to resolve when changing to a config.yaml file instead of a .config... but I still get an error:

    VPIPE_BASEDIR = /opt/V-dock/V-pipe/workflow
    Using base configuration virus HIV
    WARNING: No samples found in samples/. Not generating config/samples.tsv.
    WARNING: Sample list file config/samples.tsv not found.
    Building DAG of jobs...
    Job stats:
    job      count    min threads    max threads
    all      1        1              1
    total    1        1              1

    [Sat Jan 22 10:37:15 2022]
    localrule all:
        jobid: 0
        resources: tmpdir=/tmp

    Job stats:
    job      count    min threads    max threads
    all      1        1              1
    total    1        1              1

    This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

I have a folder called 'samples' in the directory I'm working in, why doesn't it see it?

DrYak commented 2 years ago

Regarding the first part:

File contains no section headers. file: 'vpipe.config', line: 1 'general:\n'

vpipe.config files use the old INI-like config format that Python's standard config parser expects. So instead of the YAML style:

general:
   x: y

you would have had to use the INI style:

[general]
x=y

but modern Snakemake encourages using YAML or JSON files instead, so renaming your configuration to config.yaml and keeping the YAML format is the most standards-compliant option.
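For the curious, the original MissingSectionHeaderError can be reproduced with nothing but Python's standard configparser (this is plain stdlib behaviour, not V-pipe code):

```python
# Python's configparser requires [section] headers; a YAML-style
# "general:" line triggers exactly the MissingSectionHeaderError above.
import configparser

cp = configparser.ConfigParser()
try:
    cp.read_string("general:\n  x: y\n")
except configparser.MissingSectionHeaderError as e:
    print(type(e).__name__)  # MissingSectionHeaderError

# With a proper [section] header, the same parser is happy:
cp.read_string("[general]\nx = y\n")
print(cp["general"]["x"])    # y
```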

You can find out more about the configuration in the config's README.md, and in the manual found in config/config.html in your local installation.

dstcr commented 2 years ago

thank you! That makes sense.

Any thoughts on the samples directory issue?

DrYak commented 2 years ago

Regarding the latter part:

Using base configuration virus HIV
WARNING: No samples found in samples/. Not generating config/samples.tsv.
WARNING: Sample list file config/samples.tsv not found.

First note:

Second note:

Please check using tree samples (on Linux) or something equivalent (on most Unices you can use find samples to list all its contents).

Again the README.md inside config/ and the config/config.html file can help you.

dstcr commented 2 years ago

Hmm ok, thanks for that- I'll alter the directories.

I'm using the docker version- does this still apply? Should I be using docker in ubuntu or can I use it in the terminal?

DrYak commented 2 years ago

Hmm ok, thanks for that- I'll alter the directories.

As of a few days ago, the latest version contains a few utilities that might help you automatically build this structure and give you a TSV that you can append to your samples.tsv file.

Check the samples mass importers section of the utils' README.md.

I'm using the docker version- does this still apply?

Yes, the config, the samples/ directory (and optionally the TSV if you pass one) should all be supplied as part of the directory that you mount with the -v volume parameter when running the Docker image.

Should I be using docker in ubuntu or can I use it in the terminal?

I am not 100% sure I understand your question.

I've tested the docker command from a terminal as written in the README.md

dstcr commented 2 years ago

Thank you very much for that-

Still getting errors tho :(

    VPIPE_BASEDIR = /opt/V-dock/V-pipe/workflow
    Using base configuration virus HIV
    AttributeError in line 230 of /opt/V-dock/V-pipe/workflow/rules/common.smk:
    'str' object has no attribute 'setdefault'
      File "/opt/V-dock/V-pipe/workflow/Snakefile", line 12, in
      File "/opt/V-dock/V-pipe/workflow/rules/common.smk", line 259, in
      File "/opt/V-dock/V-pipe/workflow/rules/common.smk", line 230, in process_config
      File "/opt/conda/envs/snakemake/lib/python3.10/site-packages/jsonschema/validators.py", line 250, in validate
      File "/opt/conda/envs/snakemake/lib/python3.10/site-packages/jsonschema/validators.py", line 226, in iter_errors
      File "/opt/conda/envs/snakemake/lib/python3.10/site-packages/jsonschema/_validators.py", line 332, in properties
      File "/opt/conda/envs/snakemake/lib/python3.10/site-packages/jsonschema/validators.py", line 242, in descend
      File "/opt/conda/envs/snakemake/lib/python3.10/site-packages/jsonschema/validators.py", line 226, in iter_errors

dstcr commented 2 years ago

Update:

I created an empty folder 'config' and it worked!

Thank you very much

DrYak commented 2 years ago

Though you have managed to make it work, I would be interested in your YAML and TSV files (if you don't mind sharing them) so I can see what was triggering the error message, with the aim to eventually fix it or make the message more expressive.

dstcr commented 2 years ago

It runs, but now I get this error (sorry for all this):

    Error in rule hmm_align:
        jobid: 3
        output: results/patient1/76486548/alignments/full_aln.sam, results/patient1/76486548/alignments/rejects.sam, results/patient1/76486548/references/ref_ambig.fasta, results/patient1/76486548/references/ref_majority.fasta
        log: results/patient1/76486548/alignments/ngshmmalign.out.log, results/patient1/76486548/alignments/ngshmmalign.err.log (check log file(s) for error message)
        conda-env: /opt/V-dock/conda_envs/392127d397a85e204df25555b5464461
        shell:

        CONSENSUS_NAME=results/patient1/76486548
        CONSENSUS_NAME="${CONSENSUS_NAME#*/}"
        CONSENSUS_NAME="${CONSENSUS_NAME//\//-}"

        # 1. clean previous run
        rm -rf   results/patient1/76486548/alignments
        rm -f    results/patient1/76486548/references/ref_ambig.fasta
        rm -f    results/patient1/76486548/references/ref_majority.fasta
        mkdir -p results/patient1/76486548/alignments
        mkdir -p results/patient1/76486548/references

        # 2. perform alignment # -l = leave temps
        ngshmmalign -v  -R results/patient1/76486548/references/initial_consensus.fasta -o results/patient1/76486548/alignments/full_aln.sam -w results/patient1/76486548/alignments/rejects.sam -t 1 -N "${CONSENSUS_NAME}"  results/patient1/76486548/preprocessed_data/R1.fastq results/patient1/76486548/preprocessed_data/R2.fastq > results/patient1/76486548/alignments/ngshmmalign.out.log 2> >(tee results/patient1/76486548/alignments/ngshmmalign.err.log >&2)

        # 3. move references into place
        mv results/patient1/76486548/{alignments,references}/ref_ambig.fasta
        mv results/patient1/76486548/{alignments,references}/ref_majority.fasta

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

    Shutting down, this might take some time.
    Exiting because a job execution failed. Look above for error message
    Complete log: /work/.snakemake/log/2022-01-22T165243.821355.snakemake.log

The alignment error log file gives this:

    Warning: en_US.UTF-8 could not be imbued, this is likely due to a missing locale on your system
    terminate called without an active exception

dstcr commented 2 years ago

I'm using this config file:

general:
  virus_base_config: hiv

input:
  datadir: samples/

output:
  datadir: results
  snv: true
  local: true
  global: false
  visualization: true
  QA: true

dstcr commented 2 years ago

Also getting a lot of 'bad' sequences

    Input and filter stats:
        Input sequences (file 1): 89,385
        Input bases (file 1): 9,028,054
        Input mean length (file 1): 101.00
        Input sequences (file 2): 89,385
        Input bases (file 2): 9,075,832
        Input mean length (file 2): 101.54
        Good sequences (pairs): 0
        Good sequences (singletons file 1): 0 (0.00%)
        Good sequences (singletons file 2): 0 (0.00%)
        Bad sequences (file 1): 89,385 (100.00%)
        Bad bases (file 1): 9,028,054
        Bad mean length (file 1): 101.00
        Bad sequences (file 2): 0 (0.00%)
    Sequences filtered by specified parameters:
        trim_qual_left: 48334
        min_len: 130436

DrYak commented 2 years ago

Okay, I see what is happening (though I would need the content of the samples.tsv and a little bit more of the logs to be sure). The filtering step (PrinSeq) is mis-filtering most of your reads:

Input and filter stats:
…
Input mean length (file 1): 101.00
…
Input mean length (file 2): 101.54
…
Good sequences (pairs): 0

Your files are paired-end, with reads about 100 bp long.

By default, V-pipe assumes a read length of 250 bp (and if you look a few lines higher in the log, you should see something along the lines of):

The length cutoff is: 200

Because every single read is shorter than the cut-off, every single one is considered bad by PrinSeq.
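A back-of-the-envelope sketch of that relationship (the 0.8 fraction is an assumption inferred from the numbers in this thread, 200 out of 250, not a quote of V-pipe's actual code):

```python
# Hypothetical model of the PrinSeq min_len cutoff: a fixed fraction of
# the configured read length (0.8 inferred from 200/250 in this thread).
def length_cutoff(read_length, fraction=0.8):
    return int(read_length * fraction)

print(length_cutoff(250))  # 200 -> matches "The length cutoff is: 200"
print(length_cutoff(100))  # 80  -> would admit the ~100 bp reads above
```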

(You can find a similar problem to yours, but with more logging, in #119.)

You should specify the expected read length, either by adding a third column in your TSV file, or by setting the default parameter in the configuration: section "input:" -> property "read_length:".
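For instance, for the ~100 bp reads in the log above, the configuration route would look like this minimal sketch (check the config's README.md for the authoritative spelling):

```yaml
input:
  read_length: 100
```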

There is a README.md in config/ that introduces how to configure V-pipe. You can also find a full manual in config/config.html if you download the repo locally (online visualization of the file is currently broken).

An older walk-through about configuring V-pipe can be found in the SARS-CoV-2 tutorial - it's a bit older but gives an overview of the needed steps (there's a section in the config's README.md pointing out the changes in more recent versions of V-pipe).

If you use the mass importers in the utils/ directory, they will write the read lengths in the third column, either trying to automagically determine them (in the case of sort_samples_dumb) or deriving them from the sequencing parameters (in the case of sort_samples_demultiplexstat and sort_samples_jobinfo).

carolinasisco commented 2 years ago

Update:

I created a empty folder 'config' and it worked!

Thank you very much

Hi, did you create this folder in your working directory? I'm getting an error message like the one you described.

DrYak commented 2 years ago

Your files are pair-ended and contain 100bp per read.

By default, V-pipe assumes a length of 250bp. (and if you look a few lines higher in the log you should see something along the lines of:) …

Hi, we have updated the config's README section about the samples TSV to make this clearer.

DrYak commented 2 years ago

@carolinasisco:

Hi, did you create this folder in your working directory?

How are you running V-pipe? In general all the files need to go in the working directory: either the directory from which you run ./vpipe, or the directory that you mount with Docker's -v volume option.

It should look something like this:

(current directory)
├── config/
│   ├── config.yaml
│   └── samples.tsv
│
├── samples/
│   ├── patient1/
│   │   ├── 20100113/
│   │   │   └── raw_data/
│
│   … (all the samples) …
│
└── results/
    ├── patient1/
    │   ├── 20100113/

    … (V-pipe writes all its outputs here) …

(We're basically following Snakemake's Distribution and Reproducibility recommendations.)

carolinasisco commented 2 years ago

Hi,

  1. I'm running V-pipe with this command: ./vpipe --jobs 4 --printshellcmds --dry-run

  2. I created the samples directory following the structure listed.

  3. I'm not giving the samples.tsv since it is optional; I'm stating the datadir in the config file. My config file looks like this:

name: SARS-CoV-2

general:
  aligner: "bwa"

input:
  datadir: "{VPIPE_BASEDIR}/../samples/"
  read_length: 151
  reference: "{VPIPE_BASEDIR}/../references/MT3502821.fasta"
  metainfo_file: "{VPIPE_BASEDIR}/../references/metainfo.yaml"
  gff_directory: "{VPIPE_BASEDIR}/../references/gffs/"
  primers_file: "{VPIPE_BASEDIR}/../references/primers/nCoV-2019.tsv"
  phylogeny_data: "{VPIPE_BASEDIR}/../resources/sars-cov-2/phylogeny/selected_covid_sequences.fasta"

frameshift_deletions_checks:
  genes_gff: "{VPIPE_BASEDIR}/../references/gffs/MT350282.1.gb"

snv:
  consensus: false

lofreq:
  consensus: false

DrYak commented 2 years ago

I´m not giving the samples.tsv since is optional, I´m stating in the config file the datadir,

Yes, especially since you're setting the read length in your config (read_length: 151), you can indeed skip it, and V-pipe should automatically generate one for you in config/samples.tsv.

my config file looks like this:

input:
  datadir: "{VPIPE_BASEDIR}/../samples/"

Oh, sorry, this is my fault: my instructions weren't clear enough. I see what you did: you edited a virus base config (probably SARS-CoV-2's), trying to adapt it.

The thing is that all the virus base configs we provide use files (references, etc.) which are stored in V-pipe itself, in its subdirectory resources/. That's what this {VPIPE_BASEDIR} part is: it points into the V-pipe installation directory (starting from the workflow/ subdirectory where all the Snakemake files are). So, instead of looking in your current working directory, this is trying to look for files in the directory where V-pipe was installed. (If I understand correctly, you have a directory references/ in your current working directory with some files in it.)
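To make the difference concrete, here is a small illustration (the paths are hypothetical, borrowed from the logs earlier in this thread):

```python
# "{VPIPE_BASEDIR}/../x" resolves relative to the installation directory,
# not the working directory from which you launch V-pipe.
import os

VPIPE_BASEDIR = "/opt/V-dock/V-pipe/workflow"   # the installation
path = os.path.normpath(VPIPE_BASEDIR + "/../references/MT3502821.fasta")
print(path)  # /opt/V-dock/V-pipe/references/MT3502821.fasta (inside the install)

# whereas a plain relative path resolves against the working directory:
print(os.path.join(os.getcwd(), "references/MT3502821.fasta"))
```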

You should instead have written something like:

general:
  aligner: "bwa"

input:
  datadir: "samples/"
  read_length: 151
  reference: "references/MT3502821.fasta"
  metainfo_file: "references/metainfo.yaml"
  gff_directory: "references/gffs/"
  primers_file: "references/primers/nCoV-2019.tsv"
  phylogeny_data: "{VPIPE_BASEDIR}/../resources/sars-cov-2/phylogeny/selected_covid_sequences.fasta"

frameshift_deletions_checks:
  genes_gff: "references/gffs/MT350282.1.gb"

snv:
  consensus: false

lofreq:
  consensus: false

An alternative would be to load the SARS-CoV-2 virus base config, and then only replace the properties that you need to change, leaving the defaults from the virus base config for the other options. The following will use the files in your working directory's subdirectory references/ and set the read_length, but leave the defaults for everything else:

general:
  virus_base_config: "sars-cov-2"

input:
  read_length: 151
  reference: "references/MT3502821.fasta"
  metainfo_file: "references/metainfo.yaml"
  gff_directory: "references/gffs/"
  primers_file: "references/primers/nCoV-2019.tsv"

frameshift_deletions_checks:
  genes_gff: "references/gffs/MT350282.1.gb"

carolinasisco commented 2 years ago

Hi, I managed to start running the program with the changes you suggested on the config file. (Thanks!)

I see that in my results/SNVs folder the shorah.log files show no errors yet, but the pipeline has been showing a message of "Activating conda environment: /home/geninfo/msisco/V-pipe/workflow/.snakemake/conda/6b4b4f9b39457eac0d31ae59ec8c6ce6" for a few hours, without any changes... is this normal?

The REGION_1 folder shows the reads-support.fast files but is not showing any new files or the final snvs.vcf file

Again, thank you for taking the time to answer all these questions.

DrYak commented 2 years ago

but the pipeline has been showing a message of "Activating conda environment: /home/geninfo/msisco/V-pipe/workflow/.snakemake/conda/6b4b4f9b39457eac0d31ae59ec8c6ce6" for a few hours, without any changes... is this normal?

Yes, ShoRAH doesn't output much on the console.

I see that in my results/SNVs folder the shorah.log files show no error yet

Indeed, that file is the best place to track the output of ShoRAH. Depending on how many CPU cores you have set up (section snv -> property threads) it may take some time to finish. It's an embarrassingly parallel type of computation: each window is independent of the others, and ShoRAH's Dirichlet sampler can be run independently in parallel, as long as you provide enough CPU cores and have enough memory (in my experience roughly 1~2 GB per thread, depending on coverage).
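For example, allocating 16 threads would look like this sketch (section and property names as described above; check config/config.html for the authoritative reference):

```yaml
snv:
  threads: 16    # budget roughly 16-32 GB RAM at 1-2 GB per thread
```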

Note that currently there's an issue in ShoRAH: it might lose SNVs that happen close to the boundaries of amplicons if all their reads start at exactly the same position. This will be fixed in a version of ShoRAH coming in the next month. Shotgun-type sequencing (where amplicons are fragmented before making libraries, and the reads' starts are randomly distributed along the genome) is not affected by this bug.


In the meantime, lofreq can be an alternative for more speed, or to circumvent the above-mentioned bug.


Another tool that might be interesting to experiment with is cojac, though it's not (yet) integrated into V-pipe and needs to be run separately (you can provide it a V-pipe sample TSV file and an output directory which contains the alignment BAMs).

Try the dev branch, as we haven't released version 0.2 on bioconda yet.

This is the tool that we use in our wastewater surveillance. As it relies only on alignments, it is much faster than running an SNV caller.

(Though as it only relies on co-occurrence of selected mutations, it isn't as high-confidence as ShoRAH, which relies on entire local haplotypes - but in our practical experience with the surveillance, this is good enough for early detection of variants.)

carolinasisco commented 2 years ago

Hello,

I ran the pipeline successfully; it lasted around 18 days. Since I was doing a test run I only set it to obtain local haplotypes, and now I would like to obtain the fully assembled haplotypes. I read that one can produce intermediate results by setting the target rule, e.g. ./vpipe minor_variants. What's the rule for haplotypes? I tried "haploclique" but it did not work. Thanks!

DrYak commented 2 years ago

I ran the pipeline successfully, it lasted around 18 days.

Yes, it can take time. For future runs, I would suggest running it on a cluster or on a very large workstation. As an example, in our research (e.g. the paper I've mentioned above) we run ShoRAH with 64 threads, which should consume up to 128 GiB RAM. The compute nodes on our cluster have 128 cores (2x AMD Threadripper, each with 64 threads) and 512 GiB of RAM. This makes things much faster.

and now would like to obtain the fully assembled haplotypes

Keep in mind that SARS-CoV-2 doesn't have that many mutations, and as Illumina's sequencing is short-read based, you might not have enough information to distinguish haplotypes and match neighbouring reads (i.e.: there are whole regions where all the variants have similar reads). Building global haplotypes might not be possible. But it's worth trying.

I read that one can produce intermediate results by setting the target rule: ex. ./vpipe minor_variants What´s the rule for haplotypes? I tried "haploclique" but did not work

So, the rules themselves are indeed called haploclique, savage and predicthaplo. But snakemake is probably going to complain: these aren't rules which produce a single determined file (like minor_variants, which generates a single central TSV with all the minor variants); they are wildcard-based rules (e.g.: they specify how to make a .fasta file from a .bam file).

If you want to directly invoke the rule, you would need to give an exact output that it must generate, e.g.:

./vpipe results/patient1/20100113/variants/global/quasispecies.fasta

You could obtain something roughly similar by turning on the haplotype reconstruction, adding these options to the relevant sections of your configuration YAML file (as usual, check config/config.html for reference):

general:
  haplotype_reconstruction: haploclique

output:
  global: true

And then use the --until option of snakemake:

./vpipe --until haploclique

carolinasisco commented 2 years ago

Hi,

Thanks for your kind reply. I tried to run savage but it gives me this error:

    PaddingError: Placeholder of length '80' too short in package /home/geninfo/msisco/V-pipe/workflow/.snakemake/conda/a5fe929f836a61b822bde39a31846bbe/bin/Rscript.
    The package must be rebuilt with conda-build > 2.0.

I read that a possible workaround is to use "--no-build-id" but I'm not sure how to.