dorbarker / voc-identify

Identify Variants of Concern from metagenomic samples of SARS-CoV-2
MIT License
12 stars 2 forks source link

issue runnign snakemake workflow voc-identify.smk #32

Closed ibseq closed 3 years ago

ibseq commented 3 years ago

Hi Dillon any more info as how to run the the snakemake workflow voc-identify.smk? I've installed snakemake but nto able to run the voc-identify.smk

thanks ibseq

dorbarker commented 3 years ago

I'm happy to help, but it's not clear to me what problems you are experiencing with the workflow.

If you have invoked the workflow, but you're getting errors from it, then providing logs would be helpful for debugging. If instead you don't know how to use snakemake, then I would direct you to the snakemake tutorial.

Also, snakemake is not really necessary for using mmmvi. It's there as a convenience rather than a requirement. You can always run mmmvi directly on your samples, and I've provided examples for this in the readme.

ibseq commented 3 years ago

Hi Dillon thanks for the reply. I have created a directory with two folders: data and bams as per description. from there I run the command above. the log file after running the command simply list exactly what I copied in the issue above: I followed that path and there isn't much. If you don't use the smk file, how do you run multiple bams? the error is described in this issue https://github.com/dorbarker/voc-identify/issues/33

Also, two days ago I was able to test mmmvi on one sample (successfully), I tried again today and it doesn't work anymore. I can open a separate ticket if required?

so basically, I am not able to run it either way, with one bam file or multiple. I've re-installed mmmvi thanks ibseq

dorbarker commented 3 years ago

Here's an example that I just ran using version 0.8.2 freshly installed from conda.

This is the exact directory structure I ran it with.

(mmmvi) dbarker@worklaptop:~/Desktop$ tree
.
└── ibseq_github_help
    ├── bams
    │   ├── test-illumina.bam
    │   ├── test-illumina.bam.bai
    │   ├── test-nanopore.bam
    │   └── test-nanopore.bam.bai
    └── data
        ├── mutations.tsv
        └── reference.fasta

3 directories, 4 files

Below is the snakemake command as it was invoked on my computer.

snakemake -j 2 -s ~/src/voc-identify/voc-identify.smk -d ~/Desktop/ibseq_github_help/ --configfile ~/src/voc-identify/example-config.yaml

And this is the output generated by snakemake.

(mmmvi) dbarker@worklaptop:~/Desktop$ tree
.
└── ibseq_github_help
    ├── bams
    │   ├── test-illumina.bam
    │   ├── test-illumina.bam.bai
    │   ├── test-nanopore.bam
    │   └── test-nanopore.bam.bai
    ├── data
    │   ├── mutations.tsv
    │   └── reference.fasta
    └── reports
        ├── test-illumina
        │   ├── cooccurrence_matrices
        │   │   ├── absolute
        │   │   │   ├── B.1.1.7.txt
        │   │   │   └── P.1.txt
        │   │   └── relative
        │   │       ├── B.1.1.7.txt
        │   │       └── P.1.txt
        │   ├── read_report.txt
        │   ├── read_species.txt
        │   └── summary.txt
        └── test-nanopore
            ├── cooccurrence_matrices
            │   ├── absolute
            │   │   ├── B.1.1.7.txt
            │   │   └── P.1.txt
            │   └── relative
            │       ├── B.1.1.7.txt
            │       └── P.1.txt
            ├── read_report.txt
            ├── read_species.txt
            └── summary.txt

12 directories, 20 files

I did actually catch a small bug in the course of creating this example, although it is unrelated to the troubles you were having. I hope this helps, and I'll get that other unrelated bug fixed right away.

ibseq commented 3 years ago

Thanks Dillon.I will look again at the directory structure and re run it then

BW ibseq

Sent from my iPhone

On 22 Jun 2021, at 23:54, Dillon Barker @.***> wrote:

 Here's an example that I just ran using version 0.8.2 freshly installed from conda.

This is the exact directory structure I ran it with.

(mmmvi) @.***:~/Desktop$ tree . └── ibseq_github_help ├── bams │ ├── test-illumina.bam │ ├── test-illumina.bam.bai │ ├── test-nanopore.bam │ └── test-nanopore.bam.bai └── data ├── mutations.tsv └── reference.fasta

3 directories, 4 files Below is the snakemake command as it was invoked on my computer.

snakemake -j 2 -s ~/src/voc-identify/voc-identify.smk -d ~/Desktop/ibseq_github_help/ --configfile ~/src/voc-identify/example-config.yaml And this is the output generated by snakemake.

(mmmvi) @.***:~/Desktop$ tree . └── ibseq_github_help ├── bams │ ├── test-illumina.bam │ ├── test-illumina.bam.bai │ ├── test-nanopore.bam │ └── test-nanopore.bam.bai ├── data │ ├── mutations.tsv │ └── reference.fasta └── reports ├── test-illumina │ ├── cooccurrence_matrices │ │ ├── absolute │ │ │ ├── B.1.1.7.txt │ │ │ └── P.1.txt │ │ └── relative │ │ ├── B.1.1.7.txt │ │ └── P.1.txt │ ├── read_report.txt │ ├── read_species.txt │ └── summary.txt └── test-nanopore ├── cooccurrence_matrices │ ├── absolute │ │ ├── B.1.1.7.txt │ │ └── P.1.txt │ └── relative │ ├── B.1.1.7.txt │ └── P.1.txt ├── read_report.txt ├── read_species.txt └── summary.txt

12 directories, 20 files I did actually catch a small bug in the course of creating this example, although it is unrelated to the troubles you were having. I hope this helps, and I'll get that other unrelated bug fixed right away.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

ibseq commented 3 years ago

Hi Dillon I gave see below new run: it did something definitely, but ended up with errors:

(mmmvi) @.***:~/MMMVI$ snakemake -j 2 -s ibseq/MMMVI/voc-identify/voc-identify.smk -d ibseq/MMMVI/mmmvi_wwc019/ --configfile ibseq/MMMVI/voc-identify/example-config.yaml Building DAG of jobs... Using shell: /bin/bash Provided cores: 2 Rules claiming more threads will be scaled

down. Job counts: count jobs 1 all 264 find_vocs 265

[Wed Jun 23 08:29:32 2021] rule find_vocs: input: bams/7539_8942_S182.mapped.primertrimmed.sorted.bam, bams/7539_8942_S182.mapped.primertrimmed.sorted.bam.bai output: reports/7539_8942_S182.mapped.primertrimmed.sorted/summary.txt jobid: 179 wildcards: sample=7539_8942_S182.mapped.primertrimmed.sorted

[Wed Jun 23 08:29:32 2021] rule find_vocs: input: bams/7539_VIR1L_2107014_31_02_S47.mapped.primertrimmed.sorted.bam, bams/7539_VIR1L_2107014_31_02_S47.mapped.primertrimmed.sorted.bam.bai output: reports/7539_VIR1L_2107014_31_02_S47.mapped.primertrimmed.sorted/summary.txt jobid: 447 wildcards: sample=7539_VIR1L_2107014_31_02_S47.mapped.primertrimmed.sorted

2021-06-23 08:29:33 Begin Traceback (most recent call last): File "ibseq/miniconda2/envs/mmmvi/bin/mmmvi", line 10, in sys.exit(main()) File "ibseq/miniconda2/envs/mmmvi/lib/python3.8/site-packages/mmmvi/mmmvi.py", line 91, in main vocs = load_data.load_mutations( File "ibseq/miniconda2/envs/mmmvi/lib/python3.8/site-packages/mmmvi/lib/load_data.py", line 127, in load_mutations data = pd.read_csv(mutations_path, sep=delimiter) File "ibseq/miniconda2/envs/mmmvi/lib/python3.8/site-packages/pandas/io/parsers.py", line 610, in read_csv return _read(filepath_or_buffer, kwds) File "ibseq/miniconda2/envs/mmmvi/lib/python3.8/site-packages/pandas/io/parsers.py", line 468, in _read return parser.read(nrows) File "ibseq/miniconda2/envs/mmmvi/lib/python3.8/site-packages/pandas/io/parsers.py", line 1057, in read index, columns, col_dict = self._engine.read(nrows) File "ibseq/miniconda2/envs/mmmvi/lib/python3.8/site-packages/pandas/io/parsers.py", line 2061, in read data = self._reader.read(nrows) File "pandas/_libs/parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read File "pandas/_libs/parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory File "pandas/_libs/parsers.pyx", line 827, in pandas._libs.parsers.TextReader._read_rows File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows File "pandas/_libs/parsers.pyx", line 1951, in pandas._libs.parsers.raise_parser_error pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 105, saw 3

2021-06-23 08:29:33 Begin Traceback (most recent call last): File "ibseq/miniconda2/envs/mmmvi/bin/mmmvi", line 10, in sys.exit(main()) File "ibseq/miniconda2/envs/mmmvi/lib/python3.8/site-packages/mmmvi/mmmvi.py", line 91, in main vocs = load_data.load_mutations( File "ibseq/miniconda2/envs/mmmvi/lib/python3.8/site-packages/mmmvi/lib/load_data.py", line 127, in load_mutations data = pd.read_csv(mutations_path, sep=delimiter) File "ibseq/miniconda2/envs/mmmvi/lib/python3.8/site-packages/pandas/io/parsers.py", line 610, in read_csv return _read(filepath_or_buffer, kwds) File "ibseq/miniconda2/envs/mmmvi/lib/python3.8/site-packages/pandas/io/parsers.py", line 468, in _read return parser.read(nrows) File "ibseq/miniconda2/envs/mmmvi/lib/python3.8/site-packages/pandas/io/parsers.py", line 1057, in read index, columns, col_dict = self._engine.read(nrows) File "ibseq/miniconda2/envs/mmmvi/lib/python3.8/site-packages/pandas/io/parsers.py", line 2061, in read data = self._reader.read(nrows) File "pandas/_libs/parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read File "pandas/_libs/parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory File "pandas/_libs/parsers.pyx", line 827, in pandas._libs.parsers.TextReader._read_rows File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows File "pandas/_libs/parsers.pyx", line 1951, in pandas._libs.parsers.raise_parser_error pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 105, saw 3

[Wed Jun 23 08:29:33 2021] Error in rule find_vocs: jobid: 179 output: reports/7539_8942_S182.mapped.primertrimmed.sorted/summary.txt shell: mmmvi --bam bams/7539_8942_S182.mapped.primertrimmed.sorted.bam --reference data/reference.fasta --mutations data/mutations.tsv --outdir reports/7539_8942_S182.mapped.primertrimmed.sorted/ --only-vocs P.1 B.1.1.7 B.1.351 P.2 (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[Wed Jun 23 08:29:33 2021] Error in rule find_vocs: jobid: 447 output: reports/7539_VIR1L_2107014_31_02_S47.mapped.primertrimmed.sorted/summary.txt shell: mmmvi --bam bams/7539_VIR1L_2107014_31_02_S47.mapped.primertrimmed.sorted.bam --reference data/reference.fasta --mutations data/mutations.tsv --outdir reports/7539_VIR1L_2107014_31_02_S47.mapped.primertrimmed.sorted/ --only-vocs P.1 B.1.1.7 B.1.351 P.2 (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: ibseq/MMMVI/mmmvi_wwc019/.snakemake/log/2021-06-23T082931.728452.snakemake.log (mmmvi) @.***:~/MMMVI$ tree bash: tree: command not found

(mmmvi) @.:~/MMMVI$ ls -l mmmvi_wwc019/ total 312 drwxr-xr-x 2 ibseq ibseq 531 Jun 22 22:16 bams drwxr-xr-x 2 ibseq ibseq 4 Jun 23 08:26 data (mmmvi) @.:~/MMMVI$ ls -l mmmvi_wwc019/data/ total 38 -rw-r--r-- 1 ibseq ibseq 30427 Jun 21 15:45 MN908947.3.fasta -rw-r--r-- 1 ibseq ibseq 3650 Jun 23 08:26 mutations.tsv (mmmvi) @.***:~/MMMVI$

The bam file in the error line seems fine: samtools view -H mmmvi_wwc019/bams/7539_8942_S182.mapped.primertrimmed.sorted.bam @HD VN:1.6 SO:coordinate @SQ SN:MN908947.3 LN:29903 @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 4 ref.fa 7539_8942_S182_R1_001_val_1.fq.gz 7539_8942_S182_R2_001_val_2.fq.gz @PG ID:samtools PN:samtools PP:bwa VN:1.10 CL:samtools sort -o 7539_8942_S182.sorted.bam @PG ID:samtools.1 PN:samtools PP:samtools VN:1.10 CL:samtools view -F4 -o 7539_8942_S182.mapped.bam 7539_8942_S182.sorted.bam @PG ID:ivar-trim PN:ivar VN:1.3.1 CL:ivar trim -i 7539_8942_S182.mapped.bam -b primers_V3.bed -m 30 -q 20 -p ivar.out @PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.10 CL:samtools sort -o 7539_8942_S182.mapped.primertrimmed.sorted.bam ivar.out.bam @PG ID:samtools.3 PN:samtools PP:ivar-trim VN:1.10 CL:samtools sort -o 7539_8942_S182.mapped.primertrimmed.sorted.bam ivar.out.bam

samtools view -H mmmvi_wwc019/bams/7539_VIR1L_2107014_31_02_S47.mapped.primertrimmed.sorted.bam @HD VN:1.6 SO:coordinate @SQ SN:MN908947.3 LN:29903 @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 4 ref.fa 7539_VIR1L_2107014_31_02_S47_R1_001_val_1.fq.gz 7539_VIR1L_2107014_31_02_S47_R2_001_val_2.fq.gz @PG ID:samtools PN:samtools PP:bwa VN:1.10 CL:samtools sort -o 7539_VIR1L_2107014_31_02_S47.sorted.bam @PG ID:samtools.1 PN:samtools PP:samtools VN:1.10 CL:samtools view -F4 -o 7539_VIR1L_2107014_31_02_S47.mapped.bam 7539_VIR1L_2107014_31_02_S47.sorted.bam @PG ID:ivar-trim PN:ivar VN:1.3.1 CL:ivar trim -i 7539_VIR1L_2107014_31_02_S47.mapped.bam -b primers_V3.bed -m 30 -q 20 -p ivar.out @PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.10 CL:samtools sort -o 7539_VIR1L_2107014_31_02_S47.mapped.primertrimmed.sorted.bam ivar.out.bam @PG ID:samtools.3 PN:samtools PP:ivar-trim VN:1.10 CL:samtools sort -o 7539_VIR1L_2107014_31_02_S47.mapped.primertrimmed.sorted.bam ivar.out.bam

dorbarker commented 3 years ago

From the traceback, it looks like your mutations file has some malformation on line 105.

I've uploaded known-good example data to the main branch, under the example directory which you can compare against.

ibseq commented 3 years ago

Thanks Dillon. I’ll check what’s on line 105. Also, I realised I called the headers voc and mutation, not like in the example you’ve provided PangoLineage and NucName

BW Ibseq

Sent from my iPhone

On 23 Jun 2021, at 18:05, Dillon Barker @.***> wrote:

 From the traceback, it looks like your mutations file has some malformation on line 105.

I've uploaded known-good example data to the main branch, under the example directory which you can compare against.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

dorbarker commented 3 years ago

PangoLineage and NucName as the defaults is a bit idiosyncratic, for sure. That's only the case because those are what our lab happened to be using at the time I started work on this. The defaults might change to something more general like voc and mutation when I make an interface-breaking change, i.e. version 0.x.x1.0.0.

You can also override these defaults with --mutation-column and --voc-column like:

mmmvi --voc-column voc --mutation-column mutation --bam sample.bam etc etc
ibseq commented 3 years ago

Thanks Dillon, it works only if I use the list of snps in your mutations.tsv Not sure if I am not formatting correctly, but even using yours as a template, it doesn't work with my list of SNPs - attached

any suggestions?

thanks ibseq

snp_list.xlsx

dorbarker commented 3 years ago

Okay, so a few issues here that I can spot:

  1. It's an MS Excel sheet when the input needs to be a delimited text file (CSV, TSV, etc)
  2. You have a bunch of mutations in the form of C733, which is not a supported format
  3. Leading spaces on some mutation strings (although I may add some code to strip superfluous whitespace in a future release)
ibseq commented 3 years ago

Thanks Dillon - I have tsv for the command line but couldn’t upload on GitHub so changed it to xls

Not sure about the c733 form…wrote all snps on same doc and saved it.

What do you suggest?

Thanks! Ibseq

Sent from my iPhone

On 25 Jun 2021, at 20:28, Dillon Barker @.***> wrote:

 Okay, so a few issues here that I can spot:

It's an MS Excel sheet when the input needs to be a delimited text file (CSV, TSV, etc) You have a bunch of mutations in the form of C733, which is not a supported format Leading spaces on some variants (although I may add some code to string superfluous whitespace in a future release) — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

dorbarker commented 3 years ago

I suggest carefully reviewing the mutations file. There are formatting errors in it which will block mmmvi from running.

The just-released latest version, 0.10.0, includes some features which you may find helpful. When loading the tabular mutations file, if it fails to parse a mutation, it will report the faulty string which caused the error. Additionally, it can now use the variant definitions released by Public Health England. As the argument to --mutations you can a path a directory containing their YAML files. You can do this now:

git clone https://github.com/phe-genomics/variant_definitions

mmmvi --mutations variant_definitions/variant_yaml/ etc etc etc
ibseq commented 3 years ago

Thanks Dillon I'm using the list from PHE - the new argument might solve the issues of formatting correctly the tsv

BW ibseq