cbg-ethz / V-pipe

V-pipe is a pipeline designed for analysing NGS data of short viral genomes
https://cbg-ethz.github.io/V-pipe/
Apache License 2.0
132 stars 46 forks source link

Running with tutorial dataset giving all empty file #119

Open wasimbt opened 2 years ago

wasimbt commented 2 years ago

Hi,

I have tried to run the tutorial dataset (SARS-CoV-2) with the command mentioned on the webpage, https://cbg-ethz.github.io/V-pipe/tutorial/sars-cov2/

However, after running successfully the whole script, the generated output files are all empty. Could anyone help me find out where is the problem?

Is it happeining in prinseq? Below is the output of prinseq;


The length cutoff is: 200 [prinseq-lite-0.20.4] [01/21/2022 23:24:18] Executing PRINSEQ with command: "perl prinseq-lite.pl -fastq samples/SRR10903401/20200102/extracted_data/R1.fastq -fastq2 samples/SRR10903401/20200102/extracted_data/R2.fastq -out_format 3 -trim_qual_right 30 -trim_qual_left 30 -trim_qual_type min -trim_qual_rule lt -trim_qual_window 10 -trim_qual_step 1 -log samples/SRR10903401/20200102/preprocessed_data/prinseq.out.log -min_len 200 -min_qual_mean 30 -ns_max_n 4 -out_good samples/SRR10903401/20200102/preprocessed_data/R -out_bad null" [prinseq-lite-0.20.4] [01/21/2022 23:24:18] Parsing and processing input data: "samples/SRR10903401/20200102/extracted_data/R1.fastq" and "samples/SRR10903401/20200102/extracted_data/R2.fastq" [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Done parsing and processing input data [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input sequences (file 1): 476,632 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input bases (file 1): 71,758,102 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 1): 150.55 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input sequences (file 2): 476,632 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input bases (file 2): 71,807,572 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 2): 150.66 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Good sequences (pairs): 0 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Good sequences (singletons file 1): 0 (0.00%) [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Good sequences (singletons file 2): 0 (0.00%) [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad sequences (file 1): 476,632 (100.00%) [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad bases (file 1): 71,758,102 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad mean length (file 1): 150.55 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad sequences (file 2): 0 (0.00%) [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Sequences filtered by specified parameters: [prinseq-lite-0.20.4] [01/21/2022 23:25:05] trim_qual_left: 180134 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] min_len: 773130

Please let me know,

Best regards, Wasim

DrYak commented 2 years ago

Could anyone help me find out where is the problem?

You skipped this part of the the tutorial:

As it is your first run of V-pipe, this will also generate the sample collection table. Check samples.tsv in your editor.

Note that the demo files you downloaded have reads of length 150 only. V-pipe’s default parameters are optimized for reads of > length 250 ; add the third column in the tab-separated file:

SRR10903401   20200102    150
SRR10903402   20200102    150

For more information, in the latest master branch of V-pipe, you can look in file V-pipe/config/config.html, see section "input", parameter "read_length".

Is it happeining in prinseq? Below is the output of prinseq;

Yes. The relevant bits are:

The length cutoff is: 200
…
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 1): 150.55
…
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 2): 150.66

Because there's no length specified (e.g.: in the third column as suggested in the tutorial), the cut-off value is computed from the default length (based on 250bp) and set to ≥200bp.

The read's length (150bp) is shorter than this cut-off, every single read is considered "too short" and filtered out. There are no remaining reads after quality filtering.

Future version of V-pipe will be able to illustrate the stage at which reads were rejected (we currently have custom scripts that haven't been incorporated into V-pipe's master yet).

wasimbt commented 2 years ago

Hi,

Thanks a lot for pointing it out! I somehow missed the information in the tutorial.

Now I have run the tutorial dataset with success. However, when I tried to run my sewage samples dataset (with just two samples), it keep running without an error message and I had to stop it after a few hours. It seems like there is a problem during alignment. Do you think that the pipeline will work for read length of 101 bp ? and where could be the problem according to you?

Best regards, Wasim

DrYak commented 2 years ago

regarding alignments

Check if the previous step (running prinseq in rule preprocessing) has successfully completed and you successfully got the /preprocessed_data/R1.fastq.gz and R2.fastq.gz files.

Which aligner are you using?

Also if you have extremely deep coverage as happens e.g. with Illumina NovaSeq, due to the sheer amount of reads the alignment can take some time, and the SNV calling with ShoRAH can be extremely slow (we are aware of these shortcomings, but it will take quite some time to fix. In the meantime you can fall back to LoFreq as an SNV caller).

regarding the read-length

Also, although we have run V-pipe successfully even on 50bp or 75bp read-length (some of the labs that sequence clinical swab samples do shotgun sequencing), you do need to keep in mind with sewage sequencing is that read-length matters: source: doi:10.1101/2021.05.26.21257861

Other groups have indeed shown that the optimal amplicon length for wastewater seem to be around ~400bp as done by, e.g., the Artic protocols (our monitoring used V3 in the past and has now switched to V4 which has better performance on Omicron and plans to switch to V4.1 as soon as it is available in commercial ready-made kits).

With the much shorter amplicons (e.g. like Nimagen), you might have fewer opportunities for cooccurrences of mutations. (and with much longer amplicons than Artic (e.g. long reads) there might be fewer fragments as large that manage to survive through the wastewater all the way to the sequencing).

This is important to keep in mind if you would want to later run our tool cojac on the output to look for mutation cooccurrences. (note: as of the time being, check its dev branch as we haven't released version 0.2 yet).

Looking for individual mutations:

For all our wastewater sequencing, we have used Artic protocols that produce ~400bp amplicon and sequenced those with paired-end 250bp sequencing in order to cover well each whole amplicon.

DrYak commented 2 years ago

For more information, in the latest master branch of V-pipe, you can look in file V-pipe/config/config.html, see section "input", parameter "read_length".

Because there's no length specified (e.g.: in the third column as suggested in the tutorial), the cut-off value is computed from the default length (based on 250bp) and set to ≥200bp.

The read's length (150bp) is shorter than this cut-off, every single read is considered "too short" and filtered out. There are no remaining reads after quality filtering.

Hi, we have updated the config's readme section about the samples TSV to make this more clear.

wasimbt commented 2 years ago

Hi,

Many thanks for your answers and suggestions! I am using bwa following the SARS-CoV-2 tutorial. Now I could run the whole snakemake script on a sewage sample. The prinseq runs fine (please see the output file attached) however, there is an error during alignment (please see the attached shorah output file). Coverage.TSV file showed more zeros than other digits at positions. Any suggestion/s on how I can improve the coverage/alignment?

Best regards, Wasim

prinseq.out.log shorah.out.log stdout.log

DrYak commented 2 years ago

First, making sure that the alignment step (bwa) went well.

In the output results/{sample}/{date}/alignments (or if you have set your output directory to emulate old V-pipe 1.x/2.x/sars-cov2 behaviour samples/{sample}/{date}/alignments):

If you only have one region that is highly covered (i.e.: only a few fragment of the virus reached all the way to sequencing) there might by some wet-lab problems (in our experience, the RNA extraction and concentration is a very difficult task. Your lab might want to discuss with our coauthors from Eawag (Swiss Federal Institute of Aquatic Science and Technology) on our paper. Some labs we work with have reported success using the RNA extraction kits produced by Promega).


As soon as you have some good aligner reads, you might want to give a try to our tool cojac.

Use the dev branch as we haven't released a package 0.2 on bioconda yet.


Now regarding SNV calling: what is the exact wet-lab protocol used?

If it's the later case: there is a corner case in ShoRAH that handles poorly when the local-windows for the local-haplotypes aren't aligned to the boundaries of amplicons, leading to a lot of the reads being lost. The shorter the amplicon, the more this problem is exacerbated. We are aware of that limitation and there's undergoing work by a master student to address it. In the meantime I would suggest using lofreq instead (in your configuration file, section general -> property _snvcaller) and disabling local haplotypes (in your configuration file, section output -> property local). You can find out more about these options in the file config/config.html in your local installation of V-pipe.