Open wasimbt opened 2 years ago
Could anyone help me find out where is the problem?
You skipped this part of the the tutorial:
As it is your first run of V-pipe, this will also generate the sample collection table. Check samples.tsv in your editor.
Note that the demo files you downloaded have reads of length 150 only. V-pipe’s default parameters are optimized for reads of > length 250 ; add the third column in the tab-separated file:
SRR10903401 20200102 150 SRR10903402 20200102 150
For more information, in the latest master branch of V-pipe, you can look in file V-pipe/config/config.html, see section "input", parameter "read_length".
Is it happeining in prinseq? Below is the output of prinseq;
Yes. The relevant bits are:
The length cutoff is: 200
…
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 1): 150.55
…
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 2): 150.66
Because there's no length specified (e.g.: in the third column as suggested in the tutorial), the cut-off value is computed from the default length (based on 250bp) and set to ≥200bp.
The read's length (150bp) is shorter than this cut-off, every single read is considered "too short" and filtered out. There are no remaining reads after quality filtering.
Future version of V-pipe will be able to illustrate the stage at which reads were rejected (we currently have custom scripts that haven't been incorporated into V-pipe's master yet).
Hi,
Thanks a lot for pointing it out! I somehow missed the information in the tutorial.
Now I have run the tutorial dataset with success. However, when I tried to run my sewage samples dataset (with just two samples), it keep running without an error message and I had to stop it after a few hours. It seems like there is a problem during alignment. Do you think that the pipeline will work for read length of 101 bp ? and where could be the problem according to you?
Best regards, Wasim
Check if the previous step (running prinseq in rule preprocessing) has successfully completed and you successfully got the /preprocessed_data/R1.fastq.gz
and R2.fastq.gz
files.
Which aligner are you using?
bwa
as recommended in the SARS-CoV-2 tutorial and in the latest V-pipe's sars-cov-2
virus base config?ngshmmalign
? this later is way overkill for a virus that has a relatively low mutation rate (compare to HIV for which that aligner has been specially created).Also if you have extremely deep coverage as happens e.g. with Illumina NovaSeq, due to the sheer amount of reads the alignment can take some time, and the SNV calling with ShoRAH can be extremely slow (we are aware of these shortcomings, but it will take quite some time to fix. In the meantime you can fall back to LoFreq as an SNV caller).
Also, although we have run V-pipe successfully even on 50bp or 75bp read-length (some of the labs that sequence clinical swab samples do shotgun sequencing), you do need to keep in mind with sewage sequencing is that read-length matters: source: doi:10.1101/2021.05.26.21257861
Other groups have indeed shown that the optimal amplicon length for wastewater seem to be around ~400bp as done by, e.g., the Artic protocols (our monitoring used V3 in the past and has now switched to V4 which has better performance on Omicron and plans to switch to V4.1 as soon as it is available in commercial ready-made kits).
With the much shorter amplicons (e.g. like Nimagen), you might have fewer opportunities for cooccurrences of mutations. (and with much longer amplicons than Artic (e.g. long reads) there might be fewer fragments as large that manage to survive through the wastewater all the way to the sequencing).
This is important to keep in mind if you would want to later run our tool cojac on the output to look for mutation cooccurrences. (note: as of the time being, check its dev branch as we haven't released version 0.2 yet).
Looking for individual mutations:
alignment/basecnt.tsv.gz
)For all our wastewater sequencing, we have used Artic protocols that produce ~400bp amplicon and sequenced those with paired-end 250bp sequencing in order to cover well each whole amplicon.
For more information, in the latest master branch of V-pipe, you can look in file V-pipe/config/config.html, see section "input", parameter "read_length".
Because there's no length specified (e.g.: in the third column as suggested in the tutorial), the cut-off value is computed from the default length (based on 250bp) and set to ≥200bp.
The read's length (150bp) is shorter than this cut-off, every single read is considered "too short" and filtered out. There are no remaining reads after quality filtering.
Hi, we have updated the config's readme section about the samples TSV to make this more clear.
Hi,
Many thanks for your answers and suggestions! I am using bwa following the SARS-CoV-2 tutorial. Now I could run the whole snakemake script on a sewage sample. The prinseq runs fine (please see the output file attached) however, there is an error during alignment (please see the attached shorah output file). Coverage.TSV file showed more zeros than other digits at positions. Any suggestion/s on how I can improve the coverage/alignment?
Best regards, Wasim
First, making sure that the alignment step (bwa
) went well.
In the output results/
{sample}/
{date}/alignments
(or if you have set your output directory to emulate old V-pipe 1.x/2.x/sars-cov2 behaviour samples/
{sample}/
{date}/alignments
):
REF_aln_stats.yaml
: it contains useful statistics collected after alignement:
coverage.tsv.gz
(warning, this is 0-based like all python software and the old central coverage.tsv file of V-pipe, it is not 1-based like all genomic standards and the output of samtools)
If you only have one region that is highly covered (i.e.: only a few fragment of the virus reached all the way to sequencing) there might by some wet-lab problems (in our experience, the RNA extraction and concentration is a very difficult task. Your lab might want to discuss with our coauthors from Eawag (Swiss Federal Institute of Aquatic Science and Technology) on our paper. Some labs we work with have reported success using the RNA extraction kits produced by Promega).
As soon as you have some good aligner reads, you might want to give a try to our tool cojac.
Use the dev branch as we haven't released a package 0.2 on bioconda yet.
Now regarding SNV calling: what is the exact wet-lab protocol used?
If it's the later case: there is a corner case in ShoRAH that handles poorly when the local-windows for the local-haplotypes aren't aligned to the boundaries of amplicons, leading to a lot of the reads being lost. The shorter the amplicon, the more this problem is exacerbated. We are aware of that limitation and there's undergoing work by a master student to address it. In the meantime I would suggest using lofreq instead (in your configuration file, section general -> property _snvcaller) and disabling local haplotypes (in your configuration file, section output -> property local). You can find out more about these options in the file config/config.html in your local installation of V-pipe.
Hi,
I have tried to run the tutorial dataset (SARS-CoV-2) with the command mentioned on the webpage, https://cbg-ethz.github.io/V-pipe/tutorial/sars-cov2/
However, after running successfully the whole script, the generated output files are all empty. Could anyone help me find out where is the problem?
Is it happeining in prinseq? Below is the output of prinseq;
The length cutoff is: 200 [prinseq-lite-0.20.4] [01/21/2022 23:24:18] Executing PRINSEQ with command: "perl prinseq-lite.pl -fastq samples/SRR10903401/20200102/extracted_data/R1.fastq -fastq2 samples/SRR10903401/20200102/extracted_data/R2.fastq -out_format 3 -trim_qual_right 30 -trim_qual_left 30 -trim_qual_type min -trim_qual_rule lt -trim_qual_window 10 -trim_qual_step 1 -log samples/SRR10903401/20200102/preprocessed_data/prinseq.out.log -min_len 200 -min_qual_mean 30 -ns_max_n 4 -out_good samples/SRR10903401/20200102/preprocessed_data/R -out_bad null" [prinseq-lite-0.20.4] [01/21/2022 23:24:18] Parsing and processing input data: "samples/SRR10903401/20200102/extracted_data/R1.fastq" and "samples/SRR10903401/20200102/extracted_data/R2.fastq" [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Done parsing and processing input data [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input sequences (file 1): 476,632 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input bases (file 1): 71,758,102 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 1): 150.55 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input sequences (file 2): 476,632 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input bases (file 2): 71,807,572 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 2): 150.66 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Good sequences (pairs): 0 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Good sequences (singletons file 1): 0 (0.00%) [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Good sequences (singletons file 2): 0 (0.00%) [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad sequences (file 1): 476,632 (100.00%) [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad bases (file 1): 71,758,102 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad mean length (file 1): 150.55 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad sequences (file 2): 0 (0.00%) [prinseq-lite-0.20.4] [01/21/2022 23:25:05] Sequences filtered by specified parameters: [prinseq-lite-0.20.4] [01/21/2022 23:25:05] trim_qual_left: 180134 [prinseq-lite-0.20.4] [01/21/2022 23:25:05] min_len: 773130
Please let me know,
Best regards, Wasim