BioInf-Wuerzburg / proovread

PacBio hybrid error correction through iterative short read consensus
MIT License
60 stars 20 forks source link

SeqFilter error #102

Closed SabLeCam closed 6 years ago

SabLeCam commented 7 years ago

Hello, Like rebzzy ( #100 ) I've bee trying to follow the directions you advised in the following thread https://www.biostars.org/p/155236/ I have aligned paired-end miseq reads (2x300bp) on a draft genome of close species (6 chr, 1.9GB) to extract consensus sequences from my mapped reads. While bam2cns ran correctly, SeqFilter command is killed when max RAM is reached (128G). SeqFilter --in miseq_cns.fq --out miseq_cns.trimmed.fa \ --fasta --min-length 294 \ --phred-offset 33 --trim-win 1,1

I have tried chuncking my miseq_cns.fq file in 3 (the smallest is 824MB) and re-running SeqFilter, but the error is the same (zsh : Killed SeqFilter --in miseq_cns.fq --out miseq_cns.trimmed.fa --fasta --min-length 294 --phred-offset 33 --trim-win 1,1)

Any idea of what I've been doing wrong would be greatly appreciated Best Sab

thackl commented 7 years ago

Hi Sab,

I don't think that you are doing anything wrong. The problem is that I wrote SeqFilter to primarily trim PacBio reads and didn't optimize it to be memory-efficient on very long sequences, such as entire chromosomes.

The main problem seems to be the --trim-win algorithm. Try to use '--trim-lcs 1,40,294' instead. While --trim-win uses window-average qualities, --trim-lcs uses a hard cutoff, i.e. a sequence gets split at every base below that threshold. It is much more efficient. I just tested it on a 100 Mbp sequence and had no problems.

'--trim-lcs 1,40,294' means to keep all bases with phred-scores between 1 and 40 (max), and a minimum length of 294.

Phred-scores from bam2cns are roughly linked to coverage

Cov Phred 0 0 1 7 2 10 3 12 4 14 5 16 6 17 ...

so '--trim-lcs 1,40,294' will give you regions with more than 0X coverage, i.e. 1X, '--trim-lcs 10,40,294' with 2X, '--trim-lcs 12,40,294' with 3X and so on.

Hope that helps Cheers Thomas

SabLeCam commented 7 years ago

Thanks a bunch thomas! I'll try that right away and keep you informed Best Sab

SabLeCam commented 7 years ago

Hello again, I have tested the --trim-lcs 1,40,294 option you've advised on the smallest cns file (824MB)and it still isn't working unfortunately: SeqFilter --in miseq_cns.003.fq --out miseq_cns.003.trimmed.fa \ --fasta --min-length 294 \ --phred-offset 33 --trim-lcs 1,40,294 [17:14:14] /home/jean-baptiste/proovread/bin/SeqFilter-1.06 [17:14:14] --in: miseq_cns.003.fq [17:14:15] Detected FASTQ format, phred-offset 33 708M [=================================================> ]zsh: killed

I guess I might have to start over again from a "chuncked" reference, do you think that could do the trick? Best, Sab