PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.
BSD 3-Clause Clear License
247 stars 43 forks source link

lima summary and lots of min pass filtered reads #626

Closed aherman-umn closed 9 months ago

aherman-umn commented 9 months ago

Hi there,

Looking at my lima *.summary output for my IsoSeq data I'm getting a sizable number of reads not passing the minimum pass filter. I was under the impression that the --min-passes 0 was the default, is that overridden by --isoseq? I see FullPass : false when I run

lima M145_reads.bam primers.fa M145_reads.fl.bam --isoseq --peek-guess -j 128 --log-level TRACE

And pretty much everything is single pass...

cut -f 28 M145_reads.fl.lima.report | sed '1d' | sort | uniq -c
    139 0
4980823 1

But I've got 1.75M reads getting filtered out?

ZMWs input                (A) : 6726107
ZMWs above all thresholds (B) : 3976716 (59.12%)
ZMWs below any threshold  (C) : 2749391 (40.88%)

ZMW marginals for (C):
Below min length              : 260 (0.01%)
Below min score               : 0 (0.00%)
Below min end score           : 345970 (12.58%)
Below min passes              : 1745284 (63.48%)
Below min score lead          : 0 (0.00%)
Below min ref span            : 302101 (10.99%)
Without SMRTbell adapter      : 139 (0.01%)
Undesired 5p--5p pairs        : 631173 (22.96%)
Undesired 3p--3p pairs        : 86127 (3.13%)
Undesired no hit              : 139 (0.01%)

ZMWs for (B):
With different pair           : 3976716 (100.00%)
Coefficient of correlation    : 0.00%

ZMWs for (A):
Allow diff pair               : 4980823 (74.05%)
Allow same pair               : 4980823 (74.05%)

Reads for (B):
Above length                  : 3976716 (100.00%)
Below length                  : 0 (0.00%)

Any insight is greatly appreciated!

aherman-umn commented 9 months ago

Ah, I've got it, Below min passes is np:i:0 from the CCS bam + NumPasses 0 from lima, which is scored based on adapter structure in the read and is only determined for np:i:n > 0.

samtools view M145_reads.bam | cut -f 12- | tr '\t' '\n' | grep 'np:i:' | sort | uniq -c | sort -k2,2V | head
1745145 np:i:0
 362574 np:i:1
 263808 np:i:2
 234643 np:i:3
 200416 np:i:4
 173981 np:i:5
 156908 np:i:6
 142309 np:i:7
 130874 np:i:8
 121740 np:i:9

So there are 1745145 zero pass CCS reads and if we add the 139 zero pass fl reads from lima we get the Below min passes: 1745284 value. Zero pass CCS reads are counted for the .summary file but are not included in the .report file, which is what confused me!