PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.
BSD 3-Clause Clear License

Isoseq collapse filtering out criteria #664

Closed MengjunWu closed 3 months ago

MengjunWu commented 6 months ago

Hi, I have a problem with isoseq collapse. Although most of my reads (90%) are mapped, almost half of them are filtered out by isoseq collapse. How do you calculate coverage and identity? I take identity from the mg tag and calculate per-read coverage as the number of matches and mismatches in the CIGAR string divided by the read length, but with the same thresholds I get far fewer reads filtered out than isoseq collapse does. Is either coverage or identity calculated differently?

Many thanks, Mengjun

armintoepfer commented 6 months ago

Assigning to @jmattick

jmattick commented 3 months ago

Hi @MengjunWu, collapse filters based on the following:

  1. Read is mapped
  2. Read is a primary alignment
  3. Read meets the minimum coverage: (aligned end - aligned start) / (read length)
  4. Read meets the minimum identity: matches / (matches + mismatches + inserted bases + deleted bases)
  5. Optional: If using single-cell workflow, read must be marked as coming from a real cell using the rc tag.
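
To make criteria 3 and 4 concrete, here is a minimal Python sketch that computes both values from a CIGAR string. It is not the isoseq source code, only an illustration of the formulas above under several assumptions: the CIGAR uses `=`/`X` operations (with plain `M`, matches and mismatches cannot be told apart), aligned end and start are in read coordinates so the aligned span is the read length minus clipped bases, clipped bases count toward read length, and `N` (intron) operations are not counted as deleted bases. The names `alignment_metrics` and `passes_filter` are hypothetical; the default thresholds are the ones listed under the alignment filter options.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_metrics(cigar):
    """Return (coverage, identity) for a single alignment's CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if not ops or "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")

    counts = dict.fromkeys("MIDNSHP=X", 0)
    for n, op in ops:
        counts[op] += n
    if counts["M"]:
        raise ValueError("CIGAR uses M; =/X operations are required here")

    # Read bases inside the alignment (aligned end - aligned start).
    aligned_span = counts["="] + counts["X"] + counts["I"]
    # Assumption: soft- and hard-clipped bases count toward read length.
    read_length = aligned_span + counts["S"] + counts["H"]
    # Assumption: N (intron skip) is not a deletion, so it is left out.
    aligned_columns = counts["="] + counts["X"] + counts["I"] + counts["D"]
    if read_length == 0 or aligned_columns == 0:
        raise ValueError("CIGAR contains no aligned bases")

    coverage = aligned_span / read_length
    identity = counts["="] / aligned_columns
    return coverage, identity


def passes_filter(cigar, min_coverage=0.99, min_identity=0.95):
    """Apply the two numeric thresholds to one alignment."""
    coverage, identity = alignment_metrics(cigar)
    return coverage >= min_coverage and identity >= min_identity


if __name__ == "__main__":
    # 10 clipped bases out of 200 already fails the 0.99 coverage default.
    print(alignment_metrics("10S80=5X5I2D1000N100="))
    print(passes_filter("10S80=5X5I2D1000N100="))
```

Compared with the calculation in the question, two things differ here. Inserted bases count toward the aligned span, so this coverage is never lower than matches plus mismatches over read length. Every inserted and deleted base counts in the identity denominator, so this identity can be lower than a gap-compressed value such as the one I understand the mg tag to report. The second difference may account for part of the gap in filtered reads.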

These minimum values can be changed using the following options.

Alignment Filter Options:
  --min-aln-coverage              FLOAT  Ignore alignments with less than minimum query read coverage. [0.99]
  --min-aln-identity              FLOAT  Ignore alignments with less than minimum alignment identity. [0.95]