PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.
BSD 3-Clause Clear License

CL: Annotation of alignment classes #675

Closed mariachiara-github closed 3 months ago

mariachiara-github commented 3 months ago

Hi, I was wondering how the classification (CL) of the fusions detected by pbfusion is inferred. For example, on what basis is a fusion classified as PotentialTransSplicing or as FUSION? I hope my question is clear; let me know if not :)

Thank you for your help!

PB-DB commented 3 months ago

Hi,

Great question!

We classify a gene as potential trans-splicing when it has a high number of candidate partners (>= 8 by default) and none of those candidate partners overlap with it.

We also mark these as low quality if the expected coverage on partner genes mismatches significantly, in addition to the standard pbfusion filtering criteria.
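For readers who want to reproduce this rule on their own candidate list, here is a minimal sketch of the partner-count test described above. It is not pbfusion's code: the `Gene` record, the `overlaps` helper, and the `potential_trans_splicing` function are hypothetical names, coordinates are assumed to be half-open intervals, and the coverage-mismatch check mentioned above is not modelled.

```python
from collections import defaultdict
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record; coordinates are half-open [start, end)."""
    name: str
    chrom: str
    start: int
    end: int


def overlaps(a: Gene, b: Gene) -> bool:
    """True if the two genes share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def potential_trans_splicing(pairs, min_partners=8):
    """Return names of genes with >= min_partners distinct candidate
    partners, none of which overlap the gene itself.

    `pairs` is an iterable of (Gene, Gene) candidate fusion pairs.
    The default of 8 is the threshold quoted in this thread.
    """
    genes = {}
    partners = defaultdict(dict)  # gene name -> {partner name: Gene}
    for a, b in pairs:
        genes[a.name] = a
        genes[b.name] = b
        partners[a.name][b.name] = b
        partners[b.name][a.name] = a

    flagged = set()
    for name, found in partners.items():
        gene = genes[name]
        if len(found) >= min_partners and not any(
            overlaps(gene, p) for p in found.values()
        ):
            flagged.add(name)
    return flagged
```

Counting distinct partner names rather than reads is an assumption of this sketch; the thread does not say how partners are counted internally.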

Happy to answer any further questions,

Daniel

mariachiara-github commented 3 months ago

Thank you so much for your answer! And what about the Fusion, SenseAntisense and Overlap classes? What are the criteria for assigning a fusion found by pbfusion to one of these classes?

PB-DB commented 3 months ago

SenseAntisense means that a read aligned to the same locus on both strands. This often happens in eukaryotic transcription, where such reads are called sense-antisense chimeras. The kallikrein (KLK) genes are a particularly prominent example. There can also be false positives in certain low-complexity regions.

Readthrough is assigned when a read aligns to two genes that share a chromosome and orientation and are positioned such that polymerase can start transcribing one gene and continue into the other. We use 100 kb as the threshold: if the end of the last alignment to the first gene is less than 100 kb upstream of the second gene, the event is marked as read-through. These events are also common, and some are even annotated in GenBank. This class lets users distinguish such an event from one created by a genomic rearrangement.

Overlap is assigned when the two genes to which a read aligned overlap with each other. This is also common; often the overlapping genes share a single exon, and some cases come from mapping/alignment errors. Again, this doesn't require a genomic rearrangement.

We use Fusion for all other events. This means the event cannot be explained by read-through, the genes do not overlap, and the read is not aligned to both strands at the same locus. Essentially, these events are more likely to be due to a genomic rearrangement.
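To make the four classes concrete, the rules above can be sketched as a small decision function. This is only an illustration under stated assumptions, not pbfusion's implementation: the `Side` record and `classify` function are hypothetical, coordinates are half-open, the order in which the checks run is my guess (the thread only says Fusion is what remains), and "upstream" is interpreted relative to the shared strand.

```python
from dataclasses import dataclass

READTHROUGH_MAX_GAP = 100_000  # 100 kb threshold quoted in this thread


@dataclass(frozen=True)
class Side:
    """Hypothetical record for one side of a candidate event."""
    chrom: str
    strand: str      # "+" or "-"
    aln_start: int   # read alignment on the genome, half-open
    aln_end: int
    gene_start: int  # annotated gene span, half-open
    gene_end: int


def _overlap(s1: int, e1: int, s2: int, e2: int) -> bool:
    """True if half-open intervals [s1, e1) and [s2, e2) intersect."""
    return s1 < e2 and s2 < e1


def classify(first: Side, second: Side) -> str:
    """Assign a class to a two-gene event; `first` is transcribed first.

    Assumed check order: SenseAntisense, Overlap, Readthrough, Fusion.
    """
    same_chrom = first.chrom == second.chrom

    # Same locus aligned on both strands.
    if (same_chrom and first.strand != second.strand
            and _overlap(first.aln_start, first.aln_end,
                         second.aln_start, second.aln_end)):
        return "SenseAntisense"

    # The two annotated genes overlap each other.
    if same_chrom and _overlap(first.gene_start, first.gene_end,
                               second.gene_start, second.gene_end):
        return "Overlap"

    # Same chromosome and orientation, second gene less than 100 kb downstream.
    if same_chrom and first.strand == second.strand:
        if first.strand == "+":
            gap = second.gene_start - first.aln_end
        else:
            gap = first.aln_start - second.gene_end
        if 0 <= gap < READTHROUGH_MAX_GAP:
            return "Readthrough"

    # Everything else: more likely a genomic rearrangement.
    return "Fusion"
```

For example, two plus-strand genes on the same chromosome separated by a few kilobases would come out as Readthrough here, while the same pair on different chromosomes would fall through to Fusion.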

PB-DB commented 3 months ago

I'm closing this for now - feel free to re-open with more questions!

We plan to extend the documentation to include this with the next release.

mariachiara-github commented 3 months ago

Thank you so much for the exhaustive explanation about the classes! I actually do have another question: how do you discriminate between LOW and MEDIUM fusions (the minimum fusion quality to emit)? In particular, which parameters do you look at to call a fusion MEDIUM quality? Thank you again!

PB-DB commented 3 months ago

Hi Maria,

Great question.

LOW is assigned when a candidate fusion is a readthrough, is between overlapping genes, or fails other QC tests. We filter these out by default, but --min-fusion-quality LOW causes all events to be emitted. This can be important for some fusions.

These tests are:

  1. Too many genes (> 3).
  2. Too few supporting reads (< 2).
  3. Minimum identity on either side of the breakpoint is too low (< 85%).
  4. Breakpoint median distance is too high (> 1000). This means the breakpoint isn't well-defined, or multiple events with nearby breakpoints are being grouped together.
  5. Minimum mapq on either side is too low (disabled by default, but the threshold can be raised above 0 with --min-min-mapq).

These thresholds can all be tweaked via command-line options, but the defaults work pretty well.
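As a rough illustration, the LOW criteria listed above can be written as a single check. The sketch below is not pbfusion's code: the `Candidate` record, its field names, and the `low_quality_reasons` and `quality` functions are hypothetical, and only the default thresholds are taken from the list. It assumes that anything passing every test is MEDIUM, and it leaves out the coverage check that applies to potential trans-splicing events.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of one candidate fusion."""
    cls: str                    # e.g. "Fusion", "Readthrough", "Overlap"
    n_genes: int                # number of genes involved
    n_reads: int                # number of supporting reads
    min_identity: float         # lowest identity on either side, as a fraction
    median_bp_distance: float   # breakpoint median distance
    min_mapq: int               # lowest mapq on either side


def low_quality_reasons(c, max_genes=3, min_reads=2, min_identity=0.85,
                        max_bp_distance=1000, min_mapq=0):
    """Return the list of tests a candidate fails (empty if it passes all).

    Defaults mirror the thresholds quoted in this thread. min_mapq=0
    disables the mapq test, because mapq is never negative.
    """
    reasons = []
    if c.cls in {"Readthrough", "Overlap"}:
        reasons.append("class is " + c.cls)
    if c.n_genes > max_genes:
        reasons.append("too many genes")
    if c.n_reads < min_reads:
        reasons.append("too few supporting reads")
    if c.min_identity < min_identity:
        reasons.append("identity too low")
    if c.median_bp_distance > max_bp_distance:
        reasons.append("breakpoint median distance too high")
    if c.min_mapq < min_mapq:
        reasons.append("mapq too low")
    return reasons


def quality(c, **thresholds):
    """LOW if any test fails, otherwise MEDIUM (assumed in this sketch)."""
    return "LOW" if low_quality_reasons(c, **thresholds) else "MEDIUM"
```

Returning the list of failed tests rather than just a label makes it easier to see why a particular event was filtered when experimenting with different thresholds.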

HIGH quality is reserved for future work; we do not currently assign it to any event.

Let me know what further questions you may have.

Thanks,

Daniel

mariachiara-github commented 3 months ago

Mitelman

Hi Daniel! Thank you again for the very clear and fast answer!