pbfusion output description

gbonilla18 commented 1 year ago

Hi,

Where can I get more details about the read-through status annotation? I see on the bed file header that the read-through status can be Fusion, ReadThrough, Unannotated, or SenseAntisense. I want to know how those are defined.

Thanks

PB-DB commented 1 year ago

Hi -

Thanks for reaching out! I realize we could use a better description.

We have four lower-confidence categories (ReadThrough, Unannotated, SenseAntisense, and Overlap), plus the Fusion category, which is everything else. There are real fusions in the lower-confidence categories, but they also tend to be sources of false positives.

ReadThrough:

If the two breakpoint ends are on the same contig, in the same orientation, and within 100 kb of each other (tunable with --max-readthrough), they are considered a read-through event. Exons from multiple nearby genes often end up being spliced together, and this can make for some noisy information.
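As a rough illustration of the rule above (a sketch only, not pbfusion's actual code): the Breakpoint type and the is_readthrough function are hypothetical names, and treating "within 100 kb" as an inclusive distance of 100,000 bases is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Breakpoint:
    """Hypothetical record for one end of a breakpoint pair."""
    contig: str
    pos: int     # coordinate on the contig
    strand: str  # "+" or "-"


def is_readthrough(a: Breakpoint, b: Breakpoint,
                   max_readthrough: int = 100_000) -> bool:
    """Same contig, same orientation, and within max_readthrough bases.

    The default mirrors the 100 kb mentioned above; the inclusive
    comparison (<=) is an assumption.
    """
    return (
        a.contig == b.contig
        and a.strand == b.strand
        and abs(a.pos - b.pos) <= max_readthrough
    )
```

For example, two "+" strand breakpoints on chr1 that are 49 kb apart would count as a read-through under this sketch, while the same pair 1 Mb apart would not.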

SenseAntisense:

SenseAntisense is very much like ReadThrough, except with breakpoints on opposite strands. There are real biological signals in some of these (particularly notable in Kallikrein genes), but, like read-throughs, they can be products of the complexity of RNA transcription in eukaryotes.

Overlap:

If the two genes in question overlap (same strand and same contig), then we mark it as an Overlap case. It's similar to ReadThrough.

Unannotated:

If one of the exons in question is aligned to a segment which does not match an exon in the provided GTF annotation file, we mark it as Unannotated.

If a breakpoint pair doesn't fall into any of these categories, we call it Fusion. That is, it's a category of exclusion. By default, we prioritize events that have at least one breakpoint pair that doesn't fall into any of the above categories.
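To make the "category of exclusion" idea concrete, here is a minimal sketch of how the five labels could be assigned. It is not pbfusion's implementation. The End type, the function names, the half-open gene intervals, the order in which the checks run, and the reuse of the read-through distance for SenseAntisense are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class End:
    """Hypothetical record for one end of a breakpoint pair."""
    contig: str
    pos: int          # breakpoint coordinate
    strand: str       # "+" or "-"
    gene_start: int   # gene interval, half-open [start, end) (assumed)
    gene_end: int
    annotated: bool   # True if the aligned segment matches a GTF exon


def classify(a: End, b: End, max_readthrough: int = 100_000) -> str:
    """Return one of the five category names; check order is assumed."""
    if not (a.annotated and b.annotated):
        return "Unannotated"
    same_contig = a.contig == b.contig
    near = abs(a.pos - b.pos) <= max_readthrough
    if same_contig and a.strand == b.strand:
        # Two half-open intervals overlap if each starts before the other ends.
        if a.gene_start < b.gene_end and b.gene_start < a.gene_end:
            return "Overlap"
        if near:
            return "ReadThrough"
    elif same_contig and near:
        return "SenseAntisense"
    # Nothing matched, so the pair lands in the category of exclusion.
    return "Fusion"


def has_confident_pair(pairs: Iterable[Tuple[End, End]]) -> bool:
    """True if at least one breakpoint pair is classified as Fusion."""
    return any(classify(a, b) == "Fusion" for a, b in pairs)
```

Under these assumptions, an event would be prioritized whenever has_confident_pair returns True for its breakpoint pairs.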

We're working on a FAQ/visualization page to describe this better, and we'll add more to the README to account for this as well.

Good luck, and we'll be happy to answer any further questions!

Best,

Daniel

PB-DB commented 1 year ago

Here are some visualizations for these categories.

Read-through:

Overlap:

Unannotated:

Sense-Antisense:

gbonilla18 commented 1 year ago

Hi Daniel,

Thank you so much! These are pretty clear. Now, I have another question. I am planning on using these fusions to identify circular DNA. I know this is a separate issue, but I thought it might be worth checking with you if you can recommend any circular-DNA tools that are appropriate for PacBio reads (instead of short reads). Thanks, again!

Gracia

PB-DB commented 1 year ago

Sure!

First thing that comes to mind is Ivan Sović's Raptor (https://github.com/isovic/raptor) which can align against a circular reference. I've seen it used for mitochondrial genomes. I'd look at Demos 2 and 3.

It's more useful if you have a reference and want to perform an alignment through the circular end of the contig. So you'd have to identify the circular regions first.

Apart from that, you'd need to have already identified the circular sequences. You could look for these manually by building an overlap graph and searching it for cycles, but this is expensive and difficult. I've seen cresil (https://github.com/visanuwan/cresil) and cider-seq (https://github.com/devang-mehta/ciderseq2) used for long-read circular DNA discovery, but I have not personally used either.
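For a sense of what the manual route involves, here is a toy sketch of the cycle-finding step only. It assumes you already have a directed overlap graph as a dict mapping each read ID to the reads its end overlaps; computing those overlaps is the expensive part and is not shown. The function name and graph layout are hypothetical.

```python
from typing import Dict, List


def has_cycle(graph: Dict[str, List[str]]) -> bool:
    """Detect a directed cycle with an iterative depth-first search.

    Nodes are WHITE (unvisited), GRAY (on the current DFS path), or
    BLACK (fully explored). Reaching a GRAY node means the path loops
    back on itself, i.e. the graph contains a cycle.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    color = {n: WHITE for n in nodes}

    for root in nodes:
        if color[root] != WHITE:
            continue
        color[root] = GRAY
        stack = [(root, iter(graph.get(root, [])))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if color[nxt] == GRAY:
                    return True
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(graph.get(nxt, []))))
                    break
            else:
                # All neighbors explored: this node is finished.
                color[node] = BLACK
                stack.pop()
    return False
```

A graph such as {"r1": ["r2"], "r2": ["r3"], "r3": ["r1"]} contains a cycle, whereas a simple chain r1 -> r2 -> r3 does not. A cycle in real data is only a candidate circular molecule, since repeats can also create cycles.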

I imagine the best approach would be to use something like cresil for circular DNA discovery, align reads to the result using Raptor, generate a consensus, and then genotype, with the caveat that I haven't done this myself.

Thanks,

Daniel

PacificBiosciences / pbbioconda

pbfusion output description #597