Kingsford-Group / scallop

Scallop is a reference-based transcriptome assembler for RNA-seq
BSD 3-Clause "New" or "Revised" License

Question about exon extension events #26

Open aleighbrown opened 4 years ago

aleighbrown commented 4 years ago

Hello,

Scallop does not seem to be detecting some exon extension events properly. I've run it with the default settings, except for --min_transcript_length_increase, which I decreased to 15.

See below for an example: there are clearly 2 exons supported by the junctions, one of which is a 3' extension of the other, but Scallop only reports the longer of the 2 exons.

image

Are there some additional settings which I need to fiddle about with? (e.g. I see this flag for bundle gap, should I bring that down?)

--min_bundle_gap | 50 | the minimum distances required to start a new bundle

Thanks!

aleighbrown commented 4 years ago

I also tried merging all the BAMs from this condition together and running Scallop on the merged BAM:

image

However, this results in only the shorter of the 2 exons being reported.

aleighbrown commented 4 years ago

There are also a few novel exons completely missed by Scallop, which leaves me a little perplexed considering it picks up other events.

image

shaomingfu commented 4 years ago

Hi, Scallop will identify possible false-positive junctions and remove them. (Specifically, if a vertex, i.e., a partial exon, in the splice graph has multiple in-edges and multiple out-edges, then the edge with the lowest read support will be removed if this edge has no phasing support.) The purpose of this step is to reduce the false-positive rate, as RNA-seq data is noisy. Unfortunately, this will sometimes also remove true junctions.

With that said, what you've observed might be the normal behavior of Scallop (especially if the missing junctions have low read coverage). But I'll look into your examples to see if some parameters can be tuned to increase Scallop's sensitivity for your needs.
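For readers who want to see the filtering rule above in concrete form, here is a minimal Python sketch. It is not Scallop's actual code (Scallop is written in C++ and its real procedure may differ, for example by iterating or using additional criteria). The data structures are assumptions made for illustration: `edges` maps a hypothetical `(source, target)` pair to its read support, and `phased` is a hypothetical set of edges that have phasing support.

```python
from collections import defaultdict


def edges_to_remove(edges, phased):
    """Return the edges that the rule described above would drop.

    edges:  dict mapping (u, v) -> read support (illustrative structure)
    phased: set of (u, v) edges that have phasing support (illustrative)
    """
    in_edges = defaultdict(list)
    out_edges = defaultdict(list)
    for (u, v) in edges:
        out_edges[u].append((u, v))
        in_edges[v].append((u, v))

    removed = set()
    # Only vertices with both incoming and outgoing edges can qualify.
    for vertex in set(in_edges) & set(out_edges):
        # The rule applies only to vertices with multiple in-edges
        # AND multiple out-edges.
        if len(in_edges[vertex]) < 2 or len(out_edges[vertex]) < 2:
            continue
        incident = in_edges[vertex] + out_edges[vertex]
        weakest = min(incident, key=lambda e: edges[e])
        # An edge with phasing support is kept even if it is the weakest.
        if weakest not in phased:
            removed.add(weakest)
    return removed


# Toy splice graph: vertex "x" has two in-edges and two out-edges.
toy_edges = {("a", "x"): 20, ("b", "x"): 2, ("x", "c"): 15, ("x", "d"): 9}
print(edges_to_remove(toy_edges, phased=set()))
print(edges_to_remove(toy_edges, phased={("b", "x")}))
```

In the toy graph, the low-support junction `("b", "x")` is dropped unless it has phasing support, which is how a true but weakly covered junction could disappear from the assembly.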

Mingfu

shaomingfu commented 4 years ago

I would suggest you try two things to increase sensitivity:

  1. Scallop filters out assembled transcripts with very low coverage. This is mainly controlled by the parameter --min_transcript_coverage. Its default value is 1. You can try setting it to, say, 0.01.

  2. If you have multiple runs of the same sample, you can try assembling them separately and then taking the union of their assembled transcripts. The gtfmerge tool in https://github.com/Kingsford-Group/rnaseqtools can do the union work.
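The two suggestions above could be combined roughly as in the following Python sketch, which only builds and prints the command lines rather than running them. The file names are hypothetical. The `scallop -i <bam> -o <gtf>` form and the `gtfmerge union <list-file> <merged-gtf>` form reflect my reading of the respective READMEs and should be treated as assumptions to check against each tool's usage message.

```python
import shlex


def scallop_command(bam, gtf, min_transcript_coverage=0.01):
    """Build a Scallop command line with a lowered coverage threshold."""
    return [
        "scallop",
        "-i", bam,
        "-o", gtf,
        "--min_transcript_coverage", str(min_transcript_coverage),
    ]


def gtfmerge_command(gtf_list_file, merged_gtf):
    """Build a gtfmerge command; the argument order is an assumption."""
    return ["gtfmerge", "union", gtf_list_file, merged_gtf]


# Hypothetical replicate BAM files for one condition.
replicates = ["rep1.bam", "rep2.bam"]

# Assemble each replicate separately, then take the union of the GTFs.
commands = [scallop_command(bam, bam.replace(".bam", ".gtf")) for bam in replicates]
# gtf_list.txt is assumed to list one assembled GTF path per line.
commands.append(gtfmerge_command("gtf_list.txt", "merged.gtf"))

for cmd in commands:
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to inspect them before passing each list to something like `subprocess.run(cmd, check=True)`.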

Mingfu

shaomingfu commented 4 years ago

The last example (exon-missing) looks interesting to me. Can you share a picture that includes the entire gene locus?

aleighbrown commented 3 years ago

Screenshot 2021-02-16 at 16 44 26

Whoops, apologies, I got sidetracked from this project.

Does this help? I've already run gtfmerge on my replicates, and will now toy with bringing down --min_transcript_coverage.