davidaknowles / leafcutter

Annotation-free quantification of RNA splicing. Yang I. Li, David A. Knowles, Jack Humphrey, Alvaro N. Barbeira, Scott P. Dickinson, Hae Kyung Im, Jonathan K. Pritchard
http://davidaknowles.github.io/leafcutter/
Apache License 2.0
208 stars 115 forks source link

Behavior of -p --mincluratio: unexpectedly removing intron clusters #221

Open VitorAguiar opened 2 years ago

VitorAguiar commented 2 years ago

Hi, thanks for developing leafcutter!

I'm trying to understand the impact of the option -p in the intron clustering step regarding the generation of "*_refined" from "*_pooled" (using leafcutter_cluster_regtools.py).

On a differential splicing analysis setting (cases vs. controls), I've been clustering junctions from a heterogeneous group of samples (i.e. different datasets, tissues, varying sequencing depth, etc), and comparing that with clustering on a per dataset/tissue basis. With the latter approach, I see a nice exon inclusion event in a gene of interest, matching my expectation from a differential transcript usage analysis.

However, when I do clustering on the larger and heterogeneous set of samples, I first get in the "*_pooled" file an enormous cluster with 495 junctions that includes my event of interest, but extends beyond the gene. While some junctions have only 3 reads, others have >170,000.

Then, if I use -p 0.00001, I get in the "*_refined" file a smaller cluster with 14 junctions, including my splicing event of interest, with the number of reads per junction ranging from 98 to 170,000. But the leafviz plot is noisy since I include many junctions with few supporting reads, and the delta PSI is small since it's relative to the cluster.

However, if I use -p 0.05, I lose the entire cluster in "*_refined", even though my event of interest is supported by >96,000 reads. Is that a desired behavior?

I'd love to hear from @davidaknowles, @goldenflaw et al.

Thank you! Vitor

goldenflaw commented 2 years ago

Hi Vitor,

This may be an edge case which we will try to fix in future versions. For now, if you already know which junction/cluster you are interested in. You can simply manually define clusters (I believe) by overriding the _refined file.

Best, Yang

On Mon, Nov 21, 2022 at 3:46 PM Vitor @.***> wrote:

Hi, thanks for developing leafcutter!

I'm trying to understand the impact of the option -p in the intron clustering step regarding the generation of "_refined" from "_pooled" (using leafcutter_cluster_regtools.py).

On a differential splicing analysis setting (cases vs. controls), I've been clustering junctions from a heterogeneous group of samples (i.e. different datasets, tissues, varying sequencing depth, etc), and comparing that with clustering on a per dataset/tissue basis. With the latter approach, I see a nice exon inclusion event in a gene of interest, matching my expectation from a differential transcript usage analysis.

However, when I do clustering on the larger and heterogeneous set of samples, I first get in the "*_pooled" file an enormous cluster with 495 junctions that includes my event of interest, but extends beyond the gene. While some junctions have only 3 reads, others have >170,000.

Then, if I use -p 0.00001, I get in the "*_refined" file a smaller cluster with 14 junctions, including my splicing event of interest, with the number of reads per junction ranging from 98 to 170,000. But the leafviz plot is noisy since I include many junctions with few supporting reads, and the delta PSI is small since it's relative to the cluster.

However, if I use -p 0.05, I lose the entire cluster in "*_refined", even though my event of interest is supported by >96,000 reads. Is that a desired behavior?

I'd love to hear from @davidaknowles https://github.com/davidaknowles, @goldenflaw https://github.com/goldenflaw et al.

Thank you! Vitor

— Reply to this email directly, view it on GitHub https://github.com/davidaknowles/leafcutter/issues/221, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABGWTCPASP6BUBFLVCFH4NDWJPUT7ANCNFSM6AAAAAASHC6L6Y . You are receiving this because you were mentioned.Message ID: @.***>