Closed shamsbhuiyan closed 2 years ago
There's actually no great reason why there is an extra *.events.tsv file for exon skipping events only, and since that file isn't used for anything, I will have a future version of the flair wrapper just remove that file.
As for what the difference between the two files is: the es.events.tsv file is made first, and contains the event, the number of inclusion isoforms, the number of exclusion isoforms, and the names of the inclusion and exclusion isoforms. The es.events.quant.tsv file adds in the counts_matrix.tsv information, matching the isoform names to their expression values for each sample, and is formatted for DRIMSeq input. By the way, I have started working on automating the DRIMSeq testing for diffSplice, so hopefully that will be up sometime next week.
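To illustrate the relationship between the two files, here is a minimal sketch of how per-sample counts could be attached to an event by isoform name. This is not FLAIR's actual code: the function names, the dictionary layout, and the toy isoform names are all assumptions made for illustration, and the real es.events.quant.tsv layout may differ.

```python
# Hypothetical sketch: attach per-sample counts to an exon skipping event.
# Data structures and names are illustrative, not FLAIR's own.

def sum_counts(isoforms, counts, n_samples):
    """Sum per-sample counts over a list of isoform names."""
    totals = [0.0] * n_samples
    for iso in isoforms:
        # Isoforms missing from the counts matrix contribute zeros.
        for i, value in enumerate(counts.get(iso, [0.0] * n_samples)):
            totals[i] += value
    return totals


def quantify_event(event, counts, n_samples):
    """Return inclusion and exclusion totals per sample for one event."""
    return {
        "event": event["name"],
        "inclusion": sum_counts(event["inclusion"], counts, n_samples),
        "exclusion": sum_counts(event["exclusion"], counts, n_samples),
    }


# Toy counts matrix: isoform name -> counts in three samples.
counts = {
    "isoA": [2.0, 0.0, 5.0],
    "isoB": [1.0, 3.0, 0.0],
    "isoC": [0.0, 4.0, 1.0],
}
# Toy event: one inclusion isoform, two exclusion isoforms.
event = {"name": "chr1:100-200", "inclusion": ["isoA"], "exclusion": ["isoB", "isoC"]}

row = quantify_event(event, counts, n_samples=3)
print(row)
```

The point is only that the quant file is the events file joined to the counts matrix on isoform name.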
-Alison
Thanks for the quick response, Alison. Would there be any reason for FLAIR to remove ES events from *.events.tsv? I see that some isoforms are in the *.events.tsv file, but not the *.es.events.quant.tsv file.
Example of a gene I see in the flair.diffSplice.es.events.tsv file but not in the .es.events.quant.tsv file:
12:16558643-16558721 + 1 3 ERR3363658.510961_12:16535000 ERR2680377.775211_12:16537000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16560981-16561051 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16580703-16580904 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16568389-16568519 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16571163-16571259 + 1 2 ERR3363660.1462133_ENSMUSG00000020593 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000
12:16579881-16579977 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16562362-16562456 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16576947-16577255 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16571163-16571262 + 1 2 ERR3363658.1689184_ENSMUSG00000020593 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000
12:16553948-16554028 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16573659-16573785 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16541713-16541828 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16540913-16541017 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16546720-16546807 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16565390-16565697 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16563621-16563812 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16544643-16544796 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16564525-16564619 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16558408-16558501 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16547485-16547634 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16548876-16549003 + 4 0 ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
Samples with low expression (i.e. 0 reads in either the inclusion/exclusion of that exon) are removed. It's a line in the es_as_inc_excl_to_counts.py script:
minVal = np.nanmin(totVals)
if minVal < 1: continue
Do you see cases where you think this should be fixed? -Alison
When I pull out the isoforms for the gene for the above example ("ENSMUSG00000020593") from the counts_matrix.tsv, I see this:
ERR3363658.1689184_ENSMUSG00000020593 0.0 0.0 2.0 4.0 0.0 0.0 0.0 0.0 3.0 0.0 2.0 0.0 0.0
ERR3363660.1462133_ENSMUSG00000020593 0.0 0.0 1.0 2.0 0.0 1.0 0.0 0.0 4.0 0.0 2.0 0.0 0.0
While some samples have 0 reads, it's clear that certain samples have more than 1 read. Granted, the highest number of reads is 4, but still, it strikes me as odd to filter this one out.
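A small sketch of why a minimum-based check like the one quoted from es_as_inc_excl_to_counts.py would drop this event. It assumes, for illustration only, that totVals is the per-sample sum over the two isoform rows shown above; the script may compute totVals differently, and the other isoforms of the event are not included here.

```python
import numpy as np

# The two counts_matrix.tsv rows quoted above (13 samples each).
iso_counts = np.array([
    [0.0, 0.0, 2.0, 4.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 0.0, 1.0, 0.0, 0.0, 4.0, 0.0, 2.0, 0.0, 0.0],
])

# Assumption: totVals holds one total per sample.
tot_vals = iso_counts.sum(axis=0)

# Same test as the quoted line: the smallest per-sample total decides.
min_val = np.nanmin(tot_vals)
filtered = bool(min_val < 1)

print("per-sample totals:", tot_vals.tolist())
print("event skipped:", filtered)
```

Because the check uses the minimum across samples, a single sample with zero reads is enough to skip the event, even when other samples have several reads.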
I just pushed a new commit where the above example should be reported in your *es.events.quant.tsv. There is no longer a filter at this step, and all events will be reported regardless of expression.
The new commit also has DRIMSeq testing implemented, so it'll be up to the user to set filters for the minimum number of samples that support an event, the minimum number of reads, etc.
Thanks for doing that. Would you be able to explain what you guys see as a "minimum isoform expression"? From the code you guys posted, I'm not quite sure what your threshold was initially. I'm not against having a threshold to remove lowly expressed reads or samples; I was just wondering what you guys were trying to go for.
If an isoform is lowly expressed across the samples, then it may be difficult to draw conclusions about how significant of a role that isoform has. We decided to filter out these cases to reduce the number of tests in multiple test correction. -Alison
That sounds good, but more specifically, how "lowly expressed" across the samples does an isoform have to be for you to consider it not likely to be important?
You can set this filter yourself using the drim1-4 parameters in flair diffSplice. They correspond to the following parameters in DRIMSeq:
min_samps_gene_expr = drim1
min_samps_feature_expr = drim2
min_gene_expr = drim3
min_feature_expr = drim4
Flair's defaults are in parentheses:
--drim1 DRIM1 The minimum number of samples that have coverage over an AS event inclusion/exclusion for DRIMSeq testing;
events with too few samples are filtered out and not tested (6)
--drim2 DRIM2 The minimum number of samples expressing the inclusion of an AS event; events with too few samples are filtered
out and not tested (3)
--drim3 DRIM3 The minimum number of reads covering an AS event inclusion/exclusion for DRIMSeq testing, events with too few
samples are filtered out and not tested (15)
--drim4 DRIM4 The minimum number of reads covering an AS event inclusion for DRIMSeq testing, events with too few samples are
filtered out and not tested (5)
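As a rough illustration of what these four thresholds mean, here is a simplified sketch that follows the wording of the help text above. It is not DRIMSeq's or FLAIR's implementation: the function name passes_filter and the exact comparison rules are assumptions, and DRIMSeq's own filtering is applied per feature and may behave differently in detail.

```python
import numpy as np

def passes_filter(incl, excl, drim1=6, drim2=3, drim3=15, drim4=5):
    """Hypothetical check of one event against the four thresholds.

    incl, excl: per-sample read counts for inclusion and exclusion.
    Defaults are the FLAIR defaults quoted in the help text.
    """
    incl = np.asarray(incl, dtype=float)
    excl = np.asarray(excl, dtype=float)
    total = incl + excl

    # drim3 reads over the event in at least drim1 samples.
    enough_event_samples = np.sum(total >= drim3) >= drim1
    # drim4 inclusion reads in at least drim2 samples.
    enough_incl_samples = np.sum(incl >= drim4) >= drim2
    return bool(enough_event_samples and enough_incl_samples)


# Toy event with solid coverage in 6 of 13 samples.
well_covered = passes_filter([10] * 6 + [0] * 7, [10] * 6 + [0] * 7)

# Toy event with at most a handful of reads per sample.
low_incl = [0, 0, 2, 4, 0, 0, 0, 0, 3, 0, 2, 0, 0]
low_excl = [0, 0, 1, 2, 0, 1, 0, 0, 4, 0, 2, 0, 0]
sparse = passes_filter(low_incl, low_excl)

print(well_covered, sparse)
```

Under these assumed rules, an event with only a few reads per sample would not be tested with the default thresholds, so lowering drim1 through drim4 is how such events would be kept.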
Please reopen this ticket if you have any more questions.
What is the difference between these two output files? They don't seem to be overlapping 100%:
flair.diffsplice.es.events.quant.tsv
and flair.diffsplice.es.events.tsv
I have noticed that there is only a *.events.tsv file for exon skipping, whereas all other alternative splicing events have only a *.events.quant.tsv. Why is that?