BrooksLabUCSC / flair

Full-Length Alternative Isoform analysis of RNA

Differences between the diffsplice outputs #96

Closed shamsbhuiyan closed 2 years ago

shamsbhuiyan commented 4 years ago

What is the difference between these two output files - they don't seem to be overlapping 100%: flair.diffsplice.es.events.quant.tsv and flair.diffsplice.es.events.tsv

I have noticed that there is a *.events.tsv file only for exon skipping, whereas all other alternative splicing events have only a *.events.quant.tsv - why is that?

belgravia commented 4 years ago

There's actually no great reason why there is an extra *.events.tsv file for exon skipping events only, and since that file isn't used for anything, I will have a future version of the flair wrapper just remove that file.

As for what the difference between the two files is: the es.events.tsv file is made first, and contains the event, the number of inclusion isoforms, the number of exclusion isoforms, and the names of the inclusion and exclusion isoforms. The es.events.quant.tsv file adds in the counts_matrix.tsv information, matching the isoform names to their expression values for each sample, and is formatted for DRIMSeq input. By the way, I have started working on automating the DRIMSeq testing for diffsplice, so hopefully that will be up sometime next week.
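
The matching step described above can be sketched roughly as follows. This is not FLAIR's code: the function names, the toy isoform names, and the two-sample counts are all made up for illustration, and the column layout (event, strand, inclusion count, exclusion count, inclusion names, exclusion names, tab-separated) is assumed from the rows quoted in this thread. The real es.events.quant.tsv layout is not reproduced here.

```python
def parse_event_line(line):
    """Split one es.events.tsv row into its fields (layout assumed, see above)."""
    fields = line.rstrip("\n").split("\t")
    event, strand = fields[0], fields[1]
    # Rows with 0 exclusion isoforms may have no sixth column at all.
    inclusion = fields[4].split(",") if len(fields) > 4 and fields[4] else []
    exclusion = fields[5].split(",") if len(fields) > 5 and fields[5] else []
    return event, strand, inclusion, exclusion


def sum_counts(isoforms, counts, n_samples):
    """Add up per-sample counts over isoform names; unknown names count as zero."""
    totals = [0.0] * n_samples
    for name in isoforms:
        for i, value in enumerate(counts.get(name, [0.0] * n_samples)):
            totals[i] += value
    return totals


# Hypothetical two-sample counts matrix keyed by isoform name.
counts = {"isoA": [2.0, 0.0], "isoB": [1.0, 3.0], "isoC": [0.0, 4.0]}
event, strand, inc, exc = parse_event_line("chr1:100-200\t+\t1\t2\tisoA\tisoB,isoC")
inclusion_totals = sum_counts(inc, counts, 2)
exclusion_totals = sum_counts(exc, counts, 2)
```

Each event then carries one inclusion and one exclusion count per sample, which is the kind of table a per-event test needs.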

-Alison

shamsbhuiyan commented 4 years ago

Thanks for the quick response Alison. Would there be any reason for FLAIR to remove ES events from *.events.tsv? I see that some isoforms are in the *.events.tsv file, but not the *.es.events.quant.tsv file

Example of a gene I see in the flair.diffSplice.es.events.tsv file but not in the .es.events.quant.tsv file:


12:16558643-16558721    +       1       3       ERR3363658.510961_12:16535000   ERR2680377.775211_12:16537000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16560981-16561051    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16580703-16580904    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16568389-16568519    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16571163-16571259    +       1       2       ERR3363660.1462133_ENSMUSG00000020593   ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000
12:16579881-16579977    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16562362-16562456    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16576947-16577255    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16571163-16571262    +       1       2       ERR3363658.1689184_ENSMUSG00000020593   ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000
12:16553948-16554028    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16573659-16573785    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16541713-16541828    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16540913-16541017    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16546720-16546807    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16565390-16565697    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16563621-16563812    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16544643-16544796    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16564525-16564619    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16558408-16558501    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16547485-16547634    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593
12:16548876-16549003    +       4       0       ERR2680377.775211_12:16537000,ERR3363658.510961_12:16535000,ERR3363658.1689184_ENSMUSG00000020593,ERR3363660.1462133_ENSMUSG00000020593

belgravia commented 4 years ago

Samples with low expression (i.e. 0 reads in either the inclusion/exclusion of that exon) are removed. It's a line in the es_as_inc_excl_to_counts.py script:

  minVal = np.nanmin(totVals)
  if minVal < 1:
      continue
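
To make the effect of that check concrete, here is a minimal standalone sketch. It assumes totVals holds one total per sample for the event. The helper name and the numbers are hypothetical and are not taken from es_as_inc_excl_to_counts.py.

```python
import numpy as np


def passes_min_filter(tot_vals):
    """Return True if the event survives the `minVal < 1` check quoted above."""
    min_val = np.nanmin(tot_vals)  # smallest per-sample value, ignoring NaNs
    return not (min_val < 1)


# Hypothetical per-sample totals, for illustration only.
kept = passes_min_filter(np.array([2.0, 4.0, 3.0]))     # every sample has >= 1 read
dropped = passes_min_filter(np.array([2.0, 0.0, 4.0]))  # one sample has 0 reads
```

Because the check uses the minimum, a single sample with 0 reads is enough to drop the whole event, however well the other samples are covered.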

Do you see cases where you think this should be fixed? -Alison

shamsbhuiyan commented 4 years ago

When I pull out the isoforms for the gene for the above example ("ENSMUSG00000020593") from the counts_matrix.tsv, I see this:

ERR3363658.1689184_ENSMUSG00000020593   0.0     0.0     2.0     4.0     0.0     0.0     0.0     0.0     3.0     0.0     2.0     0.0     0.0
ERR3363660.1462133_ENSMUSG00000020593   0.0     0.0     1.0     2.0     0.0     1.0     0.0     0.0     4.0     0.0     2.0     0.0     0.0

While some samples have 0 reads, it's clear that certain samples have more than 1 read. Granted, the highest number of reads is 4, but still, it strikes me as odd to filter this one out.

belgravia commented 4 years ago

I just pushed a new commit where the above example should be reported in your *es.events.quant.tsv. There is no longer a filter at this step, and all events will be reported regardless of expression.

The new commit also has DRIMSeq testing implemented, so it'll be up to the user to set filters for the minimum number of samples that support an event, the minimum number of reads, etc.

shamsbhuiyan commented 4 years ago

Thanks for doing that - would you be able to explain what you guys see as a "minimum isoform expression"? From the code you guys posted, I'm not quite sure what your threshold was initially. I'm not against having a threshold to remove lowly expressed reads or samples, I was just wondering what you guys were trying to go for.

belgravia commented 4 years ago

If an isoform is lowly expressed across the samples, then it may be difficult to draw conclusions about how significant of a role that isoform has. We decided to filter out these cases to reduce the number of tests in multiple test correction. -Alison

shamsbhuiyan commented 4 years ago

That sounds good, but more specifically, how "lowly expressed" across the samples does an isoform have to be for you to consider it not likely to be important?

Jeltje commented 2 years ago

You can set this filter yourself using the drim1-4 parameters in flair diffSplice. They correspond to the following parameters in DRIMSeq:

min_samps_gene_expr = drim1
min_samps_feature_expr = drim2
min_gene_expr = drim3
min_feature_expr = drim4

Flair's defaults are in parentheses:

  --drim1 DRIM1         The minimum number of samples that have coverage over an AS event inclusion/exclusion for DRIMSeq testing;
                        events with too few samples are filtered out and not tested (6)
  --drim2 DRIM2         The minimum number of samples expressing the inclusion of an AS event; events with too few samples are filtered
                        out and not tested (3)
  --drim3 DRIM3         The minimum number of reads covering an AS event inclusion/exclusion for DRIMSeq testing, events with too few
                        samples are filtered out and not tested (15)
  --drim4 DRIM4         The minimum number of reads covering an AS event inclusion for DRIMSeq testing, events with too few samples are
                        filtered out and not tested (5)
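
As a rough illustration of how these four thresholds interact, here is a simplified sketch. It is not DRIMSeq's or FLAIR's code: the function names are hypothetical, the comparisons are assumed to be inclusive (>=), and DRIMSeq's other filtering options are ignored. The two count rows are the ones quoted from counts_matrix.tsv earlier in this thread, treated here as the inclusion counts of their events.

```python
import numpy as np


def feature_passes(counts, min_samps_feature_expr=3, min_feature_expr=5):
    """True if at least drim2 samples have at least drim4 reads for this feature."""
    counts = np.asarray(counts, dtype=float)
    return int(np.sum(counts >= min_feature_expr)) >= min_samps_feature_expr


def gene_passes(count_matrix, min_samps_gene_expr=6, min_gene_expr=15):
    """True if at least drim1 samples have at least drim3 reads summed over features."""
    totals = np.asarray(count_matrix, dtype=float).sum(axis=0)
    return int(np.sum(totals >= min_gene_expr)) >= min_samps_gene_expr


# Rows quoted above for the two ENSMUSG00000020593 isoforms (13 samples each).
iso_1689184 = [0.0, 0.0, 2.0, 4.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 2.0, 0.0, 0.0]
iso_1462133 = [0.0, 0.0, 1.0, 2.0, 0.0, 1.0, 0.0, 0.0, 4.0, 0.0, 2.0, 0.0, 0.0]

default_result = [feature_passes(c) for c in (iso_1689184, iso_1462133)]
relaxed_result = [feature_passes(c, min_feature_expr=2) for c in (iso_1689184, iso_1462133)]
```

Under these assumptions neither row passes the default feature-level thresholds, since no sample reaches 5 reads. Lowering --drim4 to 2 would let both through at that level. The event-level check (drim1 and drim3) still has to pass separately.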

Please reopen this ticket if you have any more questions.