comprna / SUPPA

SUPPA: Fast quantification of splicing and differential splicing
MIT License
263 stars 62 forks source link

SUPPA and StringTie2 multi-samples #79

Closed nirkoty closed 2 years ago

nirkoty commented 4 years ago

Hi,

I use StringTie2 to assemble rna from samples in 2 conditions, with 3 replicates in each condition. I end up with a gtf file like this one for each sample: image

When running SUPPA generateEvents it takes the gene_id and transcript_id, therefore STRG.3148 and STRG.3148.1 for example and as a result my ioe looks like this:

chr1 STRG.57 STRG.57;A5:chr1:1391575-1392679:1391575-1392790:- STRG.57.2 STRG.57.7,STRG.57.2,STRG.57.9

However, these ids are specific to one assembly, therefore, when comparing between samples and conditions, the ids don't match. I managed to get the psi for all the samples by running psiPerEvent for each sample individually and then merging the dataframes based on the event id without the gene, for example A5:chr1:1391575-1392679:1391575-1392790:-

However I am stuck at the next step, diffSplice, as the tpm and ioe files are for combined samples, but then the ids won't match across samples. I just need to get a p-value for each event to be able to do some filtering.

What should I do?

Thanks in advance for your help.

EduEyras commented 4 years ago

Hi,

Thanks for your query. I can see several options:

1) If StringTie2 assigns abundances to a known annotation you might be able to use those. But this only works if you already have an annotation.

2) Does StringTie provide a way to merge predictions across experiments? That could help to match transcript IDs across experiments. Cufflinks used to have Cuffmerge, although I don't know whether it was good enough at that.

3) A better one would be to pool all samples and run them through StringTie2 to generate an annotation. Then re-run StringTie2 per sample with that annotation. Transcripts with no expression in a sample would be zeroes (they may end up with very few reads though). Then you can use the same set of IDs across all experiments and run SUPPA.

I hope this helps. Please let me know how it goes

Best

Eduardo

ajw2329 commented 4 years ago

Hello,

I thought I'd chime in since I am a bit familiar with stringtie, and have used it before to supplement annotations for use with SUPPA.

stringtie does indeed have a merge tool, and I believe the recommended workflow is the following: 1) run stringtie individually on all samples (as has already been done in this case).
2) run stringtie merge with all output GTFs as input (e.g. stringtie --merge -G annotation.gtf -o stringtie_merged.gtf /path/to/txt/with/gtf_paths.txt).
3) Re-run stringtie in quantification mode on all samples, using the merged GTF as the reference e.g. stringtie –e -G stringtie_merged.gtf -o sample.gtf sample.bam. Note that the -e flag disallows transcript assembly, so in step 3 stringtie will only be determining TPM values for your pan-experiment consensus transcripts generated by stringtie merge.

At this point you should be able to run diffSplice since you will no longer have the ID issue.

Check out the stringtie protocol paper (https://www.nature.com/articles/nprot.2016.095) and manual (https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual) for more detail. Hope this helps!

EduEyras commented 4 years ago

Great! Thanks a lot for the comment.

I am curious, is it better (as in more accurate) to run the merge and quantify, than to build the annotation first with all samples together? I know the merge is tricky and can lead to very arbitrary choices.

I was wondering whether they did that comparison, and that's how they stuck with merge.

cheers

Eduardo

ajw2329 commented 4 years ago

I'm not really sure - I don't remember this being addressed in either of the StringTie papers.

However, I did recall and manage to dig up a seqanswers post from Cole Trapnell back in 2008, in which he briefly addresses this: http://seqanswers.com/forums/showpost.php?p=62338&postcount=9 (post) and http://seqanswers.com/forums/showthread.php?t=16422 (thread). He raises the concern of performance when attempting to assemble transcripts from a concatenated BAM using cufflinks. Granted the performance of stringtie is enormously improved over cufflinks, but it's probably still a fair point with very large experiment sizes. He also tantalizingly mentions some concerns over accuracy with the pre-merged bam approach, but doesn't elaborate.

One possible situation I can imagine is if different groups of samples produce largely or exclusively different isoforms in a particular gene or set of genes. Performing the assembly separately might increase the chance of successfully assembling the distinct isoforms. If all reads from all samples have to be considered simultaneously the task might grow more difficult (depending on the nature of the isoforms in question and the ways in which they differ). If that's correct though, then maybe a combined approach might work well. One could try concatenating bams from different sample groups and generating a preliminary assembly for each one, and then running stringtie merge on the resulting several GTFs. This is all really just a guess though. If I ever get around to playing around with this idea I will let you know how it goes!

Best, Andrew

EduEyras commented 4 years ago

Thanks for email!

I can imagine the arcane reasons. I now recall doing the pooling before the assembly once before and cufflinks predicting transcripts in regions that accumulated enough background reads to look reasonable. It will probably require careful tuning. Still, something that is now more affordable computationally to test.

best

Eduardo

On Tue, 28 Apr 2020 at 13:16, ajw2329 notifications@github.com wrote:

I'm not really sure - I don't remember this being addressed in either of the StringTie papers.

However, I did recall and manage to dig up a seqanswers post from Cole Trapnell back in 2008, in which he briefly addresses this: http://seqanswers.com/forums/showpost.php?p=62338&postcount=9 (post) and http://seqanswers.com/forums/showthread.php?t=16422 (thread). He raises the concern of performance when attempting to assemble transcripts from a concatenated BAM using cufflinks. Granted the performance of stringtie is enormously improved over cufflinks, but it's probably still a fair point with very large experiment sizes. He also tantalizingly mentions some concerns over accuracy with the pre-merged bam approach, but doesn't elaborate.

One possible situation I can imagine is if different groups of samples produce largely or exclusively different isoforms in a particular gene or set of genes. Performing the assembly separately might increase the chance of successfully assembling the distinct isoforms. If all reads from all samples have to be considered simultaneously the task might grow more difficult (depending on the nature of the isoforms in question and the ways in which they differ). If that's correct though, then maybe a combined approach might work well. One could try concatenating bams from different sample groups and generating a preliminary assembly for each one, and then running stringtie merge on the resulting several GTFs. This is all really just a guess though. If I ever get around to playing around with this idea I will let you know how it goes!

Best, Andrew

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/79#issuecomment-620353889, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB4NOHBJNS3DN2CXETDROZDATANCNFSM4MSAPPRQ .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

nirkoty commented 4 years ago

Thank you both for your replies. I am using the merge option in StringTie, however as Eduardo mentioned, it is sometimes arbitrary and I have noticed several issues with it, for example different genes being merged into one. That's why from my experience I prefer to work with gene ids specific to one sample assembly.

Even after doing the assembly again against the merged gtf, some transcript are not not mapped to it (I still don't understand why thought, I use the most flexible parameters possible during the merge). With the -e option, these transcripts will be ignored. It would be okay if they were rare and with low TPM. However I've noticed there are quite a few of them and some of them are very abundant, some of them are sometimes by far the most abundant transcript in the gene, in terms of TPM. Therefore I don't think it should be ignored.

I had not heard of combining bam files before and that could be a good option. I'm not an expert in StringTie2, I've only been using it for a few months, but from my understanding, it build transcripts using an exons graph. So let's say in one sample you have isoform A more abundant than isoform B, and in the other it's the opposite, maybe the 2 will average out and lead StringTie to think there is only one transcript? I'm note sure, maybe I can give it a try.

The other idea I had while looking at SUPPA2 code was to hack it a bit to support multiple ioe and transcript files.

Let's say we end up with a list of splicing events and their psi across different samples (that's the point I've reached so far)

For example (note that the gene id has been removed from the event name) image

Then I could pass a list of ioe files. For each event, the program will look for the event and retrieve the local assembly transcript ids by iterating through the different ioe files: image

Then it is easy to retrieve the transcript abundance and I can just send it back to the rest of the SUPPA pipeline. It's just a small hack, instead of reading the transcript abundance from one ioe and transcript file, we read it from multiple, with one per sample.

Do you think it makes sense? I haven't implemented it yet, but by having a look at the SUPPA code, I think that's doable.

Best,

Esteban

EduEyras commented 4 years ago

Hi Esteban,

I like your proposal to hack the SUPPA code to do that. I had that in mind, but did not suggest it in case it would become too complex. And there were easier options. It might even not require changing the code.

The .ioe simply states the abundances of transcripts contributing to the inclusion or skipping of an event. So you can summarize that in new transcript ids:

PSI = T1 / (T1+T2), where T1+T2 is the sum of the abundances for the last column in the .ioe, so T2 = sum - T1

Then you can create you abundance file for all those made-up transcripts T1, T2, ....

SUPPA will basically read the PSIs, and the abundances for T1, T2,..... without worrying too much about whether these are real transcripts in a GTF (lack of this check makes it very flexible).

I think it could work.

About using StringTie2 with merged BAMs, I think it should be ok for the example you mention. You would have evidence for two transcripts, and both should appear in the output if they have different (enough) exon-intron structures. This will be helped of course by the spliced short-reads or the long-reads if you have them. The key of this process is that you do not use these abundances, but just the annotation. And then recalculate abundances with each sample.

There was a paper where they also proposed to drop transcripts from the annotation that had low abundance in a sample, and then recalculate the abundances to improve the estimates (I think it was this one: https://www.pnas.org/content/112/23/E3050). This could be done after the recalculation of abundances from the merged annotation. It will use a common assembly, but will minimize the issues of quantifying transcripts with too many transcripts in the annotation (it spreads out too much the abundance over possibly non relevant transcripts for that sample).

Let me know how it goes

Best

Eduardo

On Wed, 29 Apr 2020 at 01:42, nirkoty notifications@github.com wrote:

Thank you both for your replies. I am using the merge option in StringTie, however as Eduardo mentioned, it is sometimes arbitrary and I have noticed several issues with it, for example different genes being merged into one. That's why from my experience I prefer to work with gene ids specific to one sample assembly.

Even after doing the assembly again against the merged gtf, some transcript are not not mapped to it (I still don't understand why thought, I use the most flexible parameters possible during the merge). With the -e option, these transcripts will be ignored. It would be okay if they were rare and with low TPM. However I've noticed there are quite a few of them and some of them are very abundant, some of them are sometimes by far the most abundant transcript in the gene, in terms of TPM. Therefore I don't think it should be ignored.

I had not heard of combining bam files before and that could be a good option. I'm not an expert in StringTie2, I've only been using it for a few months, but from my understanding, it build transcripts using an exons graph. So let's say in one sample you have isoform A more abundant than isoform B, and in the other it's the opposite, maybe the 2 will average out and lead StringTie to think there is only one transcript? I'm note sure, maybe I can give it a try.

The other idea I had while looking at SUPPA2 code was to hack it a bit to support multiple ioe and transcript files.

Let's say we end up with a list of splicing events and their psi across different samples (that's the point I've reached so far)

For example (note that the gene id has been removed from the event name) [image: image] https://user-images.githubusercontent.com/19326121/80506046-006ff800-896d-11ea-905e-e48f5253eb6c.png

Then I could pass a list of ioe files. For each event, the program will look for the event and retrieve the local assembly transcript ids by iterating through the different ioe files: [image: image] https://user-images.githubusercontent.com/19326121/80506308-4c22a180-896d-11ea-967e-8295522da159.png

Then it is easy to retrieve the transcript abundance and I can just send it back to the rest of the SUPPA pipeline. It's just a small hack, instead of reading the transcript abundance from one ioe and transcript file, we read it from multiple, with one per sample.

Do you think it makes sense? I haven't implemented it yet, but by having a look at the SUPPA code, I think that's doable.

Best,

Esteban

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/79#issuecomment-620686231, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKBYRWRJQJPHLUUXJ4XLRO32NPANCNFSM4MSAPPRQ .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

EduEyras commented 2 years ago

I hope these strategies worked!