GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0
171 stars 22 forks source link

Discrepancy gene counts (bambu) with less and more sequencing depth #393

Closed NikoLichi closed 8 months ago

NikoLichi commented 9 months ago

Hi there, I have been using bambu lately and I really like it!

However, when performing differential gene expression analysis, I noticed some discrepancies between bambu counts. It may be a bit similar to the issue 281.

I have two conditions (Control and stimulated treatment), and have three replicates per sample and we performed one extra sequencing round to increase sequencing depth. I ran bambu with all the samples together: the first sequencing round (i.e., FC1), the second round (FC4) and the merged BAM files (i.e., Merg_).

When running differential gene expression and comparing the results of FC1 vs Merged samples, I noticed some genes coming up in the FC1 not present in the Merg_ samples. So, I checked the counts and noticed a discrepancy between some of those genes (Please see the highlighted genes in the table below). As you can see, for some genes there is a difference between 1-4 reads, but for others the difference is quite high! (i.e., Mer_P3stimulated=13 vs FC1P3stimulated=105). Anyway, the counts on the **Merg samples should be more than in FC1**. What would happen here?

Screenshot 2023-09-20 at 15 56 04

I also checked this weird scenario for BambuGene27716 using IGV. Indeed, there are more reads supporting the Merged sample (see IGV figure bottom sample) than in the FC1 (top sample IGV). Screenshot 2023-09-20 at 14 57 17

I hope to be clear :) Any hint on how to solve this is very welcome!

All the best, Niko

andredsim commented 9 months ago

Hi Niko,

To summarise your issue, the counts from a combined bam are not equaling the sum of the counts from the two individual bams.

This is not wholly unexpected, as first we group reads together into read classes (by sample) which are then mapped to transcripts/genes. As the bams are combined, a different group of reads are now being combined so it is possible the read class boundaries would be slightly different. These different boundaries can then lead to read classes being assigned to different transcripts and gene in cases where they overlap with multiple annotations. This is especially true for reads without any splice junctions as these are harder to cleanly group and assign. So the instances of there being ~5 difference is not too concerning.

I notice the gene with the largest discrepancy is a newly discovered gene. To understand why this is occurring could you explain how you did transcript discovery? Was Bambu run in one command with all samples together, or did you do discovery first without the merged samples? Does this newly found gene overlap with any other genes? What I believe might be happening is a gene was discovered for FC_P1stimulated and FC_P4stimulated, and similar gene with slightly different boundaries was discovered in Merg_P1stimulated, resulting in them each being assigned to their own versions. If this is the case, this could be solved by first running bambu discovery only (quant = FALSE) with either just the unmerged bams, or just the merged bams. Then take that output and rerun bambu with discovery = FALSE, and annotations = discoveryOutput. I will put this into code below so it is clearer.

extendedAnnotations = bambu(reads = unmergedBam, annotations = annotations, genome = fa.file, quant = FALSE)
se = bambu(reads = allBams, annotations = extendedAnnotations, genome = fa.file, discovery = FALSE)

Let me know if you have any further questions, and hopefully we can resolve this! Kind Regards, Andre Sim

NikoLichi commented 9 months ago

Hi Andre,

Thanks for the reply and suggestions. Understood the issue with the instances with ~5 differences. From the paper, "In the second step, read classes from all samples are combined", I somehow understood that these read classes were merged/pruned , but now it is clear.

Yes, Bambu was run in one command with all the samples together (individual samples from both runs and also the merged bams). I also leave the NDR to set automatically (0.466). I also checked and indeed this novel BambuGene overlaps with another large gene (ENSG00000145901.16 : 151029945 -151093577), which is located in the negative strand, while this new bambu gene (BambuGene27716 : 151081181 - 151082952) is in the positive strand. Does it have an impact?

Thanks for the hint on how to run bambu first under discovery mode only and then using the discovered annotations for the quantification step. I will run it as you mentioned. This could work with small data sets and in particular in my case of re-sequencing (i.e., better to take the merged bams). However, what if there is a large data set (no re-sequenced samples)? Should I run all my samples in one bambu line (as I did before) allowing for transcript discovery and abundance estimation at the same time, or is it better to always run fewer samples (including samples with potential transcript variation, e.g. tissues/treatments/etc) in discovery mode and then abundance estimation?

I'll update you soon as I run bambu with the instructed command on the merged bams ;)

All the best, Niko

andredsim commented 9 months ago

Hi Niko,

Yes having two genes on opposite strands can be impactful as the ability to strand reads is not 100%. This would lead some reads to be assigned to the "wrong" strong and this could explain the discrepancy. Perhaps the read class derived from the merged sample were stranded differently from the individual samples.

Generally it is fine to run Bambu in one line. Running Bambu discovery and quantification separately doesn't make a difference to the output or runtime unless you customize the annotations provided to the quantification step as in the example I gave earlier. However, whether you want to include different sample conditions together during discovery and quantification is up to the experimental design as both ways have their advantages and disadvantages. By combining them all together and generating a pan-annotation set it means you can compare quantification for novel transcripts easier as they will have unified ids. This comes with the cost of introducing ambiguity in read transcript assignments if there are novel transcripts not expressed in some of the samples.

Hope this helps, and let me know how it goes! Kind Regards, Andre Sim

NikoLichi commented 9 months ago

Hi Andre,

Thanks again for your ideas and for clarifying how to properly run bambu. I will need to run bambu with all the samples for the large cohort.

Coming back to the novel Bambu gene. I re-ran bambu, as you mentioned, using the merged samples as input since this will have all potential novel transcripts. Bambu reported a novel discovery rate (NDR) of: 0.227, which is lower than the previous run (0.466). I only focused on the gene region mentioned above, and bambu detected a novel gene again in the positive strand (BambuGene15851, chr5: 151057858-151059914) and as you can see in the attached figure again for that particular sample P3stimulated the numbers of the merged samples vs the first run (FC1). So... the issue is still happening. This could be due to the strandedness issue you mentioned before...

Screenshot 2023-09-22 at 20 24 15

What do you think could be done to solve this? Running bambu in discovery mode with the FC1 (instead of the merged samples) and then using those annotations for the quantification mode?

All the best, Niko

andredsim commented 9 months ago

Hi Niko,

I would expect a lower NDR since from what I gather this is a human sample, so less novel transcripts are expected. Therefore I would not worry the novel discovery rate is lower (generally we expect around .1-.2 for human data).

I see the difference now between FC1+FC4 vs Merged is lower than previously where it was >100. The open question is whether having this discrepancy is an issue or not. While it seems intuitive that merging two bam files should equal the sum of the counts of them run separately, by running them separately, bambu has less power to construct read classes and is more likely to assign counts differently. Therefore the existence of a discrepancy between FC1+FC4 vs Merged is not unexpected and we can not be certain if this alternative stranded transcript is an issue, or represents the truth without a ground truth.

Therefore this changes the question to, "which is more likely to be correct?". I conferred with Dr. Chen Ying who worked on the quantification part of Bambu and we agreed that as runs from FC1 and FC4 are the same sample type, the counts from the merged bam file are more likely to be correct. Therefore, we would recommend just using counts from the merged bams for downstream analysis.

To clarify that would mean running bambu like so se = bambu(reads = c('P1C_merged.bam', 'P2C_merged.bam', 'P3C_merged.bam', 'P1stimulated_merged.bam', 'P2stimulated_merged.bam', 'P3stimulated_merged.bam')), annotations = annotations, genome = fa.file

Let me know if this clarifies this for you and enables you to progress with your data, Kind Regards, Andre Sim

NikoLichi commented 8 months ago

Hi Andre,

Thanks a lot for your input, and sorry for my late reply. I went to a conference and then forgot to reply to you.

Yes, I also agree that the merged sample bams are likely to be correct, but then making the comparison with less (FC1) and more sequencing rounds (merged FC1 and FC4) using bambu would be difficult... I guess I will have to use another tool only for this question.

For this particular issue/transcript, I think, the issue falls in the strandedness... I hope bambu would have a way to pounder the strandedness in a region and assign reads "correctly" to the novel or legacy annotations.

Thanks again for your help. I will continue using bambu and report any other potential issues.

All the best, Niko