Closed SichongP closed 2 months ago
I've also tried -ndd
to disable removing duplicates in RT mode but that didn't change the results. My bam file already has duplicates identified based on UMI and removed. No reads have duplicate flag in input bam file.
Hi @SichongP,
thanks for your message. So, in order to understand the reason for such differences, I will have to look at the BAM file. Can you share it, or even just a small part of it?
Mutational profiling and RT experiments are very different experiments, and they are subjected to very different filtering during the analysis. For instance, in RT-stop, the coverage comes solely from the reads mapping to the same strand as the RNA, so the ones the RT-stop position is inferred from.
Furthermore, reads having been soft-clipped at their 5' end will not be considered in a RT-stop type of experiment, because the true position of the RT-stop cannot be non-ambiguously assigned. I suspect this might be the main reason of the discrepancies you see, especially if you have used Bowtie2 to align your reads, with soft-clipping enabled. If that's the case, setting the --include-clipped
parameter of rf-count should show significantly higher counts.
Please let me know if this helps.
All the best, Danny
Hi Danny,
Thank you for responding so quickly! I have attached a bam file here for you to have a look (note: this bam is unsorted): https://www.dropbox.com/t/dSuBg76DzqpDjkgF
Thanks for pointing out the issue with soft clipping. I have re-run our alignment to disallow soft clipping and the RT coverage now is much closer to MaP coverage as well as samtools depth
output. Although there are still some discrepencies:
RT:
23S
G 3850 3850
G 1780 5630
T 2958 8588
T 2506 11094
A 1835 12929
A 1244 14173
G 1410 15583
C 1054 16637
G 2733 19370
A 1684 21054
C 1887 22941
T 1736 24677
A 1246 25923
A 1232 27155
G 1615 28770
C 1583 30353
G 1041 31394
T 2189 33583
A 539 34122
C 1899 36021
MaP:
23S
G 1032 2185
G 2997 6022
T 745 7803
T 1395 10759
A 1655 13264
A 1511 15093
G 1162 16339
C 1216 17747
G 605 18816
A 1730 21540
C 1668 23223
T 1300 25112
A 1790 26849
A 1286 28095
G 1063 29325
C 1369 30940
G 1293 32524
T 515 33565
A 1415 35754
C 650 36293
samtools depth:
23S 1 2215
23S 2 6065
23S 3 7847
23S 4 10806
23S 5 13316
23S 6 15159
23S 7 16405
23S 8 17812
23S 9 18888
23S 10 21625
23S 11 23308
23S 12 25199
23S 13 26932
23S 14 28189
23S 15 29416
23S 16 31022
23S 17 32616
23S 18 33659
23S 19 35834
23S 20 36376
samtools depth --excl-flags 16:
23S 1 2213
23S 2 6063
23S 3 7843
23S 4 10801
23S 5 13307
23S 6 15142
23S 7 16383
23S 8 17790
23S 9 18847
23S 10 21581
23S 11 23264
23S 12 25147
23S 13 26878
23S 14 28134
23S 15 29361
23S 16 30967
23S 17 32559
23S 18 33601
23S 19 35776
23S 20 36318
Ok, I will check the file tomorrow to confirm, but I believe the discrepancies left are due to the fact that, in RT-stop mode, the position preceding the start mapping position of a read, corresponding to the base that triggered the stop on that specific read/cDNA, is also considered covered (as detailed in the doc).
Best, Danny
@SichongP, your BAM file was corrupted/truncated but managed to test it. I confirm the above reason for the "discrepancies". rf-count is calculating the coverage and rt-stop properly. Feel free to close the issue if this addresses your concerns.
Best, Danny
@SichongP, your BAM file was corrupted/truncated but managed to test it. I confirm the above reason for the "discrepancies". rf-count is calculating the coverage and rt-stop properly. Feel free to close the issue if this addresses your concerns.
Best, Danny
Sorry about the bam file! I had to manually remove some details from the header and maybe something went wrong in the process. I can upload a new one if needed.
but I believe the discrepancies left are due to the fact that, in RT-stop mode, the position preceding the start mapping position of a read, corresponding to the base that triggered the stop on that specific read/cDNA, is also considered covered (as detailed in the doc).
That explains when RT coverage is higher than MaP coverage. But there are still some bases where MaP coverage is higher (highlighted in bold):
RT:
23S G 3850 3850 G 1780 5630 T 2958 8588 T 2506 11094 A 1835 12929 A 1244 14173 G 1410 15583 C 1054 16637 G 2733 19370 A 1684 21054 C 1887 22941 T 1736 24677 A 1246 25923 A 1232 27155 G 1615 28770 C 1583 30353 G 1041 31394 T 2189 33583 A 539 34122 C 1899 36021
MaP:
23S G 1032 2185 G 2997 6022 T 745 7803 T 1395 10759 A 1655 13264 A 1511 15093 G 1162 16339 C 1216 17747 G 605 18816 A 1730 21540 C 1668 23223 T 1300 25112 A 1790 26849 A 1286 28095 G 1063 29325 C 1369 30940 G 1293 32524 T 515 33565 A 1415 35754 C 650 36293
Also, I've noticed that the last two bases always have 0 coverage in RT output but not in MaP output. It makes sense that the last 20 or so bases would not have any RT counts but what is the reason for the last two bases always having 0 coverage?
I see what you mean. Please try to git pull now. The lack of coverage of the first bases was cause reads where the RT-stop would end before the start of the transcript were excluded. I have now fixed this behaviour.
Concerning the mutation/rt-stop coverage, with the partial bam file you sent, I could not replicate it. This is what I see:
23S (mut) G 43 97 G 117 259 T 28 339 T 58 451 A 50 548 A 61 626 G 36 675 C 47 724 G 20 768 A 75 896 C 68 962 T 74 1057 A 60 1119 A 45 1157 G 34 1198 C 67 1267 G 48 1330 T 12 1369 A 67 1469
23S (rt-stop) G 163 262 G 80 342 T 112 454 T 97 551 A 78 629 A 49 678 G 49 727 C 44 771 G 128 899 A 66 965 C 95 1060 T 62 1122 A 38 1160 A 41 1201 G 69 1270 C 63 1333 G 39 1372 T 100 1472 A 32 1504
Any update here? Can i close the issue?
Any update here? Can i close the issue?
Sorry I haven't had time to fix the bam file. I'll share a complete bam file later today.
I've also pulled latest repo but the problem with 0 coverage at last two bases still persists. Additionally, I noticed that MaP counting output always shows 0 mutations for the first two bases now too.
I am sorry @SichongP but by using the previously shared BAM file, I do not see anything like that, whether I use the rt-stop or mutation mode. A larger BAM file won't make the difference. This seems to be an issue on your end.
I suggest you to start with a fresh RNAFramework install, and check your Perl installation is properly working.
Best, Danny
Ok I will try that. Thank you!
@dincarnato Here is a link to the test files and output I just ran, along with commands: https://www.dropbox.com/t/goucqiVPdp23WHcd
I have also comfirmed that I have samtools=1.20
, perl=5.32.1
, and bedtools=2.31.1
, along with a fresh installation of RNAFramework, pinned to
commit bffccb1607b70d3cb9288345b9d468f483958d9c
As you can see in the output files (rf_count_rt/EE010_03.sorted.tsv
and rf_count_mut/EE010_03.sorted.tsv
), RT output (rf_count_rt/EE010_03.sorted.tsv
) is still missing coverage at the last two bases. Additionally, comparing coverage in the two output files, there are 1813 positions where the coverage in RT output is lower than the coverage in MaP output (rf_count_mut/EE010_03.sorted.tsv
), which aren;t explained by the way RT attributes stop events to coverage.
Hi @SichongP,
please git pull. The last 2 bases with 0 counts were due to a minor issue, but did not affect the rest. The reason you do see higher coverages in some parts of the RNA, is because in your BAM file you have both reads mapping to the + and - strand. RT-stop analysis only considers reads mapping to the + strand (strand of the RNA), while mutational analysis uses both.
Best, Danny
Hi @SichongP,
please git pull. The last 2 bases with 0 counts were due to a minor issue, but did not affect the rest. The reason you do see higher coverages in some parts of the RNA, is because in your BAM file you have both reads mapping to the + and - strand. RT-stop analysis only considers reads mapping to the + strand (strand of the RNA), while mutational analysis uses both.
Best, Danny
This makes sense, thank you!
Hi,
I'm using
rf-count
to profile RT stop events in some samples and while playing around with settings, I noticed that it gives wildly different coverage depending on if it's counting RT stop or mutations (without or with-m
).Here is an example of the first 20 bases of each output:
rf-count -p 1 -a -o rt_stop -r -f 23S.fa -pn EE010_03.bam -ow
rf-count -p 1 -a -o map -r -f 23S.fa -pn -q 32 -m EE010_03.bam -ow
Both tsv files were generated using
rf-rctools view -t
and as I understand it, the third column gives the per-base coverage, which should be consistent in both modes?The MaP coverage output is more consistent with
samtools depth
, with some small differences likely explained by slight differences in filters used:The only information I could find about the differences in coverage between RT vs MaP is here: https://rnaframework-docs.readthedocs.io/en/latest/rf-count/#coverage-calculation But this doesn't explain what I'm seeing as according to that, RT coverage should always be equal or higher than MaP coverage, not this significantly lower?
Could you help me understand what exactly rf-count is counting in RT mode, and how this difference could be explained?
Thank you,