dincarnato / RNAFramework

RNA structure probing and post-transcriptional modifications mapping high-throughput data analysis
http://www.rnaframework.com
GNU General Public License v3.0
31 stars 11 forks source link

Differences in coverage between RT stop and MaP mode for rf-count? #62

Closed SichongP closed 2 months ago

SichongP commented 3 months ago

Hi,

I'm using rf-count to profile RT stop events in some samples and while playing around with settings, I noticed that it gives wildly different coverage depending on if it's counting RT stop or mutations (without or with -m).

Here is an example of the first 20 bases of each output:

23S
G   965 965
G   1134    2099
T   1181    3280
T   756 4036
A   418 4454
A   362 4816
G   236 5052
C   433 5485
G   942 6427
A   235 6662
C   395 7057
T   212 7269
A   160 7429
A   280 7709
G   263 7972
C   277 8249
G   449 8698
T   754 9452
A   98  9550
C   592 10142
23S
G   0   103610
G   0   105186
T   1383    108750
T   18  110244
A   184 111909
A   25  113301
G   145 114456
C   83  115367
G   165 116610
A   125 117730
C   107 118943
T   58  120427
A   78  121517
A   82  122766
G   269 123853
C   153 124769
G   169 125942
T   91  127152
A   66  128186
C   258 129150

Both tsv files were generated using rf-rctools view -t and as I understand it, the third column gives the per-base coverage, which should be consistent in both modes?

The MaP coverage output is more consistent with samtools depth, with some small differences likely explained by slight differences in filters used:

23S     1       103613
23S     2       105189
23S     3       108753
23S     4       110247
23S     5       111912
23S     6       113304
23S     7       114380
23S     8       115318
23S     9       116580
23S     10      117676
23S     11      118908
23S     12      120400
23S     13      121476
23S     14      122765
23S     15      123818
23S     16      124722
23S     17      125924
23S     18      127141
23S     19      128131
23S     20      129102

The only information I could find about the differences in coverage between RT vs MaP is here: https://rnaframework-docs.readthedocs.io/en/latest/rf-count/#coverage-calculation But this doesn't explain what I'm seeing as according to that, RT coverage should always be equal or higher than MaP coverage, not this significantly lower?

Could you help me understand what exactly rf-count is counting in RT mode, and how this difference could be explained?

Thank you,

SichongP commented 3 months ago

I've also tried -ndd to disable removing duplicates in RT mode but that didn't change the results. My bam file already has duplicates identified based on UMI and removed. No reads have duplicate flag in input bam file.

dincarnato commented 3 months ago

Hi @SichongP,

thanks for your message. So, in order to understand the reason for such differences, I will have to look at the BAM file. Can you share it, or even just a small part of it?

Mutational profiling and RT experiments are very different experiments, and they are subjected to very different filtering during the analysis. For instance, in RT-stop, the coverage comes solely from the reads mapping to the same strand as the RNA, so the ones the RT-stop position is inferred from.

Furthermore, reads having been soft-clipped at their 5' end will not be considered in a RT-stop type of experiment, because the true position of the RT-stop cannot be non-ambiguously assigned. I suspect this might be the main reason of the discrepancies you see, especially if you have used Bowtie2 to align your reads, with soft-clipping enabled. If that's the case, setting the --include-clipped parameter of rf-count should show significantly higher counts.

Please let me know if this helps.

All the best, Danny

SichongP commented 3 months ago

Hi Danny,

Thank you for responding so quickly! I have attached a bam file here for you to have a look (note: this bam is unsorted): https://www.dropbox.com/t/dSuBg76DzqpDjkgF

Thanks for pointing out the issue with soft clipping. I have re-run our alignment to disallow soft clipping and the RT coverage now is much closer to MaP coverage as well as samtools depth output. Although there are still some discrepencies:

23S
G   1032    2185
G   2997    6022
T   745 7803
T   1395    10759
A   1655    13264
A   1511    15093
G   1162    16339
C   1216    17747
G   605 18816
A   1730    21540
C   1668    23223
T   1300    25112
A   1790    26849
A   1286    28095
G   1063    29325
C   1369    30940
G   1293    32524
T   515 33565
A   1415    35754
C   650 36293
23S     1       2213
23S     2       6063
23S     3       7843
23S     4       10801
23S     5       13307
23S     6       15142
23S     7       16383
23S     8       17790
23S     9       18847
23S     10      21581
23S     11      23264
23S     12      25147
23S     13      26878
23S     14      28134
23S     15      29361
23S     16      30967
23S     17      32559
23S     18      33601
23S     19      35776
23S     20      36318
dincarnato commented 3 months ago

Ok, I will check the file tomorrow to confirm, but I believe the discrepancies left are due to the fact that, in RT-stop mode, the position preceding the start mapping position of a read, corresponding to the base that triggered the stop on that specific read/cDNA, is also considered covered (as detailed in the doc).

Best, Danny

dincarnato commented 2 months ago

@SichongP, your BAM file was corrupted/truncated but managed to test it. I confirm the above reason for the "discrepancies". rf-count is calculating the coverage and rt-stop properly. Feel free to close the issue if this addresses your concerns.

Best, Danny

SichongP commented 2 months ago

@SichongP, your BAM file was corrupted/truncated but managed to test it. I confirm the above reason for the "discrepancies". rf-count is calculating the coverage and rt-stop properly. Feel free to close the issue if this addresses your concerns.

Best, Danny

Sorry about the bam file! I had to manually remove some details from the header and maybe something went wrong in the process. I can upload a new one if needed.

but I believe the discrepancies left are due to the fact that, in RT-stop mode, the position preceding the start mapping position of a read, corresponding to the base that triggered the stop on that specific read/cDNA, is also considered covered (as detailed in the doc).

That explains when RT coverage is higher than MaP coverage. But there are still some bases where MaP coverage is higher (highlighted in bold):

RT:

23S G 3850 3850 G 1780 5630 T 2958 8588 T 2506 11094 A 1835 12929 A 1244 14173 G 1410 15583 C 1054 16637 G 2733 19370 A 1684 21054 C 1887 22941 T 1736 24677 A 1246 25923 A 1232 27155 G 1615 28770 C 1583 30353 G 1041 31394 T 2189 33583 A 539 34122 C 1899 36021

MaP:

23S G 1032 2185 G 2997 6022 T 745 7803 T 1395 10759 A 1655 13264 A 1511 15093 G 1162 16339 C 1216 17747 G 605 18816 A 1730 21540 C 1668 23223 T 1300 25112 A 1790 26849 A 1286 28095 G 1063 29325 C 1369 30940 G 1293 32524 T 515 33565 A 1415 35754 C 650 36293

SichongP commented 2 months ago

Also, I've noticed that the last two bases always have 0 coverage in RT output but not in MaP output. It makes sense that the last 20 or so bases would not have any RT counts but what is the reason for the last two bases always having 0 coverage?

dincarnato commented 2 months ago

I see what you mean. Please try to git pull now. The lack of coverage of the first bases was cause reads where the RT-stop would end before the start of the transcript were excluded. I have now fixed this behaviour.

Concerning the mutation/rt-stop coverage, with the partial bam file you sent, I could not replicate it. This is what I see:

23S (mut) G 43 97 G 117 259 T 28 339 T 58 451 A 50 548 A 61 626 G 36 675 C 47 724 G 20 768 A 75 896 C 68 962 T 74 1057 A 60 1119 A 45 1157 G 34 1198 C 67 1267 G 48 1330 T 12 1369 A 67 1469

23S (rt-stop) G 163 262 G 80 342 T 112 454 T 97 551 A 78 629 A 49 678 G 49 727 C 44 771 G 128 899 A 66 965 C 95 1060 T 62 1122 A 38 1160 A 41 1201 G 69 1270 C 63 1333 G 39 1372 T 100 1472 A 32 1504

dincarnato commented 2 months ago

Any update here? Can i close the issue?

SichongP commented 2 months ago

Any update here? Can i close the issue?

Sorry I haven't had time to fix the bam file. I'll share a complete bam file later today.

I've also pulled latest repo but the problem with 0 coverage at last two bases still persists. Additionally, I noticed that MaP counting output always shows 0 mutations for the first two bases now too.

dincarnato commented 2 months ago

I am sorry @SichongP but by using the previously shared BAM file, I do not see anything like that, whether I use the rt-stop or mutation mode. A larger BAM file won't make the difference. This seems to be an issue on your end.

I suggest you to start with a fresh RNAFramework install, and check your Perl installation is properly working.

Best, Danny

SichongP commented 2 months ago

Ok I will try that. Thank you!

SichongP commented 2 months ago

@dincarnato Here is a link to the test files and output I just ran, along with commands: https://www.dropbox.com/t/goucqiVPdp23WHcd

I have also comfirmed that I have samtools=1.20, perl=5.32.1, and bedtools=2.31.1, along with a fresh installation of RNAFramework, pinned to commit bffccb1607b70d3cb9288345b9d468f483958d9c

As you can see in the output files (rf_count_rt/EE010_03.sorted.tsv and rf_count_mut/EE010_03.sorted.tsv), RT output (rf_count_rt/EE010_03.sorted.tsv) is still missing coverage at the last two bases. Additionally, comparing coverage in the two output files, there are 1813 positions where the coverage in RT output is lower than the coverage in MaP output (rf_count_mut/EE010_03.sorted.tsv), which aren;t explained by the way RT attributes stop events to coverage.

dincarnato commented 2 months ago

Hi @SichongP,

please git pull. The last 2 bases with 0 counts were due to a minor issue, but did not affect the rest. The reason you do see higher coverages in some parts of the RNA, is because in your BAM file you have both reads mapping to the + and - strand. RT-stop analysis only considers reads mapping to the + strand (strand of the RNA), while mutational analysis uses both.

Best, Danny

SichongP commented 2 months ago

Hi @SichongP,

please git pull. The last 2 bases with 0 counts were due to a minor issue, but did not affect the rest. The reason you do see higher coverages in some parts of the RNA, is because in your BAM file you have both reads mapping to the + and - strand. RT-stop analysis only considers reads mapping to the + strand (strand of the RNA), while mutational analysis uses both.

Best, Danny

This makes sense, thank you!