dincarnato / RNAFramework

RNA structure probing and post-transcriptional modifications mapping high-throughput data analysis
http://www.rnaframework.com
GNU General Public License v3.0
31 stars 11 forks source link

rf-rctools stats #60

Closed light0112 closed 4 months ago

light0112 commented 4 months ago

Hi, Very useful feature to print the read numbers that are used for counting. I am trying to use this information to filter transcripts with high read numbers. However, I got confused by the difference between the reported total read number and the manually summed total read number. For example, in one of my rc file, the reported total reads for all the transcripts are 17,716,410. But when I added up the numbers from all the individual transcripts, I got 12,579,347. What could be the reason for the discrepancy? And it seems that the individual transcript read count is not accurate, for example, in the same file, I got 262402 reads for one transcript by calling rf-rctools stats, but when I use rf-rctools view, I can see the highest coverage nucleotide is around 320000. I'm constantly learning as I'm a beginner to bioinformatics, and I appreciate your help!

dincarnato commented 4 months ago

Hi,

To understand if this is a real issue or an expected behaviour, I would need a test BAM file, Fasta file, to reproduce "the issue". Also the command line you executed.

Best, Danny

light0112 commented 4 months ago

Hi, I found out what causes the problem. This comes from a bug from "rf-rctools merge". Please see attached example: ~/test$ rf-rctools merge 7SK.rc 7SK_2.rc 7SK_3.rc rf-rctools stats 7SK.rc 7SK_0-330 109922

Total 5137063 ~/test$ rf-rctools stats 7SK_2.rc 7SK_0-330 114366

Total 5413933 ~/test$ rf-rctools stats 7SK_3.rc 7SK_0-330 4913

Total 257458 ~/test$ rf-rctools stats merge.rc 7SK_0-330 119279

Total 10808454 In the above example, when the three seperate files were merged, the read number of the first file for the 7SK transcript is not added in the merged file (114366+4913=119279), yet the total read number is correct (all 3 files were added together). This caused the difference I observed before. I have attached my test files here. I can also confirm that the mutation count and coverage calculation is not affected by this bug.

Thanks, test.zip

dincarnato commented 4 months ago

Hi light0112,

yes, indeed, found the issue(s). Thanks for reporting it!

From the first of the files the coverage and count values were taken, but not the number of reads. This is fixed now. Furthermore, the reason the sum of the reads in the individual transcripts did not sum up to the total of the RC file, was due to the fact that only primary alignments were counted for the transcripts, while all alignments were counted for the total. This should also be fixed now.

Concerning the fact that:

individual transcript read count is not accurate, for example, in the same file, I got 262402 reads for one transcript by calling rf-rctools stats, but when I use rf-rctools view, I can see the highest coverage nucleotide is around 320000.

This is again most likely due to the fact that only primary alignments are counted when computing the read count, while all alignments are considered by default when calculating coverage and mutations/RT stops. If you re-run rf-count with the -pn or --primary-only flag, the discrepancy should disappear.

My suggestion is to filter transcripts by median coverage, rather than absolute read count. This can be achieved via the rf-norm module, through the -ec parameter.

Please isse a git pull and let me know if this solves.

All the best, Danny

light0112 commented 4 months ago

Hi, Thanks for the fix. The merge command works fine for me now. The previous discrepancy between 262402 and 320000 was simply because I merged 6 different rc files. I'm pretty sure I only had only the primary alignment reported in my bam file, but thanks for letting me know the fix, it could be helpful in the future. The reason I would want to use the absolute read count is because I had an enrichment towards 5'UTR during my library prep, therefore it will be very difficult to set up the threshold if I use median coverage option, as the transcript length varies a lot. Do you have other better solution to this scenario? Thanks a lot!

dincarnato commented 4 months ago

Understood. Maybe you can calculate the median coverage in bins, and only take the n-th 5'-most ones? Feel free to close the issue if solved.

Cheers, Danny