Oddities with reference coverage

A-N-Other commented 2 years ago

Hi Danny

I'm in the process of putting together a script that scans a working directory for RNAFramework outputs and generates a variety of QC plots for what it finds - sequence coverage, average baseQ across the transcript from the BAMs, summaries of % read mapping, per-base reactivity plots, etc. However, when getting transcript coverage data from an .rc file using rf-rctools view I was getting unrealistically high depths and, if I compare the numbers to what samtools reports, they're very different!

RNAFramework (I'm running 2.7.2): rf-rctools view rf-count/5NIA-1.rc | sed -n '4 s/,/\n/gp' > depths_rf-rctools.txt samtools: samtools depth -a rf-map/5NIA-1.bam | cut -f3 > depths_samtools.txt

For two different samples from two totally separate experiments (the second was from May and run with a different version of RNAFramework), here are two comparison plots: DepthComparison.pdf

What's even more strange to me is that the profiles of those lines are also different! Out of interest I also ran rf-wiggle --ratio and it also uses the rf-rctools version of the depths in the calculation.

Can you shed some light on this for me - am I misinterpreting what's being reported? Thanks for your help!

dincarnato commented 2 years ago

Hi George,

thanks for your message. So, in order to understand what might be going on I need 2 things:

The BAM file
The rf-count command you executed

This way I can reproduce the putative issue. Best,

Danny

A-N-Other commented 2 years ago

Thanks, Danny. I've shared a Dropbox link with you (@rug.nl email) - hope that's OK. I dug around for the smallest BAM I could find that displayed the issue without having to subsample and re-do rf-count. I've shared with you one of the DMSO control samples from an old Flu (PR8) SHAPE-MaP experiment. In the tarball you should have the BAM and index, the .rc file and the two different sets of depths that you get by running:

~/bin/RNAFramework/rf-rctools view -t rf-count/DMSO-1.rc > depths_rf-rctools.txt and samtools depth -a rf-map/DMSO-1.bam > depths_samtools.txt

Directly from my SLURM submission script, the exact command run was:

~/bin/RNAFramework/rf-count \
  --processors 1 \
  --working-threads 8 \
  --sorted \
  --tmp-dir "$tmp" \
  --output-dir rf-count/"$base" \
  --overwrite \
  --fasta reference/reference.fa \
  --mask-file reference/rf-count.blacklist \
  --count-mutations \
  --min-quality 20 \
  --eval-surrounding \
  --collapse-consecutive \
  --max-collapse-distance 6 \
  --right-deletion \
  "$BAM"

Many thanks for your help with this!

George

dincarnato commented 2 years ago

Can you please send also the reference.fa file?

A-N-Other commented 2 years ago

Attached that for you here ... .fa.txt so that GitHub will take it. reference.fa.txt

dincarnato commented 2 years ago

Hi George,

so, the "issue" was caused by the fact that, by default, samtools depth excludes certain reads, for example reads marked as PCR/optical duplicates. rf-count did not make such a selection. I have now added this filtering by default. See the resulting coverage below.

It can be turned off by setting "-ndd" or "--no-discard-duplicates".

I hope this helps and solves the issue. Thanks so much for reporting this (unexpected) behaviour! Best,

Danny

A-N-Other commented 2 years ago

Fantastic! Many thanks for looking into this so quickly. I'll pull down the latest version and re-run my pipeline.

dincarnato commented 2 years ago

Great. Also, please re-pull the latest update. I noticed that, by mistake after my changes, the total % of mutated reads in the output was wrong (twice as much what it was supposed to be). It's fixed now. This did not affect the RC files.

A-N-Other commented 2 years ago

Thanks, Danny. Working exactly as I'd expect at this point.

dincarnato / RNAFramework

Oddities with reference coverage #20