dincarnato / RNAFramework

RNA structure probing and post-transcriptional modifications mapping high-throughput data analysis
http://www.rnaframework.com
GNU General Public License v3.0
31 stars 11 forks source link

shapemapper2 comparison #39

Closed catherine-douds closed 1 year ago

catherine-douds commented 1 year ago

Hello,

I am working on a mutational profiling experiment with a new chemical probe that has a lower mutation rate and is very sensitive to extra noise. I really enjoy the tools available with RNAFramework, but am getting more accurate reactivity values with shapemapper2. I think this might have something to do with the rf-count module. Is it possible to recreate the counting of shapemapper2 with a certain flag combination in this module?

I tried -t5 primer+1, -la -ld -cc -mc 6 -q 30 -es and the results were close but not quite the same.

dincarnato commented 1 year ago

Dear catherine-douds,

thanks for your message. I am not sure which parameters of SHAPEmapper you are using, nor I have ever dived into the algorithmic details of the tool. Can you provide some more context? For instance, how do you quantify the "accuracy" of the reactivity values?

Best, Danny

catherine-douds commented 1 year ago

Hi Danny,

I am using shapemapper2 and quantifying accuracy with ROC curves of the ribosomal data I am looking at, where a true positive is reactive and single stranded and a true negative is below the reactivity threshold and paired.

I used the -t5 primer+1 flag to account for shapemapper2's primer trimming step. The -la and -ld I think correspond to how they describe the ambiguously aligned mutation step. -cc -mc 6 should account for multinucleotide handling where mutations within 6nt are collapsed to the left-most position. And -q 30 -es should account for post-alignment basecall quality filtering. However, it seems like in the shapemapper2 algorithm this step also changes the effective coverage at a nucleotide and I don't see a way to do that with RNAFramework. The algorithms may also differ by their chemical adduct location inference, where shapemapper2 assumes the adduct position is immediately 5' of the last unchanged position (or the first mutated nucleotide during RT), but I also am a little confused at this step and may be interpreting things inaccurately.

Thank you, Catherine

dincarnato commented 1 year ago

When you talk about the "primer trimming step", are we talking about a random RT primer, or a gene-specific primer? Is it the forward or reverse primer? Is this experiment sequenced in a directional or non-directional fashion?

catherine-douds commented 1 year ago

So sorry, it's the random RT primer. The libraries were prepared as paired end directional sequencing.

dincarnato commented 1 year ago

Is this directional second-strand or first-strand? So, is the sequence of your forward read the same as the RNA strand, or the reverse complement?

catherine-douds commented 1 year ago

The forward read with the R1 adapter on the 5' end is the reverse compliment of the RNA.

dincarnato commented 1 year ago

Ok now I see. The issue is that the -t5 parameter only applies to RT-stop type experiments, so it has no effect with mutational profiling. Can you try to map your reads after trimming the 5' end and then pass that to rf-count? If you are mapping using rf-map you can just pass the -b5 parameter. Let me know if this improves the results. If not, we can try dig into it a bit more.

catherine-douds commented 1 year ago

Hi Danny, changing this parameter improves the AUC marginally, but shapemapper2 is still doing better. I think I narrowed down the difference to two factors.

1) I think different normalization plays a roll because when I map the raw mutation rates from both programs they are nearly identical. 2) I think there is also a minor benefit from the way shapemapper calculates the effective coverage, which I think comes from the post alignment basecall quality filtering step in their algorithm.

dincarnato commented 1 year ago

Can you share with me the raw reactivity profiles from both programs, as well as the normalized reactivity profiles?

dincarnato commented 1 year ago

Hi Catherine,

Any chance you can share the data I indicated in my prev. message? I have a hunch about where the difference might reside, but I would need the data to verify it!

Thanks, Danny

catherine-douds commented 1 year ago

Hi Danny,

I am excited to see what you figured out since I am so stumped. Since this involves sharing unpublished data, I will pivot to sharing it by email.

Catherine

dincarnato commented 1 year ago

Sure, no problem. I don't need the sequence of the transcripts, only the values.

catherine-douds commented 1 year ago

Making sure that the random primer was trimmed before rf-count, along with the flags -la -ld -cc -mc 6 -q 30 -es got the results to be reproducible!

Thank you for your help, I am very excited to use this as a jumping off point to optimize this further.

dincarnato commented 1 year ago

Thanks Catherine, glad we got this to work! Will close the issue then :-)

All the best, Danny