cancerit / NanoSeq

Analysis software for Nanorate Sequencing (NanoSeq) experiments
GNU Affero General Public License v3.0
13 stars 8 forks source link

Question about DSC_estimated_error_rates calculation #92

Open gevro opened 7 months ago

gevro commented 7 months ago

Hi, I had a question about the DSC_estimated_error_rates.pdf calculation. For one example sample, for one specific trinucleotide context, I tried to reconstruct the calculation:

ACG>T = 150 -> count from results.SSC-mismatches-Pyrimidine.triprofiles.tsv ACG>T = 15 -> count from results.SSC-mismatches-Purine.triprofiles.tsv ACG trinucleotide count = 7674995 -> count from results.trint_counts_and_ratio2genome.tsv tri_bg column, which is the number of times ACG/CGT base pairs are observed, which is also equivalent to the number of times ACG is observed in single-strand consensus sequences.

ACG error rate I see in DSC_estimated_error_rates.pdf is ~1.5e-10.

I infer this was calculated as: (150/[7674995/2])(15/[7674995/2]), which matches this code line in nanoseq_results_plotter.R: y = (tri_obs_pyr / (tri_bg[tris] / 2)) (tri_obs_pur / (tri_bg[tris] / 2));

However, why was the ACG trinucleotide count of 7674995 divided by 2 for both the pyrimidine and purine parts of the error rate calculation?

I would have thought that the calculation would be: [probability of having an ACG>T change in one strand = 150 / 7674995] X [probability of having a CGT>A change in the other strand = 15 / 7674995]

= (150/7674995)*(15/7674995) = ~3.8e-11

This is lower by a factor of 4 from the DSC_estimated_error_rates.pdf estimate.

To visually illustrate:

Screenshot 2024-03-26 at 5 52 27 PM

The difference in my suggestion is I am considering the context count from results.trint_counts_and_ratio2genome.tsv tri_bg column as the denominator, rather than half of that, since the number in the tri_bg column is equivalent to the number of times the ACG trinucleotide context is observed in both strands. And I'm assuming the counts in SSC-mismatches-Pyrimidine.triprofiles.tsv contains the number of times ACG>T is observed in both strands (i.e. sum of ACG>T ssDNA calls in both strands), and likewise for CGT>A in SSC-mismatches-Purine.triprofiles.tsv.

It's a bit tricky to think through, so perhaps you can catch an error in my logic.

Thanks!

gevro commented 7 months ago

Note: I updated my original post.