Hi,
I had a question about the DSC_estimated_error_rates.pdf calculation. For one example sample, for one specific trinucleotide context, I tried to reconstruct the calculation:
ACG>T = 150 -> count from results.SSC-mismatches-Pyrimidine.triprofiles.tsv
ACG>T = 15 -> count from results.SSC-mismatches-Purine.triprofiles.tsv
ACG trinucleotide count = 7674995 -> count from results.trint_counts_and_ratio2genome.tsv tri_bg column, which is the number of times ACG/CGT base pairs are observed, which is also equivalent to the number of times ACG is observed in single-strand consensus sequences.
ACG error rate I see in DSC_estimated_error_rates.pdf is ~1.5e-10.
I infer this was calculated as: (150/[7674995/2]) * (15/[7674995/2]),
which matches this code line in nanoseq_results_plotter.R:
y = (tri_obs_pyr / (tri_bg[tris] / 2)) * (tri_obs_pur / (tri_bg[tris] / 2));
However, why was the ACG trinucleotide count of 7674995 divided by 2 for both the pyrimidine and purine parts of the error rate calculation?
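As a sanity check, the halved denominator does reproduce the reported value. This is just an illustrative Python reconstruction using the counts quoted above, not the pipeline's actual R code:

```python
# Counts quoted above for this sample and trinucleotide context
pyr_mismatches = 150   # ACG>T from results.SSC-mismatches-Pyrimidine.triprofiles.tsv
pur_mismatches = 15    # ACG>T from results.SSC-mismatches-Purine.triprofiles.tsv
tri_bg = 7674995       # ACG count, tri_bg column of results.trint_counts_and_ratio2genome.tsv

# Mirrors the nanoseq_results_plotter.R line: each denominator is tri_bg / 2
rate = (pyr_mismatches / (tri_bg / 2)) * (pur_mismatches / (tri_bg / 2))
print(rate)  # ~1.5e-10, matching DSC_estimated_error_rates.pdf
```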
I would have thought that the calculation would be:
[probability of having an ACG>T change in one strand = 150 / 7674995] X [probability of having a CGT>A change in the other strand = 15 / 7674995]
= (150/7674995)*(15/7674995) = ~3.8e-11
This is lower by a factor of 4 than the DSC_estimated_error_rates.pdf estimate.
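In the same form, here is my alternative calculation (denominator not halved) next to the pipeline's, again as an illustrative Python sketch rather than the actual code:

```python
pyr_mismatches, pur_mismatches, tri_bg = 150, 15, 7674995

# Pipeline version: each denominator is tri_bg / 2
halved = (pyr_mismatches / (tri_bg / 2)) * (pur_mismatches / (tri_bg / 2))
# My suggestion: full tri_bg as the denominator for both factors
full = (pyr_mismatches / tri_bg) * (pur_mismatches / tri_bg)

print(full)           # ~3.8e-11
print(halved / full)  # ~4: halving each denominator doubles each factor
```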
To visually illustrate:
The difference in my suggestion is that I use the context count from the results.trint_counts_and_ratio2genome.tsv tri_bg column as the denominator, rather than half of it, since the number in the tri_bg column is equivalent to the number of times the ACG trinucleotide context is observed across both strands. I'm also assuming the counts in SSC-mismatches-Pyrimidine.triprofiles.tsv give the number of times ACG>T is observed across both strands (i.e. the sum of ACG>T ssDNA calls on both strands), and likewise for CGT>A in SSC-mismatches-Purine.triprofiles.tsv.
It's a bit tricky to think through, so perhaps you can catch an error in my logic.
Thanks!