Transcriptome mode error rate tsv explanation

kosmasgal commented 1 year ago

My model's error rates:

Mismatch rate: 0.016413157857845667
Insertion rate: 0.016979855862026057
Deletion rate: 0.013089932495649183
Total error rate: 0.04648294621552091

Could you please help me understand what these error rates represent? Are they per-base rates, or do they have something to do with the Markov chain model? If they are per base, does that mean that about 4.6% of my read bases contain an error, with the total being the sum of the three individual rates?

For example, using the insertion rate:

If I have a read of that chemistry of a 1000 bases long, then I should expect about 17 insertions to occur, where the length of each is determined by the respective Weibull distribution?

kmnip commented 1 year ago

The definition of error rates in the simulated reads should be identical to how errors are profiled in the experimental reads.

For example, the error rate for insertions is calculated like so:

total_ins * 1.0 / (total_mis + total_match + total_del)

where:

total_ins is the total number of inserted bases;
total_mis is the total number of mismatched bases;
total_match is the total number of matched bases;
total_del is the total number of deleted bases,

all of which are determined relative to the reference genome.
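As a minimal sketch of the formula above (not NanoSim's actual code), the insertion rate can be computed from the four counts as shown here. The function name and the example counts are hypothetical and chosen only for illustration. Note that matches, mismatches and deletions are the alignment operations that consume reference bases, so the denominator is the number of aligned reference positions rather than the read length.

```python
def insertion_rate(total_ins, total_mis, total_match, total_del):
    """Inserted bases per aligned reference base, following the formula above.

    Hypothetical helper for illustration; not part of NanoSim.
    """
    # Matches, mismatches and deletions each consume one reference base.
    aligned_ref_bases = total_mis + total_match + total_del
    if aligned_ref_bases == 0:
        raise ValueError("no aligned reference bases")
    return total_ins * 1.0 / aligned_ref_bases


# Hypothetical counts, not taken from any real profile.
example_rate = insertion_rate(total_ins=17, total_mis=16, total_match=971, total_del=13)
print(example_rate)
```

With these made-up counts, 17 inserted bases over 1000 aligned reference positions give a rate of about 0.017.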

"If I have a read of that chemistry of a 1000 bases long, then I should expect about 17 insertions to occur..."

This assumption is incorrect because the error rates are calculated based on the number of inserted bases, not the number of insertion events.
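To make the distinction concrete, here is a rough sketch using the insertion rate from the question. The 1000 aligned reference bases and the mean insertion length of 2.0 are hypothetical values picked for illustration; the real length distribution comes from the fitted model and is not reproduced here.

```python
# Insertion rate reported in the question.
ins_rate = 0.016979855862026057

# Hypothetical stretch of aligned reference positions.
aligned_ref_bases = 1000

# The rate counts inserted BASES, so this is the expected number of bases.
expected_inserted_bases = ins_rate * aligned_ref_bases


def expected_insertion_events(inserted_bases, mean_insertion_length):
    """Rough event count: inserted bases divided by the mean insertion length.

    Hypothetical helper for illustration; not part of NanoSim.
    """
    if mean_insertion_length <= 0:
        raise ValueError("mean insertion length must be positive")
    return inserted_bases / mean_insertion_length


# Assumed mean insertion length of 2.0 bases (an illustrative value only).
expected_events = expected_insertion_events(expected_inserted_bases, 2.0)

print(expected_inserted_bases)  # roughly 17 inserted bases
print(expected_events)          # roughly half as many insertion events
```

Under these assumptions, about 17 inserted bases are expected, spread over fewer insertion events whenever the mean insertion length is greater than one.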

kosmasgal commented 1 year ago

Thank you very much for the explanation!

bcgsc / NanoSim

Transcriptome mode error rate tsv explanation #186