jonperdomo closed 1 year ago
There are various ways to calculate and report the basecalling "error rate". Besides the CIGAR string, another thing to look into is the de:f tag (gap-compressed per-base sequence divergence), which was implemented in minimap2 (https://github.com/lh3/minimap2/issues/281).
Basecalling error rates should be based on alignment to a reference genome. LongReadSum currently calculates error rates from the SAM MD tag. See this line:
https://github.com/WGLab/LongReadSum/blob/b1423bef7bc7f079560c87ad661025f9fad179fa/BAM_module.cpp#L238
To get alignment mismatches directly, replace this with a base-by-base comparison of the reference sequence against the aligned query.
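A rough sketch of what that comparison could look like, in case it helps. This is not the LongReadSum code: `AlignmentCounts`, `count_alignment_events`, and `mismatch_rate` are hypothetical names, and the function takes plain strings for simplicity. It assumes `ref` is the reference slice starting at the alignment's leftmost mapped position and `query` is the read sequence as stored in the record. In an htslib-based module the inputs would presumably come from something like `bam_get_cigar`, `bam_get_seq`, and `faidx_fetch_seq`. The reference lookup is needed because minimap2 emits `M` for both matches and mismatches unless it is run with `--eqx`.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Per-read tallies gathered while walking one alignment (hypothetical type).
struct AlignmentCounts {
    std::size_t matches = 0;
    std::size_t mismatches = 0;
    std::size_t inserted_bases = 0;
    std::size_t deleted_bases = 0;
    std::size_t clipped_bases = 0;  // soft clips only
};

// Walk a text CIGAR and compare query bases against reference bases.
// Assumption: ref[0] corresponds to the alignment's leftmost mapped position.
AlignmentCounts count_alignment_events(const std::string& cigar,
                                       const std::string& query,
                                       const std::string& ref)
{
    AlignmentCounts c;
    std::size_t q = 0, r = 0, len = 0;
    bool have_len = false;
    for (char op : cigar) {
        if (std::isdigit(static_cast<unsigned char>(op))) {
            len = len * 10 + static_cast<std::size_t>(op - '0');
            have_len = true;
            continue;
        }
        if (!have_len)
            throw std::invalid_argument("CIGAR operation without a length");
        switch (op) {
        case 'M': case '=': case 'X':  // aligned columns: consume both
            if (q + len > query.size() || r + len > ref.size())
                throw std::out_of_range("CIGAR runs past query or reference");
            for (std::size_t i = 0; i < len; ++i) {
                int a = std::toupper(static_cast<unsigned char>(query[q + i]));
                int b = std::toupper(static_cast<unsigned char>(ref[r + i]));
                if (a == b) ++c.matches; else ++c.mismatches;
            }
            q += len; r += len;
            break;
        case 'I': c.inserted_bases += len; q += len; break;  // query only
        case 'S': c.clipped_bases += len;  q += len; break;  // query only
        case 'D': c.deleted_bases += len;  r += len; break;  // reference only
        case 'N': r += len; break;                           // skipped region
        case 'H': case 'P': break;                           // consume neither
        default:
            throw std::invalid_argument("unknown CIGAR operation");
        }
        len = 0;
        have_len = false;
    }
    return c;
}

// One possible definition: mismatches over aligned (M/=/X) columns.
// Other definitions also count inserted and deleted bases.
double mismatch_rate(const AlignmentCounts& c)
{
    const std::size_t aligned = c.matches + c.mismatches;
    return aligned == 0 ? 0.0 : static_cast<double>(c.mismatches) / aligned;
}
```

For example, `count_alignment_events("2S3M1I2M1D2M", "NNACGTTAGG", "ACCTACGG")` walks ten query bases and eight reference bases and tallies 6 matches, 1 mismatch, 1 inserted base, 1 deleted base, and 2 soft-clipped bases. Whether indels and clipped bases should enter the reported error rate is a definition choice, which is why the sketch keeps them as separate counts.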