WGLab / LongReadSum

MIT License

Update basecalling error rate implementation #17

Status: Closed (jonperdomo closed this issue 1 year ago)

jonperdomo commented 1 year ago

Basecalling error rates should be based on the alignment to the reference genome. LongReadSum currently calculates error rates from the SAM MD tag. See this line:

https://github.com/WGLab/LongReadSum/blob/b1423bef7bc7f079560c87ad661025f9fad179fa/BAM_module.cpp#L238

To get alignment mismatches, replace this with a direct comparison of the query sequence against the reference sequence at each aligned position.
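A rough sketch of what that comparison could look like, written in Python for brevity rather than as a patch to `BAM_module.cpp`. The function names, the `counts` dictionary, and the choice to count every inserted and deleted base as one error are assumptions made for illustration, not LongReadSum's actual implementation. It assumes the reference sequence for the aligned region is already available as a string.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_alignment_errors(cigar, query, ref, ref_start=0):
    """Walk the CIGAR string and compare query bases to reference bases.

    query is the read sequence as stored in the record (soft clips included),
    ref is the reference sequence, and ref_start is the 0-based reference
    position of the first aligned base. All names here are hypothetical.
    """
    counts = {"match": 0, "mismatch": 0, "ins": 0, "del": 0}
    q = 0          # current position in the query
    r = ref_start  # current position in the reference
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":
            # Aligned columns: the only place where bases are compared.
            for i in range(n):
                if query[q + i].upper() == ref[r + i].upper():
                    counts["match"] += 1
                else:
                    counts["mismatch"] += 1
            q += n
            r += n
        elif op == "I":
            counts["ins"] += n  # bases present in the read only
            q += n
        elif op == "D":
            counts["del"] += n  # bases present in the reference only
            r += n
        elif op == "S":
            q += n              # soft clip: consumes query, not counted
        elif op == "N":
            r += n              # skipped region: consumes reference
        # H and P consume neither sequence
    return counts


def error_rate(counts):
    """Errors per alignment column (one possible definition)."""
    errors = counts["mismatch"] + counts["ins"] + counts["del"]
    columns = counts["match"] + errors
    return errors / columns if columns else 0.0
```

For example, `count_alignment_errors("2S4M1I3M2D2M", "NNACCTTACGCA", "ACGTACGTACG")` walks 12 alignment columns (9 aligned, 1 inserted, 2 deleted). Soft-clipped bases are skipped here, which is itself a design choice that affects the reported rate.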

kaichop commented 1 year ago

There are various ways that people calculate and report the "error rate" of basecalling. Besides the CIGAR string, another thing to look into is the de:f tag, which was implemented in minimap2 (https://github.com/lh3/minimap2/issues/281).
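To make the difference between definitions concrete, here is a small sketch comparing two common ones: a per-base rate where every gap base counts as an error, and a gap-compressed rate where each run of consecutive gap bases counts once, which is the idea behind the gap-compressed divergence that minimap2 reports. The function names and the input format (a list of gap lengths) are hypothetical, and the formulas are a plain reading of those two definitions rather than minimap2's code.

```python
def per_base_error_rate(matches, mismatches, gap_lengths):
    """Every inserted or deleted base counts as one error."""
    gap_bases = sum(gap_lengths)
    columns = matches + mismatches + gap_bases
    return (mismatches + gap_bases) / columns if columns else 0.0


def gap_compressed_error_rate(matches, mismatches, gap_lengths):
    """Each gap counts as one error, regardless of its length."""
    gap_opens = len(gap_lengths)
    columns = matches + mismatches + gap_opens
    return (mismatches + gap_opens) / columns if columns else 0.0


# Same alignment, two definitions: 90 matches, 2 mismatches,
# one 1-base gap and one 8-base gap.
per_base = per_base_error_rate(90, 2, [1, 8])          # 11 / 101
compressed = gap_compressed_error_rate(90, 2, [1, 8])  # 4 / 94
```

A single long indel moves the per-base rate much more than the gap-compressed one, so whichever definition is chosen should be stated alongside the reported number.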