HRGV / phyloFlash

phyloFlash - A pipeline to rapidly reconstruct the SSU rRNAs and explore phylogenetic composition of an illumina (meta)genomic dataset.
GNU General Public License v3.0

Incorrect counting of read pairs #173

Closed kbseah closed 1 year ago

kbseah commented 1 year ago

Users have observed that the read pair counts reported in log files do not add up (the lines following "Reporting mapping statistics for paired end input"). We expect $SSU_total_pairs to equal the sum of $ssu_pairs, $ssu_bad_pairs, and $mapped_half.
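The expected relationship between the counts can be written as a simple consistency check. This is only a sketch in Python rather than the pipeline's Perl; the function and argument names are hypothetical and merely mirror the log variables above, and the numbers in the example are made up for illustration.

```python
def pair_count_discrepancy(total_pairs, pairs, bad_pairs, mapped_half):
    """Return how far the reported total is from the sum of its parts.

    Argument names are hypothetical and mirror the log variables:
    total_pairs ~ $SSU_total_pairs, pairs ~ $ssu_pairs,
    bad_pairs ~ $ssu_bad_pairs, mapped_half ~ $mapped_half.
    A result of 0 means the counts add up as expected.
    """
    return total_pairs - (pairs + bad_pairs + mapped_half)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from a real log file.
    diff = pair_count_discrepancy(100, 80, 5, 10)
    if diff != 0:
        print(f"Read pair counts are off by {diff}")
```

A nonzero result on a real log would indicate that some reads were assigned to the wrong category or dropped during counting.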

The regex in line 1194 of phyloFlash.pl does not cover all possible read name formats. Some read names contain the pattern internally and are split in the wrong place, so those reads are not counted correctly. Some libraries have a space before the 1 and 2 read segment suffix, and some have no segment suffix at all. Anchoring the regex with a line-ending $ will not solve the problem either, because the 1 or 2 suffix may not be the last character of the read name.
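The failure modes described above can be illustrated with a small sketch. It is written in Python rather than Perl, and it is not the regex from phyloFlash.pl or the fix that was eventually released: the naive pattern `/[12]`, the example read names, and both function names are assumptions made up for illustration.

```python
import re

# Hypothetical naive pattern: treats ANY "/1" or "/2" as the segment suffix.
NAIVE_SUFFIX = re.compile(r"/[12]")

# Segment suffix anchored to the end of the first whitespace-delimited token.
TOKEN_SUFFIX = re.compile(r"/[12]$")


def naive_pair_id(read_name):
    """Split on the first "/1" or "/2" anywhere in the name (error-prone)."""
    return NAIVE_SUFFIX.split(read_name)[0]


def tolerant_pair_id(read_name):
    """Derive a pair ID that is the same for both segments of a pair.

    Takes the first whitespace-delimited token, so a segment field that
    follows a space is ignored, then strips a trailing /1 or /2 if present.
    Names without any suffix are returned unchanged.
    """
    parts = read_name.split(None, 1)
    token = parts[0] if parts else ""
    return TOKEN_SUFFIX.sub("", token)


if __name__ == "__main__":
    examples = [
        "read7/1",             # classic suffix
        "read7 1:N:0:ATCACG",  # space before the segment number
        "read7",               # no suffix at all
        "read7/1 comment",     # suffix is not the last character
        "lib/2019/read7/1",    # pattern also occurs inside the name
    ]
    for name in examples:
        print(name, "|", naive_pair_id(name), "|", tolerant_pair_id(name))
```

With these made-up names, the naive split truncates "lib/2019/read7/1" to "lib", whereas the token-based version keeps "lib/2019/read7". The token-based approach still assumes the first whitespace-delimited field identifies the pair, which may not hold for every library.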

See https://github.com/HRGV/phyloFlash/commit/c50547f1437fb7abb3d0257d641cbbdc90ec58a7#commitcomment-89103251

The object %qname_hash does not appear to be used elsewhere and could likely be removed without problems.

However, a similar regex is used in PhyloFlash.pm at lines 976, 1192, and 1213 (introduced in 4973a5e34c4ceb0a78f06690012aaa835801f78b).

Thanks to Yannick Colin for the detailed report that helped us find this issue.

kbseah commented 1 year ago

Fixed in release pf3.4.2