KamilSJaron / smudgeplot

Inference of ploidy and heterozygosity structure using whole genome sequencing data
Apache License 2.0

estimate_1n_coverage_1d_subsets averages over two distinct high peaks #123

Closed KamilSJaron closed 5 days ago

KamilSJaron commented 1 year ago

About your genome

The tetraploid Saccharomyces (SRR3265401)

The AB/AABB, AAB and AAAB subsets have one major peak each, and their estimates come out as

30.4 2236.5
20.2 1431.4
14.6 3077.4

The 20.2 is simply a mess-up, but 30.4 and 14.6 differ because of a denominator mistake: the AB/AABB subset divides the coverage of the AABB smudge as if it were AB, which doubles the AB/AABB coverage estimate. The 14.6 is close to the truth because the first peak is indeed AAAB.
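To make the doubling concrete, here is a minimal sketch of the arithmetic. It assumes the 1n estimate is a smudge's summed k-mer pair coverage divided by the number of genome copies the smudge is believed to represent (2 for AB, 4 for AABB). The function name and the value `60.8` are hypothetical; the latter is simply back-calculated as 2 × 30.4 and is not taken from the data.

```python
# Illustrative only: the divisor-per-smudge model is an assumption,
# and one_n_coverage is a hypothetical helper, not smudgeplot code.

def one_n_coverage(coverage_sum, assumed_copies):
    """Return the 1n coverage implied by a smudge's summed pair coverage."""
    return coverage_sum / assumed_copies

# Back-calculated from the reported 30.4 estimate (2 * 30.4), not measured.
pair_coverage_sum = 60.8

misread_as_ab = one_n_coverage(pair_coverage_sum, 2)  # smudge treated as AB
read_as_aabb = one_n_coverage(pair_coverage_sum, 4)   # smudge treated as AABB

print(f"treated as AB:   {misread_as_ab:.1f}")
print(f"treated as AABB: {read_as_aabb:.1f}")
```

Under these assumptions, reading the smudge as AABB gives a value of about 15, which sits next to the 14.6 from the AAAB subset.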

The weighted mean ends up with an estimate in between the two possible interpretations, ~20, which is a bit unfortunate. Perhaps a weighted median would do better justice, but this should be tested on many genomes.
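For comparison, a small sketch that computes both summaries on the three estimates above. It assumes the second column of the listing holds the weights, which is consistent with the weighted mean landing near 20. The `weighted_median` helper is hypothetical and uses the lower weighted median (the first value at which the cumulative weight reaches half of the total); it is not the smudgeplot implementation.

```python
# Assumption: first column = per-subset 1n estimate, second column = weight.
estimates = [30.4, 20.2, 14.6]
weights = [2236.5, 1431.4, 3077.4]

def weighted_mean(values, weights):
    """Sum of value * weight divided by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def weighted_median(values, weights):
    """Lower weighted median: first sorted value whose cumulative
    weight reaches at least half of the total weight."""
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("weights must be positive and non-empty")

mean_estimate = weighted_mean(estimates, weights)
median_estimate = weighted_median(estimates, weights)

print(f"weighted mean:   {mean_estimate:.2f}")
print(f"weighted median: {median_estimate:.1f}")
```

On this particular genome the weighted median selects one of the three subset estimates rather than blending them, so whether it helps depends on which subset carries the cumulative weight past the halfway point. That is one more reason to test on many genomes.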

KamilSJaron commented 5 days ago

This style of coverage estimation is abolished in the next version.