clwgg / nQuire

A statistical framework for ploidy estimation using NGS short-read data
MIT License

Mapping quality and ploidy inference #17

Open mason-linscott opened 2 years ago

mason-linscott commented 2 years ago

Hi @clwgg,

I am using nQuire to determine the ploidy of a highly repetitive metazoan genome (85% repeat content) using paired-end data at 60x coverage. When I run nQuire with the defaults and denoise the bin file, the output of lrdmodel indicates that the genome is tetraploid. However, when I restrict the sites to those with a mapping quality greater than 30 and coverage less than 200, the diploid model has the lowest delta likelihood.

In your manuscript, you write: "The exact coverage and number of positions needed for a reliable estimation of ploidy will however depend on the complexity and repetitiveness of the genome. Additionally, it is possible to obtain high quality positions by using BED files to define regions of low repetitiveness, where base frequencies can be more confidently assessed." Could you please elaborate on this? Any advice on choosing between these two results, or on a different approach, would be appreciated.

The lrdmodel results are below, and the denoised histograms are attached.

BASE denoised:

nQuire/nQuire lrdmodel base_denoised.bin

| file | free | dip | tri | tet | d_dip | d_tri | d_tet |
| --- | --- | --- | --- | --- | --- | --- | --- |
| base_denoised.bin | 8828893.543759 | -109585.312918 | 3402379.079818 | 7818904.804943 | 8938478.856677 | 5426514.463940 | 1009988.738816 |

q30 max cov 200:

nQuire/nQuire lrdmodel c20q30m200_denoised.bin

| file | free | dip | tri | tet | d_dip | d_tri | d_tet |
| --- | --- | --- | --- | --- | --- | --- | --- |
| c20q30m200_denoised.bin | 3851483.940194 | 3626143.067490 | 2214822.767299 | 1718518.690290 | 225340.872705 | 1636661.172895 | 2132965.249904 |
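To make the comparison explicit, here is a minimal Python sketch of how I am reading the two outputs above. It assumes each d_* column is the free-model log-likelihood minus the corresponding fixed-model log-likelihood, which is inferred from the numbers rather than from the nQuire documentation, and that the model with the lowest delta is the preferred one. The helper names (`best_model`, `summarize`) are hypothetical and not part of nQuire.

```python
# Rows copied from the lrdmodel output above.
# Columns: file, free, dip, tri, tet, d_dip, d_tri, d_tet
ROWS = """\
base_denoised.bin 8828893.543759 -109585.312918 3402379.079818 7818904.804943 8938478.856677 5426514.463940 1009988.738816
c20q30m200_denoised.bin 3851483.940194 3626143.067490 2214822.767299 1718518.690290 225340.872705 1636661.172895 2132965.249904
"""

MODELS = ("dip", "tri", "tet")


def best_model(line):
    """Recompute the deltas for one output row and pick the smallest.

    Assumption: delta = free log-likelihood - fixed-model log-likelihood.
    """
    fields = line.split()
    name = fields[0]
    free = float(fields[1])
    fixed = [float(value) for value in fields[2:5]]
    deltas = {model: free - loglik for model, loglik in zip(MODELS, fixed)}
    best = min(deltas, key=deltas.get)
    return name, best, deltas


def summarize(text):
    """Map each file name to the model with the lowest delta."""
    result = {}
    for line in text.splitlines():
        if line.strip():
            name, best, _ = best_model(line)
            result[name] = best
    return result


if __name__ == "__main__":
    for line in ROWS.splitlines():
        name, best, deltas = best_model(line)
        print(name, "lowest delta:", best, round(deltas[best], 3))
```

Under that reading, the unfiltered file favors the tetraploid model and the filtered file favors the diploid model, which is the disagreement I am asking about.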

base_histo.txt q30m200_histo.txt

Thank you! -M