bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4
https://bcm-uga.github.io/pcadapt

Unexpected imputation in pcadapt.pcadapt_pool #43

Closed by brandonlind 4 years ago

brandonlind commented 4 years ago

In the function pcadapt.pcadapt_pool, the code imputes any missing data with the mean frequency. This should probably be an opt-in flag rather than the default. Although the documentation says that a 9 should be put in place of any missing data, imputation is not mentioned in manual.pdf for v4.1.0, nor in the article https://bcm-uga.github.io/pcadapt/articles/pcadapt.html

privefl commented 4 years ago

Indeed, the documentation for poolseq data is not very good, and maybe not up to date (no more sampling, @mblumuga?).

For imputation, you can always impute before giving the matrix to pcadapt() I believe.
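As a rough illustration of imputing up front, here is a minimal base R sketch. It assumes the pool-seq matrix has pools in rows and SNPs in columns, and that "mean frequency" means the per-SNP mean over the observed pools. Neither assumption has been checked against the package source here, and `impute_mean_freq` is a hypothetical helper, not a pcadapt function.

```r
# Hypothetical helper (not part of pcadapt): replace each NA with the mean
# frequency of its SNP (column), computed over the pools that were observed.
impute_mean_freq <- function(freq) {
  stopifnot(is.matrix(freq), is.numeric(freq))
  snp_means <- colMeans(freq, na.rm = TRUE)        # one mean per SNP
  missing <- which(is.na(freq), arr.ind = TRUE)    # (row, col) of each NA
  freq[missing] <- snp_means[missing[, "col"]]     # fill with that SNP's mean
  freq
}

# Toy data (assumed layout): 3 pools (rows) x 4 SNPs (columns), two NAs.
# matrix() fills column by column, so each line below is one SNP.
freq <- matrix(c(0.10, 0.50, NA,
                 0.20, NA,   0.80,
                 0.30, 0.40, 0.60,
                 0.90, 0.70, 0.50), nrow = 3)

imputed <- impute_mean_freq(freq)
```

The imputed matrix could then be handed to read.pcadapt() and pcadapt() in place of the raw one, so that the choice of imputation stays explicit in the user's own script. A SNP that is missing in every pool would come out as NaN and is probably better dropped beforehand.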

brandonlind commented 4 years ago

Imputing is fine, I would think, but the function can impute without the user's knowledge (for those who don't read the code). I was adding 9s to missing data after reading in with read.pcadapt, given that the docs mention this is automatic when reading from bed, etc., but do not mention poolseq explicitly.

pcmat <- read.pcadapt(mat, type = "pool")
pcmat[is.na(pcmat)] <- 9
...

But once I forgot to put in the 9s and had to figure out why my data looked weird. A flag, or a printed warning, in the pcadapt.pcadapt_pool function could help avoid unexpected behavior for unsuspecting users.
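Until something like that exists in the package, a user-side check along these lines might serve the same purpose. This is only a sketch in base R: `report_missing` is a hypothetical name, not a pcadapt function, and the wording of the warning is made up for illustration.

```r
# Hypothetical pre-flight check (not a pcadapt function): report missing
# entries so that any downstream imputation is never silent.
report_missing <- function(freq) {
  n_missing <- sum(is.na(freq))
  if (n_missing > 0) {
    warning(sprintf("%d of %d entries are missing and may be imputed downstream.",
                    n_missing, length(freq)),
            call. = FALSE)
  }
  invisible(n_missing)  # return the count without printing it
}

# Toy frequency matrix with two missing entries; the call emits a warning.
freq <- matrix(c(0.1, NA, 0.3, 0.4, 0.5, NA), nrow = 2)
n_missing <- report_missing(freq)
```

Calling this right before pcadapt() would surface the missing values without changing the data.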

privefl commented 4 years ago

The value 9 is used only for formats "pcadapt" and "lfmm".

For poolseq data, pcmat should be a standard R matrix with standard missing values (NA).
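If 9s have already been written into a poolseq matrix, as in the snippet earlier in this thread, they can be turned back into NA before the analysis. The sketch below is base R only and assumes the entries are allele frequencies in [0, 1], so that 9 can never be a real value. `sentinel_to_na` is a hypothetical helper, not a pcadapt function.

```r
# Hypothetical helper (not a pcadapt function): turn a sentinel value back
# into NA. Assumes entries are allele frequencies in [0, 1], so 9 is never valid.
sentinel_to_na <- function(freq, sentinel = 9) {
  # which() drops existing NAs from the comparison, so they are left untouched
  freq[which(freq == sentinel)] <- NA
  freq
}

# Toy matrix in which two missing entries were coded as 9.
freq <- matrix(c(0.1, 9, 0.3, 0.4, 9, 0.6), nrow = 2)
clean <- sentinel_to_na(freq)
```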