HannahVMeyer / PhenotypeSimulator

Other
28 stars 7 forks source link

Question regarding standardiseGenotypes() and getAlleleFrequencies() #17

Closed gamason closed 3 years ago

gamason commented 5 years ago

Hello!

I have been following the code that is described in the manual. My goal is to use my own genotype data to simulate a complex trait and your package seems perfect to address my goals. My reason for using my own genotypes is to preserve some of the real (and possibly problematic) population structure in the data.

However, I have run into an issue with the standardiseGenotypes() function. Below, I have pasted the code I have been using. I have also provided a Google drive link for the files of interest here. My input files are in plink1.9 format, if that helps. As I understand it, alleles are encoded in .bed files with following format: 00 | Homozygous for first allele in .bim file 01 | Missing genotype 10 | Heterozygous 11 | Homozygous for second allele in .bim file

I made my plink formatted files with the following command

plink1.9 --vcf SLC.allch.subset1k.vcf --allow-extra-chr --out SLC.allch.subset1k

in R-3.5.3, I ran the following lines of code

library('PhenotypeSimulator') genotypefile <- 'SLC.allch.subset1k' genotypes <- readStandardGenotypes(N=166, filename = genotypefile, format="plink", verbose=TRUE) genotypes_sd <- standardiseGenotypes(genotypes$genotypes) Error in FUN(X[[i]], ...) : SNP vector contains alleles not encoded as 0, 1 or 2

As I understand it, the function standardiseGenotypes() (and therefore getAlleleFrequencies()) is having trouble with how the genotypes are coded in my input files. Is there a way I can work around this? Are missing genotypes by any chance causing a problem? Is there specific flag that I need plink1.9 into take in account when making the .bed file?

Thank you for any help! This is a wonderful package and I look forward to using it!

~AM

HannahVMeyer commented 5 years ago

Thank you for providing the sample code and data files. It is indeed caused by the NAs in you genotypes, which the simulation currently cannot handle and which I haven't thought about much yet. What should be done if a sample has missing genotypes for which an effect should be simulated:

  1. NA phenotype? This does not seem useful
  2. Impute missing genotypes? Hmisc::impute would be an option here
  3. Only fully genotyped individuals can be provided? That does not seem the most useful way either, as there will always be missing genotype calls in datasets.

I am tending to option 2 and including a parameter to specify that missing genotypes should be imputed; otherwise, will fail with an error message that missing genotypes were encountered.

I will look into this and get back to you.

Thanks,Hannah

gamason commented 5 years ago

Thank you so much the reply! From my end, I can also do some filtering to remove individuals that are missing too many genotypes or impute some of the missing genotypes.

Thank you again for the help!

HannahVMeyer commented 5 years ago

I have added a preliminary solution ( 616595c16652ed830cb17ff4908b7c52dcc69f84) to the dev branch of this repos. Before standardising, it will mean-impute missing values per SNP, and round results to closest integer. If you install

devtools::install_github("HannahVMeyer/PhenotypeSimulator", ref="dev")

and try to rerunning your code with impute=TRUE:

genotypes_sd <- standardiseGenotypes(genotypes$genotypes, impute=TRUE)

I will have to formalise this and check in other places how missing genotypes are dealt with. Additionally, will have to think about how to incorporate already imputed genotypes that are not hard calls. Thank you for pointing this out. I will update on this thread and once addressed push to master branch.