HannahVMeyer / PhenotypeSimulator

Other
28 stars 7 forks source link

Specified effects of individual SNPs #23

Closed roberthyde closed 4 years ago

roberthyde commented 4 years ago

I'm currently aiming to simulate a genetic dataset with some "known" genetic effects on a phenotype and a range of noise variables.

Is it possible within PhenotypeSimulator to specify the level of effect each SNP has on a phenotype? For example, could I have 2 SNPs which account for 30% and 50% of the variability in a single phenotype respectively (with lets say 100 observations) , 3 SNPs that are random (with no effect on phenotype), and the remaining 20% variation being random noise? A "perfect" model would then be able to explain 80% of the variation in phenotype from the 2 causal SNPs, and would ignore the other 3 uninformative SNPs.

I can't see a method of specifying the level of variability to be explained by each individual SNP in the vignette, do you have any ideas?

HannahVMeyer commented 4 years ago

Have a look at this step-by-step instruction of simulating phenotypes with 5 different components. You could then do something similar with just two components, fixed genetic effects and random noise.

Specifically, you could:

# simulate genotypes
genotypes <- simulateGenotypes(N = 100, NrSNP = 2, 
                               frequencies = c(0.05, 0.1, 0.3, 0.4), 
                               verbose = FALSE)

# simulate genetic fixed effects for two snps independently
causalSNP_1 <- getCausalSNPs(N=100, genotypes = genotypes$genotypes, 
                            NrCausalSNPs = 1, verbose = FALSE)
genFixed_1 <- geneticFixedEffects(N = 100, P = 1, X_causal = causalSNP_1, pIndependentGenetic=0)  

causalSNP_2<- getCausalSNPs(N=100, genotypes = genotypes$genotypes, 
                            NrCausalSNPs = 1, verbose = FALSE)
genFixed_2 <- geneticFixedEffects(N = 100, P = 1, X_causal = causalSNP_2, pIndependentGenetic=0)

# simulate observational noise effects
noiseBg <- noiseBgEffects(N = 100, P = 1, independent=FALSE)  

Then you'd set the desired amount of variance explained per component:

var_snp_1 <- 0.3
var_snp_2 <- 0.5
genVar <- var_snp_1 + var_snp_2
noiseVar <- 1 - genVar

# rescale phenotype components
genFixed_1_scaled <- rescaleVariance(genFixed_1$shared, var_snp_1)
genFixed_2_scaled <- rescaleVariance(genFixed_2$shared, var_snp_2)
noiseBg_scaled <- rescaleVariance(noiseBg$shared, noiseVar)

Finally you add up all components:

# combine components into final phenotype
Y <- scale(genFixed_1_scaled$component + genFixed_2_scaled$component + noiseBg_scaled$component)
roberthyde commented 4 years ago

Perfect, many thanks Hannah that's worked a charm.

Is it also possible to specify correlations between SNPs, for example if I wished two SNPs to be correlated with each other (let's say a correlation of 0.4)? And can I subsequently then specify the amount of variation in the phenotype explained by one or both of these correlated SNPs?

If not, I suppose an alternative route would be to identify SNPs that happen to be correlated to a certain degree after initial SNP generation and select those specific SNPs to explain a specified amount of variance in phenotype as you have shown previously.

HannahVMeyer commented 4 years ago

There is no direct implementation for this currently, as the SNPs are simulated independently with binomial distribution.

I found an arxiv paper on correlated binomial models and this on stack-overflow for generating correlated binomial variables in R. You could use this code, to simulate SNPs with the desired correlation structure, save them in the format output by simulateGenotypes and then follow along as described above. This might be a good addition to the genotype simulation function, but I will not have the time to look into it in the next couple of weeks, unfortunately.

Alternatively, you could sample from real genotypes, like from the 1000 Genomes or HapMap project. Filter for SNPs with desired r2 first (using plink for instance). PhenotypeSimulator can read genotypes in various standard formats, including plink binary format (.bim/.fam/.bed). I have another R package for genotype qc called plinkQC and this vignette shows how to download and format 1000Genomes data, if that's an avenue you want to follow.

HannahVMeyer commented 4 years ago

I am closing this issue for now. Feel free to reopen, if you have any further questions/comments.