adimitromanolakis / sim1000G

Simulation of rare and common variants based on 1000 genomes data
17 stars 1 forks source link

Is there a way to specify percentage of causal variants in SNPs simulation process using sim1000G? #6

Closed BGerm95 closed 3 years ago

BGerm95 commented 3 years ago

Dear @adimitromanolakis and all others,

I am using sim1000G to simulate data for case/control study. I can not figure it out how to manipulate with this code to be able to generate 10% or 50% causal SNPs.

Can anyone explain to me how I can modify code in a way that it will simulate:

  1. 50 % of causal SNPs ( exmpl. 24 causal variants and 24 non causal SNPs)
    1. 10 % of causal SNPs (exmpl. 5 causal and 43 non causal SNPs)

I would like to reuse code from Example3:

library(sim1000G)
vcf_file = "region-chr4-357-ANK2.vcf.gz" #nvariants = 442, ss=1000

vcf = readVCF( vcf_file, maxNumberOfVariants = 442  ,min_maf = 0.0005,max_maf = 0.01) #lowest MAF
dim( vcf$gt1 ) #rows represent number of variants, columns represent number of individuals

## Download and use full chromosome genetic map
downloadGeneticMap(4)
readGeneticMap(4)

sample.size=3000

startSimulation(vcf, totalNumberOfIndividuals = sample.size)

data_sim = function(seed.num){

  SIM$reset()

  id = generateUnrelatedIndividuals(sample.size)

  gt = retrieveGenotypes(id)

  freq = apply(gt,2,sum)/(2*nrow(gt))
  causal = sample(setdiff(1:ncol(gt),which(freq==0)),45)

  beta.sign = rep(1,45)
  c.value = 0.402
  beta.abs = c.value*abs(log10(freq[causal]))
  beta.val = beta.sign*beta.abs
  x.bar = apply(gt[,causal],2,mean)
  x.bar = as.matrix(x.bar)
  beta.val = t(as.matrix(beta.val))
  #disease prvalance = 1%
  #beta0 = -log(99)-beta.val %*% x.bar
  #disease prvalance = 1.5%
  beta0 = 0-beta.val %*% x.bar

  eta = beta.val %*% t(gt[,causal])
  eta = as.vector(eta) + rep(beta0,nrow(gt))
  prob = exp(eta)/(1+exp(eta))

  genocase = rep(NA, sample.size)

  set.seed(seed.num)
  for(i in 1:sample.size){
    genocase[i] = rbinom(1, 1, prob[i])
  }
  case.idx = sample(which(genocase==1),1000)
  control.idx = sample(which(genocase==0),1000)

  return(rbind(gt[case.idx,],gt[control.idx,]))
}

I will be grateful for all suggestions. Thank you in advance.

adimitromanolakis commented 3 years ago

Hi,

you can modify the line:

causal = sample(setdiff(1:ncol(gt),which(freq==0)),45)

This selects 45 causal variants, among variants that have non-zero frequency (computed in the line above).

Best,

Apostolos

On Wed, 28 Oct 2020 at 16:10, BGerm notifications@github.com wrote:

Dear @adimitromanolakis https://github.com/adimitromanolakis and all others,

I am using sim1000G to simulate data for case/control study. I can not figure it out how to manipulate with this code to be able to generate 10% or 50% causal SNPs. I assume that can be set up in this part of code.

freq = apply(gt,2,sum)/(2*nrow(gt)) causal = sample(setdiff(1:ncol(gt),which(freq==0)),45)

Can anyone explain to me what those formulas means and how I can modify code in a way that it will simulate:

  1. 50 % of causal SNPs ( exmpl. 24 causal variants and 24 non causal SNPs)
  2. 10 % of causal SNPs (exmpl. 5 causal and 43 non causal SNPs)

I would like to reuse code from Example3:

library(sim1000G) vcf_file = "region-chr4-357-ANK2.vcf.gz" #nvariants = 442, ss=1000

vcf = readVCF( vcf_file, maxNumberOfVariants = 442 ,min_maf = 0.0005,max_maf = 0.01) #lowest MAF dim( vcf$gt1 ) #rows represent number of variants, columns represent number of individuals

Download and use full chromosome genetic map

downloadGeneticMap(4) readGeneticMap(4)

sample.size=3000

startSimulation(vcf, totalNumberOfIndividuals = sample.size)

data_sim = function(seed.num){

SIM$reset()

id = generateUnrelatedIndividuals(sample.size)

gt = retrieveGenotypes(id)

freq = apply(gt,2,sum)/(2*nrow(gt)) causal = sample(setdiff(1:ncol(gt),which(freq==0)),45)

beta.sign = rep(1,45) c.value = 0.402 beta.abs = c.valueabs(log10(freq[causal])) beta.val = beta.signbeta.abs x.bar = apply(gt[,causal],2,mean) x.bar = as.matrix(x.bar) beta.val = t(as.matrix(beta.val))

disease prvalance = 1%

beta0 = -log(99)-beta.val %*% x.bar

disease prvalance = 1.5%

beta0 = 0-beta.val %*% x.bar

eta = beta.val %*% t(gt[,causal]) eta = as.vector(eta) + rep(beta0,nrow(gt)) prob = exp(eta)/(1+exp(eta))

genocase = rep(NA, sample.size)

set.seed(seed.num) for(i in 1:sample.size){ genocase[i] = rbinom(1, 1, prob[i]) } case.idx = sample(which(genocase==1),1000) control.idx = sample(which(genocase==0),1000)

return(rbind(gt[case.idx,],gt[control.idx,])) }

I will be grateful for all suggestions. Thank you in advance.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/adimitromanolakis/sim1000G/issues/6, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEGJVY3SSCT35535WBZGTWLSNAQ5TANCNFSM4TCMBG4A .