adimitromanolakis / sim1000G

Simulation of rare and common variants based on 1000 genomes data
17 stars 1 forks source link

Comparing real vs simulated samples #4

Closed jcgrenier closed 3 years ago

jcgrenier commented 4 years ago

Hello @adimitromanolakis ,

We previously worked with your tool during the last year in order to do data augmentation. However, we found out that the simulated samples looks a lot more alike to each other when compared to real samples. We could saw this using a PCA method for instance. Have you ever saw that before? Could this be something we do wrong or you would expect this behaviour?

Thanks a lot for your help! Have a nice day!

adimitromanolakis commented 4 years ago

Hi, we haven't noticed such behavior in our tests. In our tests with PCA the results were consistent with the original samples. I would suggest looking at the genotypes and correlations between markers of the original and simulated samples. If you find some specific deviation in the correlations, can you update us with some example results and input files if possible?

jcgrenier commented 4 years ago

Hi, Here's the command lines I used for each of the population.

 library(sim1000G)

vcf_file = "affy_harmonized_maf005_pruned.phased.chr$chr.$pop.recode.vcf.gz"
vcf  = readVCF(vcf_file, maxNumberOfVariants = $sites ,min_maf = NA)
readGeneticMapFromFile(filelocation="genetic_map_GRCh37_chr$chr.txt.gz")

startSimulation(vcf, totalNumberOfIndividuals = 1000)

ids = generateUnrelatedIndividuals(200)
genotype = retrieveGenotypes(ids)
write.table(genotype,file="chr$chr.simulate.200.$pop.012.txt") 

Do you see something obvious that could have done something wrong here? Also, just note that I put the entire chromosomes in each simulation and I'm specifying the number of SNPs explicitely using the maxNumberOfVariants option. I'm simulating 1000 samples, and generate 200 unrelated samples from those ones. I'll put together an example if you don't see anything wrong here. Thanks a lot! JC

jcgrenier commented 4 years ago

Hello @adimitromanolakis, Here's an example. I did a PCA on the data and I've put the result in there too.

affy_harmonized_maf005_pruned.phased.chr10.ACB.recode.vcf.gz chr10.simulate.200.ACB.012.txt.gz pca chr10 ACBpop

Hope this help reviewing my issue.

Best, JC

jcgrenier commented 4 years ago

Hello @adimitromanolakis, Just checking if you had by chance looked at my issue. Thanks a lot for your help.

JC

jcgrenier commented 4 years ago

Hello @adimitromanolakis , any news regarding this issue? Thanks a lot!

JC

adimitromanolakis commented 4 years ago

Hi JC, no we haven't seen this issue. Are you using the same set of variants to compute the PCA between the real and simulated data? Because you used the internal filtering of sim1000G, the set of variants will be different, compared to the original vcf file.

Apostolos

adimitromanolakis commented 4 years ago

Here is a code that used the filtered vcf to compute PCA, it seems fine for this example:


setwd("/tmp")

library(sim1000G)

vcf_file = "affy_harmonized_maf005_pruned.phased.chr10.ACB.recode.vcf.gz"
vcf  = readVCF(vcf_file, maxNumberOfVariants = 1000 ,min_maf = NA)
readGeneticMapFromFile(downloadGeneticMap(10))

startSimulation(vcf, totalNumberOfIndividuals = 1000)

SIM$reset()
ids = generateUnrelatedIndividuals(200)
genotype = retrieveGenotypes(ids)

str(genotype)

p1 <- prcomp((genotype[,1:500]) )
str(p1)

# Retrieve filtered genotypes
og = t(vcf$gt1+vcf$gt2)
str(og)

p2 <- prcomp((og[,1:500]) )

plot(p2$x[,1],p2$x[, 2], xlim=c(-10,10),ylim=c(-10,10))
points(p1$x[,1],p1$x[, 2],pch=20,col="blue")
jcgrenier commented 4 years ago

Hello @adimitromanolakis ,

Thanks a lot! I'll give it a look. It seems like you don't extract the genotypes the same way I did. What do you think might cause the difference? You are creating an additive genotype I guess?

I was doing the PCA using some methods made up for genotyping datasets. I will try using your way!

Thanks!

JC