adimitromanolakis / sim1000G

Simulation of rare and common variants based on 1000 genomes data
Mechanism of startSimulation #11

Open zhangbs92 opened 3 years ago

zhangbs92 commented 3 years ago

Hi all,

Suppose my VCF file has 10 unrelated individuals and 100 variants, and I want to generate 1 family with 6 kids (which needs 2 individuals from the original 10) plus 8 unrelated individuals.

I did it this way:

startSimulation(vcf, totalNumberOfIndividual=10)

fam_data <- newFamilyWithOffspring(1, 6)

pop_id <- generateUnrelatedIndividuals(8)

but I get an error message telling me to increase the total number in startSimulation.

This confuses me. Increasing the number to 20 solves the problem, but what is the mechanism behind it? Are those 20 individuals correlated somehow? Can I call startSimulation once and then sample both family and population data as I did? My goal is to use 2 of the 10 as founders to generate family data and the remaining 8 to form population data. How can I do that?

Best,

zhangbs92 commented 3 years ago

A follow-up question: I only have 10 unrelated individuals in my VCF file, so how is it possible to simulate more than 10 unrelated individuals? Are those individuals really unrelated?

adimitromanolakis commented 3 years ago

Hi,

I think the confusion arises because totalNumberOfIndividual probably should have been named maxNumberOfIndividuals. It is the maximum number of individuals that will ever exist in the simulation. It is just a technicality, used to pre-allocate space for the data. You can use any number larger than the number of individuals you are going to simulate and the results will be the same.
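As a rough sketch of the scenario in the question (one family with 6 offspring plus 8 unrelated individuals): as far as I can tell, the two parents and the six children each occupy a slot, so the family needs 8 slots and the unrelated individuals another 8. That makes 16 in total, which is why a cap of 10 fails and 20 works. The snippet below assumes the example VCF that ships with the package and spells the argument totalNumberOfIndividuals; the helper variables n_offspring, n_unrelated and n_needed are only for illustration, and it is worth checking ?startSimulation for the installed version.

```r
library(sim1000G)

# Example VCF bundled with the package (an assumption: substitute your own file).
vcf_file <- file.path(system.file("examples", package = "sim1000G"), "region.vcf.gz")
vcf <- readVCF(vcf_file, maxNumberOfVariants = 100, min_maf = 0.02, max_maf = NA)

n_offspring <- 6
n_unrelated <- 8

# 2 parents + 6 offspring + 8 unrelated = 16 slots needed.
n_needed <- (2 + n_offspring) + n_unrelated

# The cap only pre-allocates storage, so leaving some headroom is harmless.
startSimulation(vcf, totalNumberOfIndividuals = n_needed + 10)

# One family (id 1) with two simulated parents and six offspring.
fam_data <- newFamilyWithOffspring(1, n_offspring)

# Eight further individuals, simulated independently of the family.
pop_ids <- generateUnrelatedIndividuals(n_unrelated)

# Genotype matrices (individuals in rows, variants in columns).
fam_genotypes <- retrieveGenotypes(fam_data$gtindex)
pop_genotypes <- retrieveGenotypes(pop_ids)
```

Note that the parents here are newly simulated individuals, not two of the original VCF samples.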

For your second question: sim1000G computes the LD structure of the region from the individuals in the VCF file and then uses this information to generate new individuals. So you can generate more individuals than the VCF file contains, but obviously, if you only have 10, the accuracy of the LD computations will not be great. In any case, even if you generate 10000 new individuals, they will all be unrelated.
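To illustrate the general idea of simulating from an estimated distribution rather than resampling the input, here is a minimal base-R sketch. It is not the sim1000G implementation: it uses a made-up toy haplotype panel, and the function simulate_haplotypes is hypothetical. It estimates allele frequencies and a correlation (LD) matrix, draws correlated normal variables, and thresholds them into 0/1 alleles. The frequencies are matched closely, while the LD is only approximately preserved by this simple thresholding.

```r
set.seed(1)

# Toy reference panel: 20 haplotypes x 3 variants (0 = ref, 1 = alt). Made-up data.
ref_haps <- cbind(
  v1 = c(rep(1, 8), rep(0, 12)),
  v2 = c(rep(1, 7), 0, 1, rep(0, 11)),  # nearly identical to v1, so in strong LD
  v3 = rep(c(1, 0, 0, 0), times = 5)
)

freq <- colMeans(ref_haps)  # alternate allele frequencies
ld   <- cor(ref_haps)       # pairwise correlation between variants

# Hypothetical helper: draw n new haplotypes from the estimated structure.
simulate_haplotypes <- function(n, freq, ld) {
  # Correlated standard normals via the Cholesky factor of the LD matrix.
  z <- matrix(rnorm(n * length(freq)), nrow = n) %*% chol(ld)
  # Threshold each column so that P(allele = 1) equals the target frequency.
  thresholds <- qnorm(1 - freq)
  sweep(z, 2, thresholds, ">") * 1L
}

# Far more haplotypes than the 20 in the panel; each row is an independent draw.
new_haps <- simulate_haplotypes(10000, freq, ld)

print(round(rbind(input = freq, simulated = colMeans(new_haps)), 3))
```

Because every row is drawn independently from the same distribution, the simulated haplotypes share population-level structure (frequencies and LD) but no pedigree relationship, which is the sense in which the generated individuals are unrelated.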

Best, Apostolos

zhangbs92 commented 3 years ago

Thank you so much, Apostolos. Actually, it was my mistake: I mixed up simulating and sampling. I checked the source code and found that it simulates from the "distribution" of the input data; the main goal is to preserve the allele frequencies and LD structure of the input data. It is completely clear to me now.