davidebolo1993 / VISOR

VarIant SimulatOR for short, long and linked reads
GNU Lesser General Public License v3.0

purity question #5

Closed jiadong324 closed 4 years ago

jiadong324 commented 4 years ago

Hi,

I read through the supplementary material on purity and what you described in a previous issue, but I am not sure I am using the parameter correctly.

For example, I am simulating 50X from two haplotypes, h1 and h2, where h1 is hacked with SVs and h2 only with SNPs. The simulated SVs on h1 are therefore heterozygous (HET, according to your previous explanation), so their allele fraction is 50%, if I understand correctly. If I set purity to 100, SHORtS will sequence equally from each haplotype. If I instead set purity to 80 and keep everything else unchanged, 80% of the reads come from the two hacked haplotypes and the remaining 20% are generated from the reference genome. The allele fraction of the simulated SVs should then be 40%.

In my previous non-haplotype simulation, when I wanted to simulate an SV with an allele fraction of 50% at 50X coverage, I would use wgsim to sequence 25X of reads from the hacked genome and another 25X from the reference genome.

Please let me know if I understand this parameter correctly. Thanks a lot!
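The arithmetic behind that two-genome mix can be sketched in a few lines of Python. This is only an illustration, not code from wgsim or VISOR; the function name `mix_coverage` is hypothetical, and it assumes every read drawn from the hacked genome carries the SV.

```python
def mix_coverage(total_coverage, allele_fraction):
    """Split total coverage between a hacked genome and the reference.

    Assumption: every read from the hacked genome supports the SV, so the
    allele fraction equals the share of coverage drawn from that genome.
    """
    if not 0.0 <= allele_fraction <= 1.0:
        raise ValueError("allele_fraction must be between 0 and 1")
    hacked = total_coverage * allele_fraction
    reference = total_coverage - hacked
    return hacked, reference


# 50X total at AF 50%: half from the hacked genome, half from the reference.
print(mix_coverage(50, 0.5))  # (25.0, 25.0)
```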

Best, Jiadong

davidebolo1993 commented 4 years ago

Hi @jiadong324,

Indeed, you got it right. When simulating from a single clone (a single HACk folder with one or more haplotypes), the purity column indicates the percentage of reads that are simulated from the reference (for the same region specified in the BED for SHORtS/LASeR) with respect to the total coverage (that is, if you simulate 50X at AF 50%, you get 25X from the 2 haplotypes and 25X from the reference).
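For the 50X example, the per-source coverage can be worked out as in the sketch below. It is not VISOR code: `split_coverage` is a hypothetical helper, it takes purity as the percentage of reads drawn from the clone (at 50% the split is the same under either reading), and it assumes the clone's haplotypes are sampled equally.

```python
def split_coverage(total_coverage, purity_percent, n_haplotypes):
    """Return the coverage drawn from the clone, each haplotype, and the reference.

    Assumptions: purity_percent is the share of reads from the clone, and
    the clone's haplotypes contribute equally.
    """
    clone = total_coverage * purity_percent / 100.0
    reference = total_coverage - clone
    per_haplotype = clone / n_haplotypes
    return {"clone": clone, "per_haplotype": per_haplotype, "reference": reference}


# 50X with a 50/50 split and two haplotypes.
print(split_coverage(50, 50, 2))
# {'clone': 25.0, 'per_haplotype': 12.5, 'reference': 25.0}
```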

Best,

Davide

jiadong324 commented 4 years ago

Hi @davidebolo1993,

  1. According to your supplementary note, I think purity indicates the percentage of reads simulated from the hacked genome, with the rest coming from the reference genome.

  2. Let me make the AF example clearer. Suppose I hack the simulated SVs into only one of the two haplotypes, say h1. Introducing purity effectively brings in the reference as a third, virtual haplotype (g) alongside the hacked h1 and the non-hacked h2. When we start sequencing, we get reads from g, h1 and h2.

If you think the description in 2 is correct, please go to 3.

  3. Assume we set purity to 80%. SHORtS will sequence 80% of the reads from h1 and h2, and the remaining 20% will come from the reference genome (g). If reads are generated equally from h1 and h2, then 40% of the reads will come from the SV haplotype, h1. This results in SVs with an AF of 40%.
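The reasoning in the three points above reduces to a one-line formula, sketched here in Python. The function name `expected_allele_fraction` is hypothetical and this is not taken from the VISOR source; it assumes purity is the percentage of reads drawn from the hacked haplotypes and that those haplotypes are sampled equally.

```python
def expected_allele_fraction(purity_percent, n_haplotypes, n_sv_haplotypes):
    """Expected SV allele fraction for a single clone mixed with the reference.

    Assumptions: purity_percent of the reads come from the clone, the clone's
    haplotypes are sampled equally, and reference reads never carry the SV.
    """
    if n_sv_haplotypes > n_haplotypes:
        raise ValueError("n_sv_haplotypes cannot exceed n_haplotypes")
    return (purity_percent / 100.0) * n_sv_haplotypes / n_haplotypes


# SV on h1 only, two haplotypes in the clone.
print(expected_allele_fraction(80, 2, 1))   # 0.4
print(expected_allele_fraction(100, 2, 1))  # 0.5
```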

Sorry for so many questions; I am trying to understand the details and use VISOR properly.

Thanks!

davidebolo1993 commented 4 years ago

Hi @jiadong324,

no worries. I'm happy VISOR stimulates such an interest. The description you gave is perfectly fine and is, indeed, what VISOR does.

Best,

Davide

jiadong324 commented 4 years ago

Yes, VISOR is really helpful for what I am doing now.

So, if I set purity to 100% and hack the simulated SVs into h1, I will get SVs with an AF of 50%.

Thanks!