davidebolo1993 / VISOR

VarIant SimulatOR for short, long and linked reads
GNU Lesser General Public License v3.0

questions about nested events #4

Closed jiadong324 closed 4 years ago

jiadong324 commented 4 years ago

Hi,

I've done the following steps to simulate: 1) Added known variants to each chromosome, by haplotype, to create .h1.fa and .h2.fa. 2) Added complex SVs to haplotype one (h1). 3) Used SHORtS to simulate reads at 10X coverage. Purity and contamination are set to 100.

Here is one site after simulation; the red lines are the outer breakpoints of a nested event made of two adjacent duplications. I wrote a script that randomly combines these basic events into nested events, as you suggested before.

chr1,18343924,18347978 tandem duplication

chr1,18347979,18348745 inverted tandem duplication

[image: IGV screenshot of the simulated site]

My questions are: 1) What fraction of the short reads is simulated from each haplotype? 2) For tandem duplications, the duplicated sequence is appended directly after the original sequence. So, from my understanding, two adjacent tandem duplications may not be suitable for the simulation.

Thanks!

davidebolo1993 commented 4 years ago

Hi @jiadong324,

  1. Assuming you are using 10X coverage and 2 haplotypes in the input folder, each haplotype should be covered at 5X (for 10X coverage and 3 haplotypes in the input folder, it would be 3.33X for each haplotype, and so on).
  2. Assuming that you are specifying in the BED for haplotype 1 (during the HACk step) a tandem duplication and an adjacent inverted tandem duplication, you should see the 2 SVs one after the other (which is exactly what you specified). From your previous issue (please close it if you are not going to communicate through that one anymore), I understood that you were looking for SVs flanked by inversions. In this case, just specify inversion (instead of inverted tandem duplication) in the BED for HACk and you should see what you are looking for.
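
The arithmetic in point 1 can be sketched as follows. This is only an illustration of the even split described above, not VISOR's own code; the function name `per_haplotype_coverage` is hypothetical.

```python
def per_haplotype_coverage(total_coverage: float, n_haplotypes: int) -> float:
    """Split the requested coverage evenly across the haplotypes in the input folder.

    Assumption: every haplotype gets the same share, as described in point 1.
    """
    if n_haplotypes < 1:
        raise ValueError("need at least one haplotype")
    return total_coverage / n_haplotypes


# 10X over 2 haplotypes, then 10X over 3 haplotypes
print(per_haplotype_coverage(10, 2))
print(round(per_haplotype_coverage(10, 3), 2))
```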

Best,

Davide

jiadong324 commented 4 years ago

Hi @davidebolo1993,

Thanks for the reply. I've closed the previous issue.

As you mentioned, it is 5X on each haplotype. But for the adjacent tandem duplications shown in IGV, there are more normally aligned read pairs for the first tandem duplication.

Flanking inversion is just one case. I want to make nested events by randomly combining the events supported by VISOR, so that I can produce different types of complex SVs.

Thanks!

davidebolo1993 commented 4 years ago

Hi @jiadong324,

I'm not sure I completely got your question. You said that you see more "normal" read pairs for the first tandem duplication, but from IGV I can just see an increase in coverage for the 2 simulated tandem duplications, as expected. I see that for the first tandem duplication you have a few more reads than for the second (the inverted one), but this makes sense, as reads are drawn at random across the whole region/chromosome specified in the BED for SHORtS, and this is something that happens in true-to-life duplications.
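
To illustrate why random read placement alone gives slightly different depths in two windows with the same expected coverage, here is a toy sketch. It is not how SHORtS samples reads: the uniform placement of read starts, the region length, the 150 bp read length, the window positions and the function names are all assumptions made for illustration.

```python
import random


def mean_depth(read_starts, read_len, start, end):
    """Mean per-base depth over [start, end) for fixed-length reads."""
    covered = 0
    for s in read_starts:
        overlap = min(s + read_len, end) - max(s, start)
        if overlap > 0:
            covered += overlap
    return covered / (end - start)


def simulate(region_len=100_000, read_len=150, coverage=10.0, seed=0):
    """Place reads uniformly at random and compare two 1 kb windows."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_reads = int(coverage * region_len / read_len)
    starts = [rng.randrange(0, region_len - read_len + 1) for _ in range(n_reads)]
    # Both windows have the same expected depth; only the sampling differs.
    depth_a = mean_depth(starts, read_len, 20_000, 21_000)
    depth_b = mean_depth(starts, read_len, 60_000, 61_000)
    return depth_a, depth_b


depth_a, depth_b = simulate()
print(f"window A: {depth_a:.2f}X, window B: {depth_b:.2f}X")
```

Changing `seed` gives a different pair of depths each time, both scattered around the requested coverage.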

Let me know if I missed something.

Best,

Davide

jiadong324 commented 4 years ago

Hi @davidebolo1993

The simulation is correct if you look at the coverage. My concern is:

For example, if we sequence 10 reads for this region from two haplotypes, ideally 5 of them are sequenced from the SV haplotype. If this is true, then I would expect more abnormal read pairs in green (reverse-forward mapping) than I observe in IGV. The second, inverted tandem duplication looks fine.

Thanks!

davidebolo1993 commented 4 years ago

Hi @jiadong324,

Sorry, but there is something I'm still missing. If the second tandem duplication (the inverted one) looks fine, then the first one looks fine as well. Indeed, for a non-inverted tandem duplication, there shouldn't be any abnormal read pairs, if I understood correctly what you mean.

Best,

Davide