cerebis / sim3C

Read-pair simulation of 3C-based sequencing methodologies (HiC, Meta3C, DNase-HiC)
GNU General Public License v3.0

Unrealistic with normal insert size between HiC read pairs? #25

Closed by ivargr 5 months ago

ivargr commented 1 year ago

Hi!

I've been using sim3C for simulating HiC read pairs for benchmarking assembly scaffolders that use HiC data. I've noticed that the distance between HiC read pairs is chosen to be normally distributed (https://github.com/cerebis/sim3C/blob/4a224a5f531cb4bab06fae39530c44142efcc644/sim3C/simulator.py#L164).

I have limited experience with HiC data, but isn't a normally distributed insert size unrealistic? I thought the two reads of a HiC pair were supposed to be mostly close to each other, with a long tail towards long distances. Or am I wrong?

cerebis commented 5 months ago

Rather than dealing with ligation products from 3C protocols, the code you refer to models the size of conventional whole-genome shotgun (WGS) inserts, which go on to produce conventional WGS read-pairs.

The separation between ligation products derived from the same molecule is modelled by a composite function, found here:

https://github.com/cerebis/sim3C/blob/4a224a5f531cb4bab06fae39530c44142efcc644/sim3C/empirical_model.py#L30
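
To make the distinction concrete, here is a minimal sketch of the two kinds of distribution being discussed. It is not sim3C's code: the function names, the parameter values, and the choice of a truncated exponential mixed with a uniform component are all assumptions made for illustration, and the actual composite function is the one at the link above. It assumes NumPy is available.

```python
import numpy as np


def sample_wgs_insert_sizes(rng, n, mean=500.0, sd=50.0, min_size=50):
    """Conventional WGS inserts: normally distributed sizes, floored at min_size.

    Hypothetical helper; mean, sd and min_size are illustrative values only.
    """
    sizes = rng.normal(mean, sd, size=n)
    return np.maximum(sizes, min_size).astype(int)


def sample_ligation_separations(rng, n, molecule_len, p_short=0.8, decay=1e-4):
    """Intra-molecular ligation products: mostly short separations, long tail.

    Illustrative mixture (an assumption, not sim3C's model):
      - with probability p_short, an exponential decay truncated to
        [0, molecule_len], drawn by inverting its CDF;
      - otherwise, a uniform draw over [0, molecule_len].
    """
    # Inverse-CDF sampling of an exponential truncated at molecule_len.
    u = rng.random(n)
    scale = 1.0 - np.exp(-decay * molecule_len)
    short = -np.log(1.0 - u * scale) / decay

    # Uniform long-range component.
    long_range = rng.uniform(0, molecule_len, size=n)

    # Choose a component per sample.
    use_short = rng.random(n) < p_short
    return np.where(use_short, short, long_range).astype(int)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wgs = sample_wgs_insert_sizes(rng, 10_000)
    sep = sample_ligation_separations(rng, 10_000, molecule_len=1_000_000)
    print("WGS inserts:  mean", wgs.mean(), "median", np.median(wgs))
    print("3C separation: mean", sep.mean(), "median", np.median(sep))
```

In a sketch like this, the WGS insert sizes come out symmetric around the mean, while the ligation separations have a median far below their mean, which is the "mostly close, with a long tail" shape described in the question.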