cerebis / sim3C

Read-pair simulation of 3C-based sequencing methodologies (HiC, Meta3C, DNase-HiC)
GNU General Public License v3.0

What does readID represent in the output simulation data? #40

Open xujialupaoli opened 2 weeks ago

xujialupaoli commented 2 weeks ago

Thank you for providing such useful software. I used sim3C to simulate Hi-C data for my E. coli genome with the following command:

sim3C --profile mycom.txt -n 5000000 -l 150 -e DpnII -m hic /home/work/jialu/tetraploid_assembly/simulate_data/strain_fq/ref_genome/hap4.fa hap4_R1.fq hap4_R2.fq

The content of mycom.txt is as follows:

[image: contents of mycom.txt]

Here are the read IDs in the simulated output:

$ cat  /home/work//simulate_data/strain_fq/hic/hap1/hap1_R2.fq |grep "^@" |head -n 7
@SIM3C:3C:1:1:1:1 2:Y:18:1 HIC hap1:2220392 hap1:4072595
@SIM3C:WGS:1:1:1:2 2:Y:18:1 WGS hap1:3356458..3356840:R
@SIM3C:3C:1:1:1:3 2:Y:18:1 HIC hap1:3940006 hap1:3456994
@SIM3C:3C:1:1:1:4 2:Y:18:1 HIC hap1:22115 hap1:195624
@SIM3C:WGS:1:1:1:5 2:Y:18:1 WGS hap1:4349782..4350231:F
@SIM3C:WGS:1:1:1:6 2:Y:18:1 WGS hap1:3599187..3599569:R
@SIM3C:3C:1:1:1:7 2:Y:18:1 HIC hap1:622455 hap1:4763592

$ cat  /home/work/simulate_data/strain_fq/hic/hap4/hap4_R2.fq |grep "^@" |head -n 7
@SIM3C:WGS:1:1:1:1 2:Y:18:1 WGS hap4:3272044..3272421:F
@SIM3C:WGS:1:1:1:2 2:Y:18:1 WGS hap4:4037903..4038223:F
@SIM3C:WGS:1:1:1:3 2:Y:18:1 WGS hap4:555578..556018:F
@SIM3C:WGS:1:1:1:4 2:Y:18:1 WGS hap4:88266..88635:F
@SIM3C:3C:1:1:1:5 2:Y:18:1 HIC hap4:3929062 hap4:3222724
@SIM3C:3C:1:1:1:6 2:Y:18:1 HIC hap4:2390316 hap4:397284
@SIM3C:3C:1:1:1:7 2:Y:18:1 HIC hap4:1763726 hap4:706931

I don't understand what the read ID naming in the output means. Why do some reads start with "@SIM3C:3C" and others with "@SIM3C:WGS"? Is there a difference between these reads? And is that difference the reason the trailing fields differ, e.g. "WGS hap1:3356458..3356840:R" versus "HIC hap1:22115 hap1:195624"? Looking forward to your reply!
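
In case it helps anyone looking at the same output: the headers quoted above can be split into fields and tallied by their prefix with a short script. This is only a structural sketch based on the lines shown in this post. The field names (`kind`, `tag`, `coords`) are labels made up for illustration, not sim3C terminology, and the script says nothing about what the fields mean. It assumes standard 4-line FASTQ records and reads every fourth line, because `grep "^@"` can also match quality lines ('@' is a valid quality character).

```python
import re
from collections import Counter

# Structural pattern for the header lines quoted in this post.
# Group names are hypothetical labels for this sketch, not sim3C terms.
HEADER_RE = re.compile(
    r"^@SIM3C:(?P<kind>[^:\s]+):\S+"  # e.g. @SIM3C:3C:1:1:1:7
    r"\s+\S+"                         # e.g. 2:Y:18:1
    r"\s+(?P<tag>\S+)"                # e.g. HIC or WGS
    r"\s+(?P<coords>.+)$"             # remaining location text
)


def parse_header(line):
    """Split one header line into fields; return None if it does not match."""
    m = HEADER_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    return {
        "kind": m.group("kind"),
        "tag": m.group("tag"),
        "coords": m.group("coords").split(),
    }


def count_kinds(lines):
    """Tally headers by prefix, looking only at every 4th line.

    Assumes 4-line FASTQ records (header, sequence, '+', quality).
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 != 0:
            continue  # skip sequence, separator and quality lines
        rec = parse_header(line)
        if rec is not None:
            counts[rec["kind"]] += 1
    return counts
```

Passing an open file handle for one of the FASTQ files to `count_kinds` gives the number of `3C` and `WGS` headers, which makes it easy to compare the two files.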