cerebis / sim3C

Read-pair simulation of 3C-based sequencing methodologies (HiC, Meta3C, DNase-HiC)
GNU General Public License v3.0

The header format #13

Closed hsinnan75 closed 4 years ago

hsinnan75 commented 4 years ago

Hi, could you please explain the meaning of each field in the header of the output file? For example,

@SIM3C:1572507618:WGS:1:1:1:4 2:Y:18:1 WGS NC_000913:3003159..3003486:R

and

@SIM3C:1572507618:3C:1:1:1:1 2:Y:18:1 HIC NC_000913:3022391 NC_000913:4567348

I ran sim3C with -m hic; however, some reads are labelled "WGS" while others are labelled "HIC". In the former case, the last character is either F or R, which is a bit confusing.

Thank you!

cerebis commented 4 years ago

You've found a clear hole in the documentation.

The headers roughly follow the Illumina format documented here, although the fields are not strictly used in their proper sense.

From Illumina:

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>:<UMI> <read>:<is filtered>:<control number>:<index>

These fields map to their use in sim3C as follows:

| Field | Comment |
| --- | --- |
| Instrument | Indicates this read was produced by sim3C (always SIM3C) |
| Run number | The random seed used during simulation |
| Flowcell ID | Used to convey the type of read-pair emitted (WGS, HIC, META3C, etc.) |
| Lane | Not used and always 1 |
| Tile | Not used and always 1 |
| x-pos | Not used and always 1 |
| y-pos | A unique integer incremented as pairs are emitted during simulation |
| UMI | Defined as optional and not used by sim3C (not written) |
| read | Properly used to indicate whether the read is the first or second in pair (1 or 2) |
| is_filtered | A flag which signifies whether the read is filtered (always Y) |
| control bits | Not used and always 18 |
| index | Not used and always 1 |
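As a rough illustration of the mapping above, here is a minimal Python sketch that splits the Illumina-style part of a header into named fields. It is not part of sim3C: the names `Sim3CReadId` and `parse_read_id` are hypothetical, and the layout (seven colon-separated fields, a space, then four more) is assumed from the two example headers quoted in this thread.

```python
from typing import NamedTuple


class Sim3CReadId(NamedTuple):
    """Hypothetical container for the Illumina-style part of a sim3C header."""
    instrument: str   # "instrument" slot, always SIM3C
    seed: int         # "run number" slot: random seed used for the simulation
    pair_type: str    # "flowcell ID" slot: type of read-pair (WGS or 3C in the examples)
    lane: int         # not used, always 1
    tile: int         # not used, always 1
    x_pos: int        # not used, always 1
    pair_number: int  # "y-pos" slot: integer incremented as pairs are emitted
    read: int         # 1 or 2: first or second read of the pair
    is_filtered: str  # always Y
    control: int      # not used, always 18
    index: str        # not used, always 1


def parse_read_id(header: str) -> Sim3CReadId:
    """Split the first two whitespace-separated tokens of a header into fields."""
    tokens = header.lstrip("@").split()
    if len(tokens) < 2:
        raise ValueError(f"unexpected header layout: {header!r}")
    left = tokens[0].split(":")    # the UMI field is not written, so 7 fields
    right = tokens[1].split(":")
    if len(left) != 7 or len(right) != 4:
        raise ValueError(f"unexpected header layout: {header!r}")
    instrument, seed, pair_type, lane, tile, x_pos, y_pos = left
    read, is_filtered, control, index = right
    return Sim3CReadId(instrument, int(seed), pair_type, int(lane), int(tile),
                       int(x_pos), int(y_pos), int(read), is_filtered,
                       int(control), index)


if __name__ == "__main__":
    example = "@SIM3C:1572507618:WGS:1:1:1:4 2:Y:18:1 WGS NC_000913:3003159..3003486:R"
    print(parse_read_id(example))
```

Because the seed sits in the run-number slot, all reads from one simulation share that value, and the pair counter together with the read number identifies an individual read.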

After this Illumina-style header, sim3C appends a string whose form differs between WGS and 3C-style pairs.

For WGS the string defines the insert fragment which was used to create the read-pair. This includes the reference ID, the beginning and end coordinates of the fragment, and the orientation (F: forward, R: reverse).

For 3C-style pairs, the string is similar but since a fragment is the product of a ligation event, it encodes two reference regions.
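The two forms of the trailing string could be told apart along these lines. This is only a sketch under assumptions drawn from the two example headers in this thread: the string starts at the third whitespace-separated token, a WGS fragment looks like `<ref>:<start>..<end>:<F|R>`, and each 3C-style region looks like `<ref>:<coordinate>`. The names `WgsFragment`, `LigationSite` and `parse_descriptor` are hypothetical and not part of sim3C.

```python
import re
from typing import NamedTuple, Tuple, Union


class WgsFragment(NamedTuple):
    ref_id: str
    start: int
    end: int
    orientation: str  # "F" (forward) or "R" (reverse)


class LigationSite(NamedTuple):
    ref_id: str
    position: int


# Assumed formats, inferred from the example headers only.
_WGS_RE = re.compile(r"^(?P<ref>.+):(?P<start>\d+)\.\.(?P<end>\d+):(?P<ori>[FR])$")
_SITE_RE = re.compile(r"^(?P<ref>.+):(?P<pos>\d+)$")

Descriptor = Union[WgsFragment, Tuple[LigationSite, LigationSite]]


def parse_descriptor(header: str) -> Tuple[str, Descriptor]:
    """Return the pair label (e.g. WGS or HIC) and the parsed region(s)."""
    tokens = header.split()
    if len(tokens) < 4:
        raise ValueError(f"no sim3C descriptor found in: {header!r}")
    # tokens[0] and tokens[1] form the Illumina-style part of the header.
    label, regions = tokens[2], tokens[3:]
    if label == "WGS":
        match = _WGS_RE.match(regions[0]) if len(regions) == 1 else None
        if match is None:
            raise ValueError(f"unexpected WGS descriptor: {header!r}")
        fragment = WgsFragment(match["ref"], int(match["start"]),
                               int(match["end"]), match["ori"])
        return label, fragment
    # Any other label is treated as a 3C-style pair with two reference regions.
    if len(regions) != 2:
        raise ValueError(f"expected two regions for a 3C-style pair: {header!r}")
    sites = []
    for region in regions:
        match = _SITE_RE.match(region)
        if match is None:
            raise ValueError(f"unexpected region format: {region!r}")
        sites.append(LigationSite(match["ref"], int(match["pos"])))
    return label, (sites[0], sites[1])


if __name__ == "__main__":
    print(parse_descriptor(
        "@SIM3C:1572507618:3C:1:1:1:1 2:Y:18:1 HIC NC_000913:3022391 NC_000913:4567348"))
```

Treating every non-WGS label as a ligation product keeps the sketch agnostic about the exact labels sim3C emits for other methods, which are not listed here.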

Note: the control bits and index fields should really be revised to be 0 and perhaps ACGT, to at least comply with what is expected.

hsinnan75 commented 4 years ago

Thanks for the explanation!