bcgsc / NanoSim

Nanopore sequence read simulator

Sequence length and chimeric positions #119

Open mhuang00 opened 3 years ago

mhuang00 commented 3 years ago

Hello,

I am using the chimeric read simulation function and would like to use the positions of the chimeric regions introduced by the simulator. However, I can't make sense of the header, specifically the number of bases in the different regions. They don't seem to match the calculated sequence length either.

>ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0
Everything before the first _ is chromosome information. 468529 is the start position, and unaligned indicates that the read should not align to the reference. The first 0 is the sequence index. F represents the forward strand. 0_3236_0 means that the sequence extracted from the reference is 3236 bases long.
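As a rough illustration of the field layout just described, the example header can be split like this. This is only a sketch based on the single header quoted above; the function name and the dictionary keys (`head`, `middle`, `tail`, and so on) are my own labels, not NanoSim identifiers, and it assumes the chromosome part contains no underscore.

```python
def parse_simple_header(header: str) -> dict:
    """Split a non-chimeric header like the quoted example into named fields.

    Assumption: everything before the first "_" is chromosome information,
    and exactly seven "_"-separated fields follow it.
    """
    chrom, rest = header.lstrip(">").split("_", 1)
    start, status, index, strand, head, middle, tail = rest.split("_")
    return {
        "chromosome": chrom,
        "start": int(start),    # start position on the reference
        "status": status,       # "aligned" or "unaligned"
        "index": int(index),    # sequence index
        "strand": strand,       # "F" for forward
        "head": int(head),      # first of the three length fields
        "middle": int(middle),  # bases extracted from the reference
        "tail": int(tail),      # last of the three length fields
    }


fields = parse_simple_header(
    ">ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0"
)
print(fields["start"], fields["strand"], fields["middle"])
```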

In the example given, for instance, the sequence length is 3236.

>CM009455_67889975;aligned_71_R_6_14611_5
>CM009451_58207256;CM009458_80102289;aligned_93_chimeric_R_5_1127;7966_7

In the two examples I've picked out, the calculated sequence lengths are 14157 and 11244 respectively.

  1. The numbers in the header don't sum to the sequence lengths: 6+14611+5 = 14622 and 5+1127+7966+7 = 9105 respectively.
  2. Is 5+1127 the starting position of the chimeric region? If not, how can I calculate the starting position of the chimeric region?

Thanks!

cheny19 commented 3 years ago

Hi @mhuang00,

Q1: First, the numbers in the middle (for example 14611, 1127, and 7966) are the aligned bases on the reference genome, not on the simulated sequence, so the sum is a bit off. Second, there are gaps between segments in chimeric reads, and the lengths of the gaps are not reflected in the header, so your calculated length for the chimeric read deviates further.

Q2: 5 is the length of the unaligned head region of the chimeric read, so the aligned part of the first segment starts at position 6 (1-indexed).

It seems you are not using the latest version of NanoSim. The output of the latest version provides information about the gap sizes as well.
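For the older header format shown in this thread, the arithmetic from Q1 and Q2 can be sketched as follows. This is a minimal example that assumes only the two headers quoted in the question; the function name and keys are hypothetical, and headers from newer versions that include gap sizes may need different parsing.

```python
def parse_length_fields(header: str) -> dict:
    """Read head, aligned segment length(s), and tail from the end of a header.

    Assumption: the last three "_"-separated fields are head, middle, tail,
    and a chimeric read lists its segment lengths in the middle field,
    separated by ";".
    """
    _, head, middle, tail = header.lstrip(">").rsplit("_", 3)
    segments = [int(n) for n in middle.split(";")]
    head, tail = int(head), int(tail)
    return {
        "head": head,
        "segments": segments,  # aligned bases on the reference, per segment
        "tail": tail,
        # The aligned part of the first segment starts right after the head (1-indexed).
        "first_aligned_start": head + 1,
        # Sum of the header numbers only; gaps between segments are not included.
        "header_sum": head + sum(segments) + tail,
    }


for h in (
    ">CM009455_67889975;aligned_71_R_6_14611_5",
    ">CM009451_58207256;CM009458_80102289;aligned_93_chimeric_R_5_1127;7966_7",
):
    info = parse_length_fields(h)
    print(info["segments"], info["first_aligned_start"], info["header_sum"])
```

The header sum reproduces the totals in the question (14622 and 9105), not the read length: the middle numbers count reference bases, and this header format does not record gap sizes, so the start of the second segment cannot be derived from it.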

Feel free to contact me if anything is unclear.

Cheers, Chen