bcgsc / NanoSim

Nanopore sequence read simulator

Interpreting starting- and insertion-positions #67

Closed BenjaminAlbrecht84 closed 4 years ago

BenjaminAlbrecht84 commented 4 years ago

Hello,

I am trying to reverse-engineer how the reads are simulated from the genome, and I have some related questions:

  1. Is the starting index (as given in the read name) always relative to the plus strand of the genome, even if the read comes from the reverse strand? In other words, relative to the plus strand, what are the start and end indices of the read named ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2 ?
  2. Do the positions in the error profile always refer to the plus strand? In other words, what does it mean if, for a reverse read, an insertion of length 2 is reported at position 100?

Thanks for your help!

cheny19 commented 4 years ago

Hi,

  1. The positions are relative to the source strand. In your example, it ends at 115406 and starts at 115406 - 12710 on chromosome XI.

  2. The position in the error profile is relative to the plus strand. If there is an insertion at position 100 for a reverse read, it should actually be counted from the back of the read.
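One possible reading of point 2, sketched in Python: a position counted from one end of the read can be flipped to the other end if the read length is known. The function name `flip_interval`, the 0-based half-open convention, and the read length used below are assumptions for illustration only; the thread does not state how NanoSim indexes positions or which length the offset refers to.

```python
def flip_interval(pos: int, length: int, read_len: int) -> tuple[int, int]:
    """Flip the half-open interval [pos, pos + length) to the opposite orientation.

    Assumes 0-based offsets; read_len is the length of the sequence that the
    offsets refer to (a hypothetical value here, not taken from NanoSim).
    """
    if pos < 0 or length < 0 or pos + length > read_len:
        raise ValueError("interval does not fit inside the read")
    # The old end becomes the new start, both measured from the other end.
    return read_len - (pos + length), read_len - pos


# An insertion of length 2 at position 100, counted from the back of a
# hypothetical 12710-base read, expressed as offsets from the front.
front_start, front_end = flip_interval(100, 2, 12710)
```

Applying the function twice returns the original interval, which is a quick sanity check for whichever indexing convention is finally used.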

Hope that clarifies it.

Chen

BenjaminAlbrecht84 commented 4 years ago

Sorry to press you, but I have code that computes precision and recall for a set of NanoSim reads. If I change the start and end positions as suggested above, my recall drops to about 50%.

So I would guess that in the reverse case the start and end positions are defined as in the forward case (relative to the forward strand). In other words, for the read named ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2 the start position is 115406 and the end position is 115406 + 12710 (or the other way round, depending on the point of view).

Maybe I am missing something; could you please check again?

cheny19 commented 4 years ago

Sorry, I made a mistake. Yes, you are right, it should be 115406 + 12710. I was just trying to say it ends at 115406.
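For anyone evaluating against simulated reads, the conclusion above can be sketched in Python. The field names (`head`, `middle`, `tail`, and so on) are an assumed interpretation of the underscore-separated read name, and `parse_read_name` and `plus_strand_interval` are hypothetical helpers, not part of NanoSim. Whether the coordinates are 0-based or 1-based is not stated in this thread, so the interval is returned exactly as start and start + length.

```python
from typing import NamedTuple


class SimRead(NamedTuple):
    ref: str      # reference name (may itself contain underscores)
    start: int    # position given in the read name
    kind: str     # "aligned" or "unaligned"
    index: int    # running read index
    strand: str   # "F" or "R"
    head: int     # assumed: unaligned bases at the head
    middle: int   # assumed: length of the aligned region
    tail: int     # assumed: unaligned bases at the tail


def parse_read_name(name: str) -> SimRead:
    # Split from the right so underscores inside the reference name survive.
    ref, start, kind, index, strand, head, middle, tail = name.rsplit("_", 7)
    return SimRead(ref, int(start), kind, int(index), strand,
                   int(head), int(middle), int(tail))


def plus_strand_interval(read: SimRead) -> tuple[int, int]:
    # Per the thread: the same rule applies to forward and reverse reads,
    # i.e. the interval runs from start to start + aligned length.
    return read.start, read.start + read.middle


read = parse_read_name(
    "ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2"
)
interval = plus_strand_interval(read)  # (115406, 128116)
```

Because the strand flag does not change the arithmetic, a precision/recall script only needs the strand to decide how to orient the read sequence, not where it maps.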