bcgsc / NanoSim

Nanopore sequence read simulator
Other
233 stars 56 forks source link

Simulated sequence lengths #128

Closed AlphaSquad closed 3 years ago

AlphaSquad commented 3 years ago

Hi, when trying to create mapping files/CIGARs from NanoSim output I encountered a problem with the sequence lengths, somewhat similar to #119, but I think it is a separate problem, that is why I am opening this separate Issue.

Using the command simulator.py genome -n 4050 -rg GCA_001900715.1_ASM190071v1.fa -o test/Genome2.0 -c training --seed 1246114144 -dna_type linear using this error model, I get minor discrepancies in the expected vs simulated read lengths:

The first read of this simulation has a total length of insertions of 202 and a total length of deletions of 328, which I calculated using the _error_profile file (length of the insertions/deletions in the fourth column): head -n 528 test/Genome2.0_aligned_error_profile | grep "ins" | cut -f 4 | paste -sd+ | bc and head -n 528 test/Genome2.0_aligned_error_profile | grep "del" | cut -f 4 | paste -sd+ | bc This read has no unaligned bases in the end or start (read name is >CP010152_2172479;aligned_0_R_0_6839_0) and the aligned bases of the reference is 6839. Putting things together, I would expect the read to be 6839 (aligned length of reference) + 202 (insertions length) - 328 (deletions length) = 6713 bases long, but the read has length 6710. Do I have a a faulty reasoning for this length calculation? If yes, could you point out how I would end up with the correct length of the read using the _error_profile?

I attached the reference genome (so you can re-run the exact command) and also the output of NanoSim: NanoSim_test.zip

Thank you very much in advance

cheny19 commented 3 years ago

Hi @AlphaSquad,

In NanoSim, we first decide an original middle length on the reference, however the length is changed a bit after introducing errors to ensure the beginning and end of an aligned sequence can be matched. In previous versions of NanoSim we reported the adjusted length, but since v3.0.0, we reported the original middle_ref length instead. This is not an expected behaviour, so I just fixed it and made a new commit. Thanks for reporting it!

Chen

AlphaSquad commented 3 years ago

Thank you very much, the lengths are now as expected!