Reading the simulated_aligned_reads

bcgsc / NanoSim

Nanopore sequence read simulator


Reading the simulated_aligned_reads #209

Closed Evandio-Martin closed 1 month ago

Evandio-Martin commented 3 months ago

I want to analyze the NanoSim output in simulated_aligned_reads and compare it with the input, the GRCh37 human reference genome. I am using the pre-trained human Guppy model provided with NanoSim.

I have a question about how to read this read name:
NC-000011_21773883_aligned_2_F_2_2258_40
- Based on the README, 21773883 is the start position. Does that mean the character index counted from the top left of the input file, so that counting starts at the leading NNNNN?
- 2 is the sequence index. I don't understand this part. How many lines are in each sequence?
Last question: is it possible to compare the input and the output to check what NanoSim changed?

Thank you very much

SaberHQ commented 3 months ago

Thanks for your interest in using NanoSim, @Evandio-Martin.

The start position is the start index on the reference. In a genomic read simulation, it is a random position on the reference genome from which NanoSim extracts the read. In your example, 21773883 is the Pythonic (0-based) start position on that chromosome.
The sequence index is a unique identifier for the generated sequence.
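To make the naming concrete, here is a minimal Python sketch that splits a read name into its fields. Only the start position and the sequence index are explained in this thread; the meaning assigned to the remaining fields (aligned/unaligned flag, strand, head, middle, and tail lengths) is an assumption based on the NanoSim README and should be checked against the version in use. `ReadName` and `parse_read_name` are hypothetical names, not part of NanoSim.

```python
from typing import NamedTuple


class ReadName(NamedTuple):
    reference: str  # reference (chromosome) name as written in the read name
    start: int      # 0-based start position on the reference
    status: str     # "aligned" or "unaligned" (assumed meaning)
    index: int      # sequence index: an identifier for the read
    strand: str     # "F" or "R" (assumed meaning)
    head: int       # head length (assumed meaning)
    middle: int     # middle region length (assumed meaning)
    tail: int       # tail length (assumed meaning)


def parse_read_name(name: str) -> ReadName:
    # Split from the right so underscores inside the reference name survive.
    ref, start, status, index, strand, head, middle, tail = (
        name.lstrip(">@").rsplit("_", 7)
    )
    return ReadName(ref, int(start), status, int(index), strand,
                    int(head), int(middle), int(tail))


read = parse_read_name("NC-000011_21773883_aligned_2_F_2_2258_40")
print(read.reference, read.start, read.index)
```

The sequence index only tells reads apart; it says nothing about where a read sits in the file or on the reference.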

I did not get your second question. What do you mean by input and output? Did you train a NanoSim model yourself or did you use a pre-trained model? If you used a pre-trained model, by "input" do you mean the reference genome used for the simulation? In what aspects do you want to compare the reference genome and simulated reads?

Evandio-Martin commented 3 months ago

Thank you very much for your answer.

For the second question: that's right, I'm using the pre-trained model. By "input" I mean the GRCh37 reference genome, and by "output" I mean the simulated_aligned_reads. I'm not sure my approach is right, but I take the start position from a simulated read on a given chromosome and look up that position on the same chromosome in the reference genome. I then compare the two sequences to see how much the output of ./simulator.py differs from the input.
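The lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the start position is treated as a 0-based index into the chromosome sequence with the FASTA header line and line breaks removed, and any leading N characters are counted because they are part of the sequence. `read_fasta` and `reference_segment` are hypothetical helpers, not NanoSim functions.

```python
def read_fasta(path):
    """Return {record_name: sequence} with header lines and line breaks removed."""
    records, name, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks)
                # Keep only the first word of the header as the record name.
                name, chunks = (line[1:].split() or [""])[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def reference_segment(sequence, start, length):
    # 0-based slice; leading Ns are ordinary characters and are counted.
    return sequence[start:start + length]
```

Loading a whole human genome this way needs several gigabytes of memory, so it may be more practical to read a single chromosome file. Note also that the read name in this thread says NC-000011 while the RefSeq accession is written NC_000011, so record names may have to be matched by hand.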

I tried this, but the two sequences turned out to be completely different, so I suspect I am reading the start position wrongly: I'm not sure whether it is counted from the Ns at the beginning of the chromosome or from the first base after them.
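Even with the correct start position, an exact match should not be expected, because a simulated read carries simulated errors. A rough way to quantify the difference is sketched below in Python. It assumes, without confirmation in this thread, that the head and tail lengths in the read name describe flanking bases not taken from the reference, and that reads marked R are the reverse complement of the reference segment. `rough_similarity` is a hypothetical helper that uses `difflib` from the standard library, not a real aligner.

```python
from difflib import SequenceMatcher

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def rough_similarity(read_seq, ref_segment, strand="F", head=0, tail=0):
    """Return a 0..1 similarity between the read's middle region and the reference."""
    # Drop the assumed head and tail flanks from the read.
    core = read_seq[head:len(read_seq) - tail]
    if strand == "R":
        # Assumption: reverse-strand reads are reverse complemented.
        ref_segment = reverse_complement(ref_segment)
    matcher = SequenceMatcher(None, core.upper(), ref_segment.upper(),
                              autojunk=False)
    return matcher.ratio()
```

`SequenceMatcher` is slow on long sequences, so this is only suitable for checking a handful of reads; aligning the reads back to the reference with a dedicated aligner would give a more reliable comparison.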

I also cannot work from the sequence index, because I can't tell where a sequence starts or stops: there is no separator between sequences from start to end.

Thank you very much.

Evandio-Martin commented 3 months ago

Ah, sorry. I just realized how the sequence index works. So if I want to analyze one chromosome, the sequence index doesn't matter, because it's just an identifier for each sequence, right? In that case I should focus only on the start position.