bcgsc / NanoSim

Nanopore sequence read simulator

Index out of range #53

Closed Raymi1 closed 4 years ago

Raymi1 commented 5 years ago

Hello, I'm trying to run NanoSim, and after running simulator.py I get the following error message:

  File "~/soft/nanosim/src/simulator.py", line 739, in <module>
    main()
  File "~/soft/nanosim/src/simulator.py", line 733, in main
    max_readlength, min_readlength)
  File "~/soft/nanosim/src/simulator.py", line 295, in simulation
    read_mutated = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias, False)
  File "~/soft/nanosim/src/simulator.py", line 579, in mutate_read
    tmp_bases.remove(read[key + i])
IndexError: string index out of range

I tried changing --max_len, but it has no effect.

Do you know what the issue is?

Thanks for your help

cheny19 commented 5 years ago

Hi @Raymi1,

I'm afraid I have no idea without more details. Could you provide the command you used to run NanoSim, including both the profiling stage and simulation stage?

Thanks, Chen

seryrzu commented 5 years ago

Hi, I can confirm the issue. I am using ONT data from the T2T consortium. More specifically this dataset. I selected the first million reads from it and used their assembly as the reference.

I simulate reads from a simulated genome of length 120 kb.

Commands:

~/soft/NanoSim/src/read_analysis.py -i rel2_1mm.fasta -r ../../../assemblies/chm13.draft_v0.4.fasta -t 32

and

~/soft/NanoSim/src/simulator.py linear -r simulated_genome.fasta

Output of the second command:

Traceback (most recent call last):
  File "/home/abzikadze/soft/NanoSim/src/simulator.py", line 739, in <module>
    main()
  File "/home/abzikadze/soft/NanoSim/src/simulator.py", line 733, in main
    max_readlength, min_readlength)
  File "/home/abzikadze/soft/NanoSim/src/simulator.py", line 295, in simulation
    read_mutated = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias, False)
  File "/home/abzikadze/soft/NanoSim/src/simulator.py", line 579, in mutate_read
    tmp_bases.remove(read[key + i])
IndexError: string index out of range

What I also noticed is that the read_analysis.py script reported warnings:

WARNING! Mismatch parameters may not be optimal!
 [ 0.09585138  0.76972756  0.12696229] 9.27237638217e-05
2019-03-29 01:18:23: Mismatch fitting done
2019-03-29 01:18:23: Insertion fitting start
WARNING! Insertion parameters may not be optimal!
 [ 0.99150715  1.11476135  0.37797238  0.86210433] 0.000647619472938
2019-03-29 01:30:20: Insertion fitting done
2019-03-29 01:30:20: Deletion fitting start
/home/abzikadze/soft/NanoSim/src/mixed_model.py:24: RuntimeWarning: overflow encountered in power
  wei_cdf = 1 - np.exp(-1 * np.power(x / l, k))
WARNING! Deletion parameters may not be optimal!
 [ 0.98106902  0.97828069  0.21263187  0.95179468] 0.000780146024089
2019-03-29 01:45:51: Deletion fitting done
2019-03-29 01:45:51: Finished!

I'm not sure whether this is related to the issue.

seryrzu commented 5 years ago

I suspect that the reference genome at the simulation step is "too short", so the simulated read ends up the same length as the genome (lines 431--433, I guess), but the positions in e_dict are not checked against the read length in that case.

For example, adding print(len(read), key, key+i, val) on line 579 of simulator.py gives me 122752 148920 148920 ['mis', 4], i.e. the error position (148920) lies beyond the end of the read (122752).

So a temporary fix is to add the following right before line 579:

if key + i >= len(read): break
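
To show the effect of that guard in isolation, here is a minimal sketch. It is not NanoSim's mutate_read: the function name apply_mismatches, the simplified error dictionary, and the deterministic substitution are all assumptions made for illustration. Only the shape of the entries (position -> [type, length]) follows the output printed above.

```python
def apply_mismatches(read, error_dict):
    """Hypothetical, simplified stand-in for the mismatch branch of mutate_read.

    error_dict maps a start position to [error_type, length], e.g.
    {148920: ['mis', 4]}. Only 'mis' entries are handled here.
    Returns the mutated read and the list of positions that ran past its end.
    """
    bases = list(read)
    skipped = []
    for key in sorted(error_dict):
        err_type, length = error_dict[key]
        if err_type != "mis":
            continue
        for i in range(length):
            # The guard proposed above: stop before indexing past the read.
            if key + i >= len(bases):
                skipped.append(key)
                break
            # Deterministic substitution for illustration only: pick the
            # first base in "ACGT" that differs from the current one.
            choices = [b for b in "ACGT" if b != bases[key + i]]
            bases[key + i] = choices[0]
    return "".join(bases), skipped


mutated, skipped = apply_mismatches("ACGTACGT", {2: ["mis", 2], 7: ["mis", 4], 20: ["mis", 1]})
print(mutated, skipped)
```

Without the guard, the entries at positions 7 and 20 would raise the same IndexError as in the traceback. With it, the mismatch is truncated or dropped, which silently changes the simulated error profile, so this is a workaround rather than a proper fix.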

cheny19 commented 5 years ago

Hi @seryrzu ,

Thanks for pointing out this issue. In the next release, I'll report a warning when the simulated read length is longer than the reference genome.

The warning you saw in the profiling stage is not related to this issue. That occurs when the fitted model is not statistically identical to the empirical distribution. However, since the discrepancy is small, the model fitting is still considered OK and can be used for the simulation step. We are working on new models as well.

Raymi1 commented 5 years ago

I tried with another, longer genome and it worked. So it's consistent with the explanation of @seryrzu.

@cheny19, here are the command lines I used:

Profiling stage: ~/soft/nanosim/src/read_analysis.py -i ERR2025972.fasta -r CP009685.fasta

Simulation stage: ~/soft/NanoSim/src/simulator.py linear -r ../GCA_001393175.1_7054_1_35_genomic.fasta -c training

SaberHQ commented 4 years ago

Dear @Raymi1 and @seryrzu

Thanks for bringing this up. In the newest release, we check for this situation and skip cases in which the read length is longer than the reference. Please note that it is recommended to use the same reference for both the characterization and simulation steps. Naturally, when your average read length is longer than the reference you use for simulation, you may encounter this problem more often.

Please use the latest versions and let me know if anything is wrong.
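
As a rough illustration of the check described above (skipping reads that are longer than the reference), here is a sketch under stated assumptions. It does not reproduce the code in the NanoSim release: the function name place_reads, its parameters, and the warning text are hypothetical.

```python
import random
import warnings


def place_reads(ref_len, read_lengths, rng=None):
    """Hypothetical helper: pick a start position for each sampled read length.

    Reads longer than the reference are skipped rather than truncated, and a
    single warning reports how many were dropped.
    """
    rng = rng or random.Random(0)
    placed = []
    skipped = 0
    for length in read_lengths:
        if length > ref_len:
            # No valid start position exists for this read on the reference.
            skipped += 1
            continue
        # randint is inclusive at both ends, so start + length <= ref_len.
        start = rng.randint(0, ref_len - length)
        placed.append((start, length))
    if skipped:
        warnings.warn(
            f"{skipped} of {len(read_lengths)} reads skipped: "
            f"longer than the {ref_len} bp reference"
        )
    return placed, skipped


placed, skipped = place_reads(122752, [1000, 148920, 122752])
print(len(placed), skipped)
```

When most sampled lengths exceed the reference, nearly every read is skipped, which is why simulating from a short genome with a long-read profile hits this case so often.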