bcgsc / NanoSim

Nanopore sequence read simulator

Index out of range errors #166

Open RagnarGrootKoerkamp opened 2 years ago

RagnarGrootKoerkamp commented 2 years ago

I'm getting some index out of range errors, possibly because of setting the same value (or too close?) for -min and -max:

-min 10000 -max 10000:

2022-04-21 13:17:35: Start simulation of aligned reads
Process Process-1:
Traceback (most recent call last):
  File "/home/philae/.local/share/miniconda3/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/philae/.local/share/miniconda3/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/philae/.local/share/miniconda3/bin/simulator.py", line 1293, in simulation_aligned_genome
    remainder = int(remainder_lengths[each_read])
IndexError: list index out of range

and

-min 900000 -max 1100000:

2022-04-21 13:19:34: Start simulation of aligned reads
Process Process-1:
Traceback (most recent call last):
  File "/home/philae/.local/share/miniconda3/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/philae/.local/share/miniconda3/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/philae/.local/share/miniconda3/bin/simulator.py", line 1294, in simulation_aligned_genome
    head_vs_ht_ratio = head_vs_ht_ratio_list[each_read]
IndexError: list index out of range

SaberHQ commented 2 years ago

With the first case, obviously it is not logical to set min and max length equal to each other. With your second case scenario, I suspect that the reference genome you are using is smaller than the read lengths you specified. May I ask whether you are using the pre-trained models or if you trained your own model?

RagnarGrootKoerkamp commented 2 years ago

> With the first case, obviously it is not logical to set min and max length equal to each other.

Hmm OK, that wasn't obvious to me. I would like to generate some reads to test a pairwise aligner I'm working on, and for benchmarking it is nice to have reads of a specific length. I changed it to some interval around that length and it works now. Anyway, displaying a warning instead of just crashing would be nice ;)
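For illustration, a check like the following could run before the simulation starts and turn the crash into a warning. This is only a sketch and not NanoSim code: `check_length_bounds` is a hypothetical helper, and it assumes (based on the maintainer's comment above) that equal minimum and maximum lengths are not supported.

```python
import warnings


def check_length_bounds(min_len, max_len):
    """Return True if min_len < max_len; otherwise warn and return False.

    Hypothetical pre-flight check, not part of NanoSim.
    """
    if min_len >= max_len:
        warnings.warn(
            f"min_len ({min_len}) must be smaller than max_len ({max_len})"
        )
        return False
    return True
```

With `-min 10000 -max 10000` this would emit a warning and return `False`, so a wrapper script could stop early or widen the interval.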

> With your second case scenario, I suspect that the reference genome you are using is smaller than the read lengths you specified.

Oh right, that may well be the case. I am using some human genome reference but I noticed my fasta file also has some shorter sequences in addition to the long chromosomes. Again, a warning message would be nice.
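One way to test this suspicion is to list the reference sequences that are shorter than the requested maximum read length. The sketch below is a plain-Python FASTA scan, not anything NanoSim provides; `short_sequences` is a hypothetical helper name, and whether short sequences actually trigger the crash is still only a guess at this point.

```python
def short_sequences(fasta_path, max_len):
    """Yield (name, length) for each FASTA record shorter than max_len.

    Hypothetical helper for inspecting a reference; not part of NanoSim.
    """
    name, length = None, 0
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: report the previous one if it is short.
                if name is not None and length < max_len:
                    yield name, length
                fields = line[1:].split()
                name = fields[0] if fields else ""
                length = 0
            elif name is not None:
                length += len(line)
    # Report the final record, which has no following header line.
    if name is not None and length < max_len:
        yield name, length
```

For example, `list(short_sequences("input/reference/human.fa", 1100000))` would show which records in the reference used here are shorter than the `-max 1100000` setting from the second case.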

> May I ask whether you are using the pre-trained models or if you trained your own model?

I'm using pre-trained models, since I don't have direct access to reads.

My full NanoSim invocation is this, where {..} will be substituted by snakemake:

    simulator.py genome \
    --ref_g input/reference/human.fa \
    --output input/simulated/human-x{wildcards.x}-n{wildcards.n} \
    -dna_type linear \
    --model_prefix ../../nanosim/pre-trained_models/human_NA12878_DNA_FAB49712_guppy/training \
    --min_len {params.min} \
    --median_len {wildcards.n} \
    --max_len {params.max} \
    --sd_len 1.05 \
    --number {params.generate_x} \
    --strandness 1 \
    --seed 314151 \
    --num_threads 6