HadrienG / InSilicoSeq

:rocket: A sequencing simulator
https://insilicoseq.readthedocs.io
MIT License

ISS halts with: KeyError: '>' #159

Status: Closed (closed by jfrank87 4 years ago)

jfrank87 commented 4 years ago

Dear ISS developers and community,

The issue

Thanks for the software! I run into the following error: KeyError: '>' (I provided the complete stderr below). Any help is greatly appreciated.

What I'm trying to do

I'm trying to simulate a MiSeq sequencing dataset for:

Command used

iss generate --debug --seed 666 --ncbi bacteria --n_genomes_ncbi 30 --genomes genome_strains.fasta --model miseq --gc_bias --n_reads 25M --output data_med_strain1 --compress --cpus 32

Output

From the log and generated files, I understand:

Notes

Note that ISS works fine when I exclude the --genomes option (i.e. when it only downloads random genomes from NCBI). The error makes me think something could be off with the headers in the FASTA file. However, the genomes were downloaded directly from RefSeq, and the headers look fine:

gi|88193823|ref|NC_007795.1| Staphylococcus aureus subsp. aureus NCTC 8325 chromosome, complete genome
gi|384865886|ref|NC_017342.1| Staphylococcus aureus subsp. aureus TCH60, complete sequence
gi|1176473263|ref|NZ_CP013821.1| Piscirickettsia salmonis strain PM25344B, complete genome
gi|1128917433|ref|NZ_CP013768.1| Piscirickettsia salmonis strain PM23019A, complete genome
gi|189438863|ref|NC_010816.1| Bifidobacterium longum DJO10A, complete genome
NZ_CP008885.1 Bifidobacterium longum strain BXY01, complete genome
gi|213690928|ref|NC_011593.1| Bifidobacterium longum subsp. infantis ATCC 15697 = JCM 1222 = DSM 20088, complete sequence
gi|1092886434|ref|NZ_CP017403.1| Bordetella pertussis strain 509 chromosome, complete genome
gi|1057650404|ref|NZ_CP016431.1| Bordetella bronchiseptica strain I328 chromosome, complete genome
gi|560885319|ref|NZ_CM002059.1| Mycobacterium tuberculosis PanR0410 chromosome, whole genome shotgun sequence
gi|433629070|ref|NC_019951.1| Mycobacterium canettii CIPT 140070010, complete sequence
gi|433633012|ref|NC_019952.1| Mycobacterium canettii CIPT 140070017, complete sequence
gi|1063812051|gb|CP017100.1| Escherichia coli strain K-12 NEB 5-alpha chromosome, complete genome
gi|291280824|ref|NC_013941.1| Escherichia coli O55:H7 str. CB9615, complete genome
gi|222154829|ref|NC_011993.1| Escherichia coli LF82 chromosome, complete sequence
gi|983374764|ref|NZ_CP013029.1| Escherichia coli strain 2012C-4227 chromosome, complete genome
gi|1053250915|ref|NZ_CP015855.1| Escherichia coli strain EDL933-1 genome
gi|1152328528|ref|NZ_CP018976.1| Escherichia coli strain Ecol_545 chromosome, complete genome
gi|1149039824|gb|CP009259.1| Helicobacter pylori SS1 chromosome, complete genome
gi|254778738|ref|NC_012973.1| Helicobacter pylori B38, complete sequence
gi|1172552396|gb|CP020551.1| Streptococcus pneumoniae strain Hu15 genome
gi|1121073309|ref|NZ_CP018347.1| Streptococcus pneumoniae strain SWU02 chromosome, complete genome

stderr

Please find the complete log attached: ISS_336513_stderr.txt

"""
Traceback (most recent call last):
  File "/proj/jfrank/bm_benchmark/conda_env/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 418, in _process_worker
    r = call_item()
  File "/proj/jfrank/bm_benchmark/conda_env/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 272, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/proj/jfrank/bm_benchmark/conda_env/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 608, in __call__
    return self.func(*args, **kwargs)
  File "/proj/jfrank/bm_benchmark/conda_env/lib/python3.7/site-packages/joblib/parallel.py", line 256, in __call__
    for func, args, kwargs in self.items]
  File "/proj/jfrank/bm_benchmark/conda_env/lib/python3.7/site-packages/joblib/parallel.py", line 256, in <listcomp>
    for func, args, kwargs in self.items]
  File "/proj/jfrank/bm_benchmark/conda_env/lib/python3.7/site-packages/iss/generator.py", line 65, in reads
    forward, reverse = simulate_read(record, ErrorModel, i, cpu_number)
  File "/proj/jfrank/bm_benchmark/conda_env/lib/python3.7/site-packages/iss/generator.py", line 139, in simulate_read
    forward, 'forward', sequence, bounds)
  File "/proj/jfrank/bm_benchmark/conda_env/lib/python3.7/site-packages/iss/error_models/__init__.py", line 189, in introduce_indels
    if random.random() < deletions[position][mutable_seq[nucl].upper()]:
KeyError: '>'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/proj/jfrank/bm_benchmark/conda_env/bin/iss", line 10, in <module>
    sys.exit(main())
  File "/proj/jfrank/bm_benchmark/conda_env/lib/python3.7/site-packages/iss/app.py", line 547, in main
    args.func(args)
  File "/proj/jfrank/bm_benchmark/conda_env/lib/python3.7/site-packages/iss/app.py", line 260, in generate_reads
    args.gc_bias, mode) for i in range(cpus))
  File "/proj/jfrank/bm_benchmark/conda_env/lib/python3.7/site-packages/joblib/parallel.py", line 1017, in __call__
    self.retrieve()
  File "/proj/jfrank/bm_benchmark/conda_env/lib/python3.7/site-packages/joblib/parallel.py", line 909, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/proj/jfrank/bm_benchmark/conda_env/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 562, in wrap_future_result
    return future.result(timeout=timeout)
  File "/proj/jfrank/bm_benchmark/conda_env/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/proj/jfrank/bm_benchmark/conda_env/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
KeyError: '>'
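
Reading the innermost frame, the lookup `deletions[position][mutable_seq[nucl].upper()]` appears to fail because the character at that position in the sequence is `>` rather than a nucleotide. The sketch below reproduces that failure mode in isolation. The `deletions` table and the `deletion_probability` helper are hypothetical stand-ins inferred from the traceback line, not the actual ISS code.

```python
# Hypothetical stand-in for a per-position deletion table keyed by nucleotide.
# The real structure inside ISS may differ; this only mirrors the failing lookup.
deletions = [{"A": 0.01, "T": 0.01, "C": 0.01, "G": 0.01}]


def deletion_probability(sequence, position, nucl):
    """Look up the deletion probability for the base at index `nucl`."""
    return deletions[position][sequence[nucl].upper()]


# A clean sequence works: index 2 is "G", which is a key in the table.
print(deletion_probability("ACGT", 0, 2))

# A sequence containing a stray ">" fails with the same error as in the log.
try:
    deletion_probability("AC>GT", 0, 2)
except KeyError as err:
    print("KeyError:", err)
```

If this reading is right, the headers themselves can look fine while a `>` still ends up inside the sequence data of one record.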

jfrank87 commented 4 years ago

Hi All,

I looked into this a bit more, yet I could not find an issue with the FASTA file described above. I tried a range of other programs, and all of them parsed and used the FASTA file without a problem. However, I will close this issue for now, since InSilicoSeq performs flawlessly with all the other files I tested. That makes me think (and hope) the problem above is rare or exceptional, or perhaps due to user error. Thanks!

HadrienG commented 4 years ago

Hi!

Thanks for the detailed bug report, and I'm happy to hear you don't run into this issue anymore. Sorry I did not come back to you sooner. I tested a few parameter combinations this morning and could not reproduce it either. Would you mind attaching genome_strains.fasta? If I understand correctly, it always crashes with that file?

Thanks, Hadrien.
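
For anyone who hits the same KeyError, one way to narrow it down is to scan the input FASTA for sequence lines that contain unexpected characters. A `>` can end up in the middle of a sequence line when, for example, files are concatenated and one of them lacks a trailing newline. The following is a minimal sketch under that assumption, not part of ISS: the function name `find_suspect_lines` is hypothetical, and the allowed alphabet (IUPAC nucleotide codes plus the gap character) is an assumption, since the ISS error model may accept fewer symbols.

```python
import sys

# Assumed alphabet: IUPAC nucleotide codes plus "-" for gaps.
VALID = set("ACGTUNRYKMSWBDHV-")


def find_suspect_lines(lines):
    """Yield (line_number, current_header, bad_chars) for sequence lines
    that contain characters outside the assumed alphabet."""
    header = None
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A proper header line: remember it for reporting.
            header = line[1:]
            continue
        bad = sorted(set(line.upper()) - VALID)
        if bad:
            yield number, header, bad


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_fasta.py genome_strains.fasta
    with open(sys.argv[1]) as handle:
        for number, header, bad in find_suspect_lines(handle):
            print(f"line {number} (record: {header}): unexpected {bad}")
```

Any reported line that lists `>` among the unexpected characters would be a candidate for a header that was merged into the preceding sequence.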