HadrienG / InSilicoSeq

:rocket: A sequencing simulator
https://insilicoseq.readthedocs.io
MIT License
176 stars 32 forks source link

Biopython error with custom modell: "TypeError: We only allow python sequences (lists, tuples or strings) of length 36" #218

Closed TakacsBertalan closed 10 months ago

TakacsBertalan commented 2 years ago

Hi! I am trying to create in silico metagenome data with my own model. I created my modell from human stool shot-gun sequenced samples (Illumina NextSeq) following the "Error models" part of the documentation megahit -1 reads_R1.fastq -2 reads_R2.fastq -o asm bowtie2-build asm/final.contigs.fa miseq_asm/final.contigs bowtie2 -x asm/final.contigs -1 reads_R1.fastq \ -2 reads_R2.fastq | samtools view -bS | samtools sort -o mapping.bam samtools index mapping.bam iss model -b mapping.bam -o name_of_model After this, I tried to use my model to generate 5000000 reads from 200 bacterial genomes randomly downloaded from NCBI. iss generate --ncbi bacteria -U 200 --model ./name_of_model --n_reads 5000000 -o ./output The download of the genomes went without an issue, but at the read generation part I run into the following error message:

INFO:iss.download:Downloading GCF_014961985.1
INFO:iss.download:Downloading GCF_003955865.1
INFO:iss.download:Downloading GCF_005221645.1
INFO:iss.util:Stitching input files together
INFO:iss.app:Using lognormal abundance distribution
INFO:iss.app:Using 2 cpus for read generation
INFO:iss.app:Generating 5000000 reads
INFO:iss.app:Generating reads for record: NZ_CP077680.1
INFO:iss.app:Generating reads for record: NZ_CP016813.1
INFO:iss.app:Generating reads for record: NZ_CP022611.1
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 436, in _process_worker
    r = call_item()
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 288, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/site-packages/iss/generator.py", line 62, in reads
    forward, reverse = simulate_read(record, ErrorModel, i, cpu_number)
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/site-packages/iss/generator.py", line 163, in simulate_read
    reverse = ErrorModel.introduce_error_scores(reverse, 'reverse')
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/site-packages/iss/error_models/__init__.py", line 65, in introduce_error_scores
    record.letter_annotations["phred_quality"] = self.gen_phred_scores(
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/site-packages/Bio/SeqRecord.py", line 89, in __setitem__
    raise TypeError(
TypeError: We only allow python sequences (lists, tuples or strings) of length 36.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/deltagene/anaconda3/envs/insilicoseq/bin/iss", line 10, in <module>
    sys.exit(main())
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/site-packages/iss/app.py", line 608, in main
    args.func(args)
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/site-packages/iss/app.py", line 306, in generate_reads
    record_file_name_list = Parallel(
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/concurrent/futures/_base.py", line 438, in result
    return self.__get_result()
  File "/home/deltagene/anaconda3/envs/insilicoseq/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
TypeError: We only allow python sequences (lists, tuples or strings) of length 36.

I got the same error for mupltiple different custom models. For some, it seemed that the read generation was performed properly, but I checked out the fastq files and the read length was only 36 bases long for each read.

@NZ_CP027555.1_0_0/2
AAATCATCAGGTGTACGGTGTGCGTAAAGTCTGGC
+
AAAAAEEAEEEEEEEE6EEEEEEEEEEEEEEEEEE

Has anybody ran into the same issue? What could cause it? Thanks, Bertalan Takács

EDIT: Okay, I checked out my model and it seems like that the read length for some reason is set to 36. That explains at least one part of my problem, why the generated reads are short. Still, I don't know how this was set (the reads were generated from a sample with pretty good coverage).

HadrienG commented 10 months ago

Hi!

All input mapped reads should be of the same length (which will be the set read length in the model).

/Hadrien