CAMI-challenge / CAMISIM

CAMISIM: Simulating metagenomes and microbial communities
https://data.cami-challenge.org/participate
Apache License 2.0

IndexError: string index out of range #101

Closed FranziskaErber closed 3 years ago

FranziskaErber commented 3 years ago

Hello dear @AlphaSquad, I would like to perform simulations with CAMISIM to generate nanopore data. I downloaded CAMISIM and ran the script successfully with the default files, and it also worked fine with 6 sequence files. Now I would like to use around 35 FASTA sequence files [.fna]. They vary in composition: each file can contain chromosomal, plasmid and/or phage DNA sequences, in varying numbers. I provide the following:

config_nanosim.ini:

[Main]
seed=202102070000000
phase=0
max_processors=16
dataset_id=CPhPlEH
output_directory=out_nanosim_CPhPlEH_6Gb_seed202102070000000
temp_directory=/mnt/volume/LongReads/simulatedMock/CAMISIM/tmp
gsa=False
pooled_gsa=False
anonymous=False
compress=1

[ReadSimulator]
samtools=/usr/local/bin/samtools
readsim=/mnt/volume/LongReads/Tools/NanoSim/src/simulator.py
error_profiles=/mnt/volume/LongReads/Tools/CAMISIM/tools/nanosim_profile
size=6
type=nanosim
fragments_size_mean=270
fragment_size_standard_deviation=27

[CommunityDesign]
distribution_file_paths=/mnt/volume/LongReads/simulatedMock/CAMISIM/CPhPlH-files/CPhPlEH_abundance.tsv
ncbi_taxdump=/mnt/volume/LongReads/Tools/CAMISIM/tools/ncbi-taxonomy_20170222.tar.gz
strain_simulation_template=/mnt/volume/LongReads/Tools/CAMISIM/scripts/StrainSimulationWrapper/sgEvolver/simulation_dir
number_of_samples=1
number_of_communities=1

[community0]
metadata=/mnt/volume/LongReads/simulatedMock/CAMISIM/CPhPlH-files/CPhPlH_metadata.tsv
id_to_genome_file=/mnt/volume/LongReads/simulatedMock/CAMISIM/CPhPlH-files/CPhPlH_genome_to_id.tsv
genomes_total=41
genomes_real=41
max_strains_per_otu=1
ratio=1
mode=differential
log_mu=1
log_sigma=2
gauss_mu=1
gauss_sigma=1
view=False

genomes_to_id.tsv:

Ecoli_PhH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/Ecoli_PhH.fna
GCA_003086655.1_ASM308665v1_CPhH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCA_003086655.1_ASM308665v1_CPhH.fna
GCA_014217455.1_ASM1421745v1_genomic	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCA_014217455.1_ASM1421745v1_genomic.fna
GCF_000006985.1_ASM698v1_genomic	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000006985.1_ASM698v1_genomic.fna
GCF_000008625.1_ASM862v1_CPlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000008625.1_ASM862v1_CPlH.fna
GCF_000019965.1_ASM1996v1_genomic	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000019965.1_ASM1996v1_genomic.fna
GCF_000023865.1_ASM2386v1_CPhH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000023865.1_ASM2386v1_CPhH.fna
GCF_000024625.1_ASM2462v1_CPlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000024625.1_ASM2462v1_CPlH.fna
GCF_000025685.1_ASM2568v1_CPhPlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000025685.1_ASM2568v1_CPhPlH.fna
GCF_000144645.1_ASM14464v1_genomic	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000144645.1_ASM14464v1_genomic.fna
GCF_000172995.2_ASM17299v2_CPhPlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000172995.2_ASM17299v2_CPhPlH.fna
GCF_000179575.2_ASM17957v2_CPlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000179575.2_ASM17957v2_CPlH.fna
GCF_000183405.1_ASM18340v1_CPlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000183405.1_ASM18340v1_CPlH.fna
GCF_000214355.1_ASM21435v1_genomic	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000214355.1_ASM21435v1_genomic.fna
GCF_000259255.1_ASM25925v1_CPlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000259255.1_ASM25925v1_CPlH.fna
GCF_000473245.1_ASM47324v1_CPhH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000473245.1_ASM47324v1_CPhH.fna
GCF_000739395.1_ASM73939v1_CPhH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000739395.1_ASM73939v1_CPhH.fna
GCF_000828635.1_ASM82863v1_genomic	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000828635.1_ASM82863v1_genomic.fna
GCF_000833215.1_ASM83321v1_CPlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_000833215.1_ASM83321v1_CPlH.fna
GCF_001293145.1_ASM129314v1_PlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_001293145.1_ASM129314v1_PlH.fna
GCF_001518775.1_ASM151877v1_PlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_001518775.1_ASM151877v1_PlH.fna
GCF_001543105.1_ASM154310v1_CPhH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_001543105.1_ASM154310v1_CPhH.fna
GCF_001549695.1_ASM154969v1_PlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_001549695.1_ASM154969v1_PlH.fna
GCF_001729945.1_ASM172994v1_genomic	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_001729945.1_ASM172994v1_genomic.fna
GCF_002214545.1_ASM221454v1_CPhH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_002214545.1_ASM221454v1_CPhH.fna
GCF_002346025.1_ASM234602v1_CPhH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_002346025.1_ASM234602v1_CPhH.fna
GCF_002504385.1_ASM250438v1_genomic	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_002504385.1_ASM250438v1_genomic.fna
GCF_003019985.1_ASM301998v1_genomic	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_003019985.1_ASM301998v1_genomic.fna
GCF_003047065.1_ASM304706v1_CPhH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_003047065.1_ASM304706v1_CPhH.fna
GCF_003253775.1_ASM325377v1_CPhH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_003253775.1_ASM325377v1_CPhH.fna
GCF_003491205.1_ASM349120v1_CPhPlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_003491205.1_ASM349120v1_CPhPlH.fna
GCF_003667725.1_ASM366772v1_PlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_003667725.1_ASM366772v1_PlH.fna
GCF_003798325.1_ASM379832v1_CPhH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_003798325.1_ASM379832v1_CPhH.fna
GCF_003971565.1_ASM397156v1_PlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_003971565.1_ASM397156v1_PlH.fna
GCF_004168325.2_ASM416832v2_CPlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_004168325.2_ASM416832v2_CPlH.fna
GCF_007794935.1_ASM779493v1_CPhPlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_007794935.1_ASM779493v1_CPhPlH.fna
GCF_007917035.2_ASM791703v3_PlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_007917035.2_ASM791703v3_PlH.fna
GCF_013046825.1_ASM1304682v1_genomic	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_013046825.1_ASM1304682v1_genomic.fna
GCF_900604845.1_TTHNAR1_PlH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_900604845.1_TTHNAR1_PlH.fna
GCF_900637195.1_50279_F01_genomic	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/GCF_900637195.1_50279_F01_genomic.fna
Saccharolobus_solfataricus_PhH	/mnt/volume/LongReads/simulatedMock/originalAssemblyFASTA_highDiversity/CPhPlH/Saccharolobus_solfataricus_PhH.fna

metadata.tsv:

genome_ID	OTU	NCBI_ID	novelty_category
Ecoli_PhH	562	1045010	new_strain
GCA_003086655.1_ASM308665v1_CPhH	4932	4932	new_strain
GCA_014217455.1_ASM1421745v1_genomic	498019	498019	new_strain
GCF_000006985.1_ASM698v1_genomic	1097	194439	new_strain
GCF_000008625.1_ASM862v1_CPlH	63363	224324	new_strain
GCF_000019965.1_ASM1996v1_genomic	107709	452637	new_strain
GCF_000023865.1_ASM2386v1_CPhH	1852	471857	new_strain
GCF_000024625.1_ASM2462v1_CPlH	73913	579137	new_strain
GCF_000025685.1_ASM2568v1_CPhPlH	2246	309800	new_strain
GCF_000144645.1_ASM14464v1_genomic	291220	555079	new_strain
GCF_000172995.2_ASM17299v2_CPhPlH	60847	469382	new_strain
GCF_000179575.2_ASM17957v2_CPlH	155863	647113	new_strain
GCF_000183405.1_ASM18340v1_CPlH	477976	768670	new_strain
GCF_000214355.1_ASM21435v1_genomic	150829	545695	new_strain
GCF_000259255.1_ASM25925v1_CPlH	138563	182217	new_strain
GCF_000473245.1_ASM47324v1_CPhH	324767	1367477	new_strain
GCF_000739395.1_ASM73939v1_CPhH	96345	572261	new_strain
GCF_000828635.1_ASM82863v1_genomic	748811	1223802	new_strain
GCF_000833215.1_ASM83321v1_CPlH	28110	539329	new_strain
GCF_001293145.1_ASM129314v1_PlH	216816	216816	new_strain
GCF_001518775.1_ASM151877v1_PlH	644	644	new_strain
GCF_001543105.1_ASM154310v1_CPhH	87541	525247	new_strain
GCF_001549695.1_ASM154969v1_PlH	1355477	754504	new_strain
GCF_001729945.1_ASM172994v1_genomic	1525	264732	new_strain
GCF_002214545.1_ASM221454v1_CPhH	277988	277988	new_strain
GCF_002346025.1_ASM234602v1_CPhH	13373	13373	new_strain
GCF_002504385.1_ASM250438v1_genomic	2151	2151	new_strain
GCF_003019985.1_ASM301998v1_genomic	182710	182710	new_strain
GCF_003047065.1_ASM304706v1_CPhH	1579	1423717	new_strain
GCF_003253775.1_ASM325377v1_CPhH	1769	1769	new_strain
GCF_003491205.1_ASM349120v1_CPhPlH	146919	146919	new_strain
GCF_003667725.1_ASM366772v1_PlH	1176649	1176649	new_strain
GCF_003798325.1_ASM379832v1_CPhH	556499	556499	new_strain
GCF_003971565.1_ASM397156v1_PlH	1578	47770	new_strain
GCF_004168325.2_ASM416832v2_CPlH	542	264203	new_strain
GCF_007794935.1_ASM779493v1_CPhPlH	210	210	new_strain
GCF_007917035.2_ASM791703v3_PlH	1352	1352	new_strain
GCF_013046825.1_ASM1304682v1_genomic	154288	154288	new_strain
GCF_900604845.1_TTHNAR1_PlH	274	274	new_strain
GCF_900637195.1_50279_F01_genomic	1866885	1791	new_strain
Saccharolobus_solfataricus_PhH	2287	273057	new_strain

abundance.tsv:

Ecoli_PhH	0.024390244
GCA_003086655.1_ASM308665v1_CPhH	0.024390244
GCA_014217455.1_ASM1421745v1_genomic	0.024390244
GCF_000006985.1_ASM698v1_genomic	0.024390244
GCF_000008625.1_ASM862v1_CPlH	0.024390244
GCF_000019965.1_ASM1996v1_genomic	0.024390244
GCF_000023865.1_ASM2386v1_CPhH	0.024390244
GCF_000024625.1_ASM2462v1_CPlH	0.024390244
GCF_000025685.1_ASM2568v1_CPhPlH	0.024390244
GCF_000144645.1_ASM14464v1_genomic	0.024390244
GCF_000172995.2_ASM17299v2_CPhPlH	0.024390244
GCF_000179575.2_ASM17957v2_CPlH	0.024390244
GCF_000183405.1_ASM18340v1_CPlH	0.024390244
GCF_000214355.1_ASM21435v1_genomic	0.024390244
GCF_000259255.1_ASM25925v1_CPlH	0.024390244
GCF_000473245.1_ASM47324v1_CPhH	0.024390244
GCF_000739395.1_ASM73939v1_CPhH	0.024390244
GCF_000828635.1_ASM82863v1_genomic	0.024390244
GCF_000833215.1_ASM83321v1_CPlH	0.024390244
GCF_001293145.1_ASM129314v1_PlH	0.024390244
GCF_001518775.1_ASM151877v1_PlH	0.024390244
GCF_001543105.1_ASM154310v1_CPhH	0.024390244
GCF_001549695.1_ASM154969v1_PlH	0.024390244
GCF_001729945.1_ASM172994v1_genomic	0.024390244
GCF_002214545.1_ASM221454v1_CPhH	0.024390244
GCF_002346025.1_ASM234602v1_CPhH	0.024390244
GCF_002504385.1_ASM250438v1_genomic	0.024390244
GCF_003019985.1_ASM301998v1_genomic	0.024390244
GCF_003047065.1_ASM304706v1_CPhH	0.024390244
GCF_003253775.1_ASM325377v1_CPhH	0.024390244
GCF_003491205.1_ASM349120v1_CPhPlH	0.024390244
GCF_003667725.1_ASM366772v1_PlH	0.024390244
GCF_003798325.1_ASM379832v1_CPhH	0.024390244
GCF_003971565.1_ASM397156v1_PlH	0.024390244
GCF_004168325.2_ASM416832v2_CPlH	0.024390244
GCF_007794935.1_ASM779493v1_CPhPlH	0.024390244
GCF_007917035.2_ASM791703v3_PlH	0.024390244
GCF_013046825.1_ASM1304682v1_genomic	0.024390244
GCF_900604845.1_TTHNAR1_PlH	0.024390244
GCF_900637195.1_50279_F01_genomic	0.024390244
Saccharolobus_solfataricus_PhH	0.024390244

Most of the sequences are read and processed by CAMISIM without problems; the following error occurs for only a few:

ERROR message

Traceback (most recent call last):
  File "/mnt/volume/LongReads/Tools/NanoSim/src/simulator.py", line 716, in <module>
    main()
  File "/mnt/volume/LongReads/Tools/NanoSim/src/simulator.py", line 710, in main
    simulation(ref, out, dna_type, perfect, kmer_bias, max_readlength, min_readlength)
  File "/mnt/volume/LongReads/Tools/NanoSim/src/simulator.py", line 284, in simulation
    read_mutated = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias, False)
  File "/mnt/volume/LongReads/Tools/NanoSim/src/simulator.py", line 577, in mutate_read
    tmp_bases.remove(read[key + i])
IndexError: string index out of range

Traceback (most recent call last):
  File "/mnt/volume/LongReads/Tools/NanoSim/src/simulator.py", line 716, in <module>
    main()
  File "/mnt/volume/LongReads/Tools/NanoSim/src/simulator.py", line 710, in main
    simulation(ref, out, dna_type, perfect, kmer_bias, max_readlength, min_readlength)
  File "/mnt/volume/LongReads/Tools/NanoSim/src/simulator.py", line 377, in simulation
    read_mutated = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias)
  File "/mnt/volume/LongReads/Tools/NanoSim/src/simulator.py", line 577, in mutate_read
    tmp_bases.remove(read[key + i])
IndexError: string index out of range

Traceback (most recent call last):
  File "/mnt/volume/LongReads/Tools/NanoSim/src/simulator.py", line 716, in <module>
    main()
  File "/mnt/volume/LongReads/Tools/NanoSim/src/simulator.py", line 710, in main
    simulation(ref, out, dna_type, perfect, kmer_bias, max_readlength, min_readlength)
  File "/mnt/volume/LongReads/Tools/NanoSim/src/simulator.py", line 284, in simulation
    read_mutated = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias, False)
  File "/mnt/volume/LongReads/Tools/NanoSim/src/simulator.py", line 577, in mutate_read
    tmp_bases.remove(read[key + i])
IndexError: string index out of range

2021-01-19 17:45:23 ERROR: [GenomePreparation 7296803792] 3 commands returned errors!

Task failed with return code: 1, task: /mnt/volume/LongReads/Tools/NanoSim/src/simulator.py linear -n 19755 -r /mnt/volume/LongReads/simulatedMock/CAMISIM/out_nanosim_CPhPlEH_6Gb_seed202102070000000/source_genomes/GCF_003971565.1_ASM397156v1_PlH.fna -o /mnt/volume/LongReads/simulatedMock/CAMISIM/tmp/tmpt9Z1RD/2021.01.19_17.23.29_sample_0/reads/GCF_003971565.1_ASM397156v1_PlH -c tools/nanosim_profile/ecoli --seed 4060996387

Task failed with return code: 1, task: /mnt/volume/LongReads/Tools/NanoSim/src/simulator.py linear -n 19755 -r /mnt/volume/LongReads/simulatedMock/CAMISIM/out_nanosim_CPhPlEH_6Gb_seed202102070000000/source_genomes/GCF_001293145.1_ASM129314v1_PlH.fna -o /mnt/volume/LongReads/simulatedMock/CAMISIM/tmp/tmpt9Z1RD/2021.01.19_17.23.29_sample_0/reads/GCF_001293145.1_ASM129314v1_PlH -c tools/nanosim_profile/ecoli --seed 1287782089

Task failed with return code: 1, task: /mnt/volume/LongReads/Tools/NanoSim/src/simulator.py linear -n 19755 -r /mnt/volume/LongReads/simulatedMock/CAMISIM/out_nanosim_CPhPlEH_6Gb_seed202102070000000/source_genomes/Saccharolobus_solfataricus_PhH.fna -o /mnt/volume/LongReads/simulatedMock/CAMISIM/tmp/tmpt9Z1RD/2021.01.19_17.23.29_sample_0/reads/Saccharolobus_solfataricus_PhH -c tools/nanosim_profile/ecoli --seed 2012999401

So my question is: what might be causing the "IndexError: string index out of range", and how can I get CAMISIM to run for all my sequences?

Many thanks in advance and kind regards, Franzi

AlphaSquad commented 3 years ago

Hi Franzi,

thanks for your interest and your detailed description. From what you posted, NanoSim seems to be the culprit: the crashes occur within its simulation. Since it is a string index out of range, as you point out, my main guess is that the length of some sequences in the FASTA files is the problem. The NanoSim model used in CAMISIM was trained on a real data set, and the average read size is ~7,500 bases. I don't know what NanoSim does if the FASTA files contain sequences shorter than a read that is to be simulated. Could you check whether the files that crashed contain short sequences, and whether it is always the same files that fail (i.e. if you left these FASTA files out, would CAMISIM finish the dataset)? That would be appreciated. As an example, I checked one of them, GCF_001293145.1_ASM129314v1_PlH: it has a plasmid of length 2638, which might cause problems. If this turns out to be true, there is unfortunately not much I can do: either NanoSim would need a change, or these short contigs would need to be removed from the FASTA files. Thanks, Adrian
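A minimal sketch of the check suggested here, assuming plain-text (uncompressed) FASTA files. It lists the length of every record and flags those shorter than a threshold. The function names and the 7,500-base default are illustrative assumptions (the default only mirrors the average read size mentioned above); neither is part of CAMISIM or NanoSim.

```python
import sys


def fasta_lengths(path):
    """Return a list of (record_id, length) for every record in a FASTA file."""
    lengths = []
    name, size = None, 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    lengths.append((name, size))
                # The record ID is the first whitespace-delimited header token.
                fields = line[1:].split()
                name = fields[0] if fields else ""
                size = 0
            elif name is not None:
                size += len(line)
    if name is not None:
        lengths.append((name, size))
    return lengths


def short_records(path, min_length=7500):
    """Return the records shorter than min_length (an assumed threshold)."""
    return [(rec, n) for rec, n in fasta_lengths(path) if n < min_length]


if __name__ == "__main__":
    # Usage: python fasta_lengths.py genome1.fna genome2.fna ...
    for fasta in sys.argv[1:]:
        for rec, n in fasta_lengths(fasta):
            flag = "SHORT" if n < 7500 else "ok"
            print(f"{fasta}\t{rec}\t{n}\t{flag}")
```

Running it over the files that crashed and over a few that succeeded shows whether short records occur only in the failing files.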

FranziskaErber commented 3 years ago

Hello Adrian, thank you very much for your quick reply and help! In fact, two of the files that crashed contain sequences shorter than 7,500 bases, so I think sequence length could be the problem here. However, there were also problems with sequences of length 16662 and 26097; might these errors happen by chance? Even with these problematic short sequences in the data set, the CAMISIM simulation works well for the rest of the data set; only those few files are not used and don't produce any output. Thank you, Franzi
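The workaround Adrian mentions, removing short contigs from the FASTA files, could be sketched as follows. This is only an illustration: `read_fasta`, `filter_fasta` and the 7,500-base default are hypothetical names and values, not part of CAMISIM or NanoSim, and the failures reported for the 16662- and 26097-base sequences suggest that length may not be the whole explanation.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a plain-text FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(in_path, out_path, min_length=7500, width=80):
    """Write records with at least min_length bases to out_path.

    Returns (kept, dropped): two lists of header lines.
    min_length is an assumed threshold, not a NanoSim requirement.
    """
    kept, dropped = [], []
    with open(out_path, "w") as out:
        for header, seq in read_fasta(in_path):
            if len(seq) < min_length:
                dropped.append(header)
                continue
            out.write(header + "\n")
            # Re-wrap the sequence at a fixed line width.
            for start in range(0, len(seq), width):
                out.write(seq[start:start + width] + "\n")
            kept.append(header)
    return kept, dropped
```

Note that dropping plasmid or phage records changes the composition of the simulated community, so the list of dropped headers is worth keeping alongside the results.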

AlphaSquad commented 3 years ago

You could test the NanoSim run for the third file manually with a different seed to see whether it works. If it consistently crashes, then the problem might be something else.
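A hedged sketch of such a manual test: it builds one NanoSim call per seed and records the return code. The subcommand and flags (`linear`, `-n`, `-r`, `-o`, `-c`, `--seed`) are copied from the failed task lines in the log above; the function names and the placeholder paths are assumptions for illustration.

```python
import shlex
import subprocess


def build_nanosim_command(simulator, reference, out_prefix, profile, n_reads, seed):
    """Build the NanoSim call as an argument list.

    The subcommand and flags mirror the failed task lines in the log.
    """
    return [
        simulator, "linear",
        "-n", str(n_reads),
        "-r", reference,
        "-o", out_prefix,
        "-c", profile,
        "--seed", str(seed),
    ]


def run_with_seeds(seeds, simulator, reference, out_prefix, profile, n_reads):
    """Run NanoSim once per seed and return {seed: return code}."""
    results = {}
    for seed in seeds:
        cmd = build_nanosim_command(
            simulator, reference, f"{out_prefix}_seed{seed}", profile, n_reads, seed
        )
        proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        results[seed] = proc.returncode
    return results


if __name__ == "__main__":
    # Placeholder paths: replace them with the real locations before running.
    for seed in (1, 10, 20000000):
        cmd = build_nanosim_command(
            "path/to/NanoSim/src/simulator.py",
            "path/to/genome.fna",
            f"reads/genome_seed{seed}",
            "tools/nanosim_profile/ecoli",
            19755,
            seed,
        )
        print(" ".join(shlex.quote(part) for part in cmd))
```

As written, the script only prints the commands; calling `run_with_seeds` executes them. A non-zero return code for every seed would point away from chance and toward the input file itself.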

FranziskaErber commented 3 years ago

I've already used different seeds ranging from 10 to 20000000 and 200000000000000. For a run using the "problematic" sequences separately, I also tried a small seed of 1. But NanoSim still didn't process the 26097- or the 16662-base sequences. There was, however, output without any error for a 4184-base sequence.
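To test records one at a time, as described here, a multi-record FASTA file can be split into one file per record. The following is a minimal sketch; `split_fasta` and the output naming scheme are hypothetical and not part of CAMISIM or NanoSim.

```python
import re
from pathlib import Path


def split_fasta(in_path, out_dir):
    """Write each record of a FASTA file to its own file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    try:
        with open(in_path) as handle:
            for line in handle:
                if line.startswith(">"):
                    if out is not None:
                        out.close()
                    fields = line[1:].split()
                    record_id = fields[0] if fields else "record"
                    # Keep file names safe: replace unusual characters.
                    safe = re.sub(r"[^A-Za-z0-9._-]", "_", record_id)
                    path = out_dir / f"{len(written):03d}_{safe}.fna"
                    out = open(path, "w")
                    written.append(path)
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Each resulting file can then be passed to NanoSim on its own, which narrows a failure down to a single record.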