Closed FranziskaErber closed 3 years ago
Hi Franzi,
thanks for your interest and your detailed description. From what you posted, NanoSim seems to be the culprit, and the crashes occur within its simulation. Since it is a string index out of range, as you point out, my main guess is that the length of some sequences in the fasta files is the problem. The NanoSim model used in CAMISIM was trained on a real data set and the average read size is ~7,500 bases. I don't know what NanoSim does if the fasta files contain sequences that are shorter than the read to be simulated. Could you check whether the particular files which crashed contain some short sequences, and whether it is always the same files (i.e. if you left these fasta files out, would CAMISIM finish the dataset)? That would be appreciated. I tested one as an example: GCF_001293145.1_ASM129314v1_PlH has a plasmid of length 2638, which might cause problems. If this turns out to be true, there is unfortunately not much I can do: either NanoSim would need a change, or these short contigs would need to be removed from the fasta files. Thanks, Adrian
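The check suggested above (which records are shorter than a typical simulated read, and what a file looks like with them removed) can be sketched in a few lines of Python. This is only an illustrative helper, not part of CAMISIM or NanoSim: the function names `read_fasta`, `short_records` and `filter_fasta` are hypothetical, and the 7,500-base cutoff is an assumption taken from the average read size mentioned above, not a limit documented by NanoSim.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumption: use the ~7,500-base average read size from the thread as cutoff.
MIN_LENGTH = 7500


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def short_records(lines: Iterable[str],
                  min_length: int = MIN_LENGTH) -> List[Tuple[str, int]]:
    """Return (header, length) for every record shorter than min_length."""
    return [(h, len(s)) for h, s in read_fasta(lines) if len(s) < min_length]


def filter_fasta(lines: Iterable[str], min_length: int = MIN_LENGTH,
                 width: int = 80) -> List[str]:
    """Return FASTA lines containing only records of at least min_length."""
    out = []
    for header, seq in read_fasta(lines):
        if len(seq) >= min_length:
            out.append(">" + header)
            # Re-wrap the sequence at a fixed line width.
            out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return out
```

To use it on a real file, open the file and pass the handle, for example `short_records(open("genome.fna"))`, where `genome.fna` is a placeholder name. Writing the result of `filter_fasta` to a new file, one line per entry, gives a copy without the short contigs that can be tried in place of the original.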
Hello Adrian, Thank you very much for your quick reply and help! In fact, two of the files that crashed contain sequences shorter than 7,500 bases, so I think sequence length could be the problem here. However, there were also problems with sequences of length 16662 and 26097; could those errors happen by chance? Even with these problematic short sequences in the data set, the CAMISIM simulation works well for the rest of the data set. Only the few affected files are not used and do not produce any output. Thank you Franzi
You could test the NanoSim run for the third file manually with a different seed to see whether it works. If it consistently crashes, then the problem might be something else.
I've already used different seeds varying from 10 to 20000000 and 200000000000000. For a run using the "problematic" sequences separately, I also tried a small seed of 1. But NanoSim still did not process the sequences of length 26097 or 16662. There was, however, output without any error for a sequence of length 4184.
Hello dear @AlphaSquad, I would like to perform simulations with CAMISIM to generate nanopore data. I was able to download CAMISIM and run the script with the default files successfully. Fortunately, this also worked fine when using 6 sequence files. Now I would like to use around 35 FASTA sequence files [.fna] for this. They vary in composition: the files can contain chromosomal, plasmid and/or phage DNA sequences in varying numbers. I provide the following:
Most of the sequences are read and processed by CAMISIM without problems; the following error occurs only with a few:
So my question would be: what might be causing the "IndexError: string index out of range", and how can I get CAMISIM to run for all my sequences?
Many thanks in advance and kind regards, Franzi