Open yanhui09 opened 1 year ago
Can you please report your version number and the exact command you used? How different is it from the expected number of reads?
I was using the latest version, installed from conda.
nanosim 3.1.0 hdfd78af_0 bioconda
I simulated reads from 100 to 10000. The number of total simulated reads (aligned/unaligned) is around 60-80% of the set value of --number. Below are the examples for -n 100
and -n 10000
-n 100
(base) [yanhui@g-12-l0002 simulate]$ seqkit stat 80_85_100/simulated_aligned_reads.fastq
file format type num_seqs sum_len min_len avg_len max_len
80_85_100/simulated_aligned_reads.fastq FASTQ DNA 78 103,982 1,023 1,333.1 1,522
(base) [yanhui@g-12-l0002 simulate]$ seqkit stat 80_85_100/simulated_unaligned_reads.fastq
file format type num_seqs sum_len min_len avg_len max_len
80_85_100/simulated_unaligned_reads.fastq FASTQ DNA 2 2,435 1,170 1,217.5 1,265
-n 10000
(base) [yanhui@g-12-l0002 simulate]$ seqkit stat 80_85_10000/simulated_aligned_reads.fastq
file format type num_seqs sum_len min_len avg_len max_len
80_85_10000/simulated_aligned_reads.fastq FASTQ DNA 6,191 8,233,481 616 1,329.9 1,522
(base) [yanhui@g-12-l0002 simulate]$ seqkit stat 80_85_10000/simulated_unaligned_reads.fastq
file format type num_seqs sum_len min_len avg_len max_len
80_85_10000/simulated_unaligned_reads.fastq FASTQ DNA 157 196,411 882 1,251 1,509
BTW,
(base) [yanhui@g-12-l0002 nanosim]$ head model/training_reads_alignment_rate
Aligned / Unaligned ratio: 62.77246376811594
nanosim
is included in one snakemake workflow.
The exact used command shall be as below:
MODEL=nanosim/model/training_model_profile
c=${MODEL//_model_profile/}
simulator.py genome -rg nanosim/silva/cls_ref/id_80_85/ref.fasta -c "$c" -o nanosim/silva/simulate/97_99_1000/simulated -n 100 -b guppy --fastq -t 20 -dna_type linear --seed 1234
I have checked the log. It looks the same as https://github.com/bcgsc/NanoSim/issues/114#issue-839940442
If I met IndexError: list index out of range
as above, the total counts are less than the set value.
If it runs without errors, the output is as expected.
Some of my references are highly similar. It may not be able to simulate adequate reads if the value is set too high. Will this be a possible explanation?
Also, can I still trust the simulated reads from IndexError
(although they are a bit less than expected)?
@kmnip
Thanks for the answers in advance.
Hey @yanhui09 Sorry for late reply and thank you for providing comprehensive information regarding your run and also looking into the issue in detail.
As far as I remember, the "IndexError: list index out of range" was a bug related to setting -min and -max parameters when simulating (#118) and I believe it is fixed now and I am surprised you encountered the similar error using the latest version. Do you get the IndexError exactly at the same step reported in #114 (head_vs_ht_ratio step)?
One possible reason for not getting expected number of simulated reads might be that the reference genome you are using is smaller than the read lengths you are trying to simulate. I would love to hear @kmnip thoughts on this as well.
If I met IndexError: list index out of range as above, the total counts are not less than the set value. If it runs with errors, the output is as expected.
Just to confirm, you are saying that when you encounter IndexError, the output simulated read counts are equal to expected But if there is no error given, the output simulated counts are NOT equal? Is that correct? How many times did you run the simulator? I wonder if you experienced similar outcomes from like 10 runs?
@SaberHQ Sorry I made some "wrong" sayings, it shall be
If I met
IndexError: list index out of range
as above, the total counts are less than the set value. The script can still finish and deliver the three outputs. And theindexError
is reported inhead_vs_ht_ratio
step as below.Traceback (most recent call last): File "/home/projects/cu_10168/people/yanhui/database/Kamp/conda_envs/54c244341bed953ee141f3c4bbfdfbea_/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/home/projects/cu_10168/people/yanhui/database/Kamp/conda_envs/54c244341bed953ee141f3c4bbfdfbea_/lib/python3.8/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/home/projects/cu_10168/people/yanhui/database/Kamp/conda_envs/54c244341bed953ee141f3c4bbfdfbea_/bin/simulator.py", line 1294, in simulation_aligned_genome head_vs_ht_ratio = head_vs_ht_ratio_list[each_read] IndexError: list index out of range
If it runs without errors, the output is as expected. When I simulate reads with log-normal distribution using
-med
and-sd
, there are no index errors for a relatively smaller-n
.
Actually, I don't set values for -min
and -max
, and the reference is 1.5kb, which is longer than the default min read length of 50 bp. But I kept getting the indexError
if I don't use -med
and -sd
.
And my reference is a fasta
file of multiple sequences in the same length but with different divergence. Will this affect the simulation?
Hi. I'm getting the same error. Were you able to resolve it @yanhui09 yanhui09?
I used this
#read simulation python /home/apps/NanoSim/src/simulator.py genome \ -rg $ref/pAL1128.fasta \ -c $outdir/BC${i}_training \ -o $outdir/BC${i}_simulation \ -n 20000 \ -max 2500 \ -min 1000 \ --strandness 1 \ --seed 100 \ -b guppy \ --fastq \ -dna_type linear \ -t 16
and got this:
Traceback (most recent call last): File "/home/nina/mambaforge/envs/nanosim/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap self.run() File "/home/nina/mambaforge/envs/nanosim/lib/python3.7/multiprocessing/process.py", line 99, in run self._target(*self._args, **self._kwargs) File "/home/apps/NanoSim/src/simulator.py", line 1309, in simulation_aligned_genome head_vs_ht_ratio = head_vs_ht_ratio_list[each_read] IndexError: list index out of range
Thank you in advance for your help.
Dear @npdungca Would you please confirm which version are you using? (I suggest you use the latest committed version from Github by the way). I will publish a new release within the next couple of weeks, so you might also try that if you want. It will contain a bunch of bug fixes.
Would you please also confirm whether you get the same error when not setting the min and max arguments? That will help us to narrow down the error source.
Thanks. Saber.
Hi, Saber.
I'm using the latest version from github. I just downloaded and installed nanosim last week. I'm getting this error message when running it without the min and max arguments:
2023-11-12 12:30:41: /home/apps/NanoSim/src/simulator.py genome -rg /data/Nina_Test/refs/pAL1128.fasta -c /data/Nina_Test/CMV/output/SSenr5_Altext/BAC_dilutions_duplex/BC10_training_no-model -o /data/Nina_Test/CMV/output/SSenr5_Altext/BAC_dilutions_duplex/BC10_simulation_nosizeparams -n 20000 --strandness 1 --seed 100 -b guppy --fastq -dna_type linear -t 8 2023-11-12 12:30:41: Read in reference 2023-11-12 12:30:41: Read error profile Traceback (most recent call last): File "/home/apps/NanoSim/src/simulator.py", line 2433, in <module> main() File "/home/apps/NanoSim/src/simulator.py", line 2194, in main read_profile(ref_g, number, model_prefix, perfect, args.mode, strandness, dna_type=dna_type, chimeric=chimeric) File "/home/apps/NanoSim/src/simulator.py", line 474, in read_profile with open(model_profile, 'r') as mod_profile: FileNotFoundError: [Errno 2] No such file or directory: '/data/Nina_Test/CMV/output/SSenr5_Altext/BAC_dilutions_duplex/BC10_training_no-model_model_profile' Command exited with non-zero status
Below are my commands: `#characterization of data set python /home/apps/NanoSim/src/read_analysis.py genome \ -i $outdir/BC${i}_duplex_CMVmapped.fastq \ -rg $ref/pAL1128.fasta \ -a minimap2 \ -o $outdir/BC${i}_training_no-model \ --no_model_fit \ -t 8
python /home/apps/NanoSim/src/simulator.py genome \ -rg $ref/pAL1128.fasta \ -c $outdir/BC${i}_training_no-model \ -o $outdir/BC${i}_simulation_nosizeparams \ -n 20000 \ --strandness 1 \ --seed 100 \ -b guppy \ --fastq \ -dna_type linear \ -t 8 ` Thanks again for your help.
@npdungca I couldn't resolve it. 😢 Instead I chose badread for my simulation.
Got it. I'll try that. Thank you, @yanhui09. I'll also wait for and try the latest version of nanosim. It might resolve it :)
Hi, Saber.
I'm using the latest version from github. I just downloaded and installed nanosim last week. I'm getting this error message when running it without the min and max arguments:
2023-11-12 12:30:41: /home/apps/NanoSim/src/simulator.py genome -rg /data/Nina_Test/refs/pAL1128.fasta -c /data/Nina_Test/CMV/output/SSenr5_Altext/BAC_dilutions_duplex/BC10_training_no-model -o /data/Nina_Test/CMV/output/SSenr5_Altext/BAC_dilutions_duplex/BC10_simulation_nosizeparams -n 20000 --strandness 1 --seed 100 -b guppy --fastq -dna_type linear -t 8 2023-11-12 12:30:41: Read in reference 2023-11-12 12:30:41: Read error profile Traceback (most recent call last): File "/home/apps/NanoSim/src/simulator.py", line 2433, in <module> main() File "/home/apps/NanoSim/src/simulator.py", line 2194, in main read_profile(ref_g, number, model_prefix, perfect, args.mode, strandness, dna_type=dna_type, chimeric=chimeric) File "/home/apps/NanoSim/src/simulator.py", line 474, in read_profile with open(model_profile, 'r') as mod_profile: FileNotFoundError: [Errno 2] No such file or directory: '/data/Nina_Test/CMV/output/SSenr5_Altext/BAC_dilutions_duplex/BC10_training_no-model_model_profile' Command exited with non-zero status
Below are my commands: `#characterization of data set python /home/apps/NanoSim/src/read_analysis.py genome -i outdir/BC{i}_duplex_CMVmapped.fastq -rg $ref/pAL1128.fasta -a minimap2 -o outdir/BC{i}_training_no-model --no_model_fit -t 8
read simulation python /home/apps/NanoSim/src/simulator.py genome -rg $ref/pAL1128.fasta -c outdir/BC{i}_training_no-model -o outdir/BC{i}_simulation_nosizeparams -n 20000 --strandness 1 --seed 100 -b guppy --fastq -dna_type linear -t 8 ` Thanks again for your help.
Dear @npdungca the error you reported here is not related to the IndexError
mentioned earlier. It is FileNotFoundError
meaning that the simulator couldn't read the required file. Please double-check the directory you are using.
The next release might happen within the next month or so. Meanwhile, try the latest committed version. Thanks.
Hi,
Thanks for the software to simulate nanopore reads. I get a bit confused while using the simulator.py with --number.
From the help message, --number suggests the number of reads to be simulated. However, I counted the number of the unaligned/aligned fastq sequences in the simulation output. The total number was not equal to the set "-n" value even when I added them up.
Do I miss something? Why are they not equal?
Thank you very much for the answers in advance. Yan