yanhui09 commented 1 year ago

Hi,

Thanks for the software to simulate nanopore reads. I get a bit confused while using the simulator.py with --number.

From the help message, --number suggests the number of reads to be simulated. However, I counted the number of the unaligned/aligned fastq sequences in the simulation output. The total number was not equal to the set "-n" value even when I added them up.

Do I miss something? Why are they not equal?

Thank you very much for the answers in advance. Yan

kmnip commented 1 year ago

Can you please report your version number and the exact command you used? How different is it from the expected number of reads?

yanhui09 commented 1 year ago

I was using the latest version, installed from conda.

nanosim                   3.1.0                hdfd78af_0    bioconda

I simulated reads from 100 to 10000. The number of total simulated reads (aligned/unaligned) is around 60-80% of the set value of --number. Below are the examples for -n 100 and -n 10000

-n 100

(base) [yanhui@g-12-l0002 simulate]$ seqkit stat  80_85_100/simulated_aligned_reads.fastq 
file                                     format  type  num_seqs  sum_len  min_len  avg_len  max_len
80_85_100/simulated_aligned_reads.fastq  FASTQ   DNA         78  103,982    1,023  1,333.1    1,522
(base) [yanhui@g-12-l0002 simulate]$ seqkit stat  80_85_100/simulated_unaligned_reads.fastq 
file                                       format  type  num_seqs  sum_len  min_len  avg_len  max_len
80_85_100/simulated_unaligned_reads.fastq  FASTQ   DNA          2    2,435    1,170  1,217.5    1,265

-n 10000

(base) [yanhui@g-12-l0002 simulate]$ seqkit stat  80_85_10000/simulated_aligned_reads.fastq 
file                                       format  type  num_seqs    sum_len  min_len  avg_len  max_len
80_85_10000/simulated_aligned_reads.fastq  FASTQ   DNA      6,191  8,233,481      616  1,329.9    1,522
(base) [yanhui@g-12-l0002 simulate]$ seqkit stat  80_85_10000/simulated_unaligned_reads.fastq 
file                                         format  type  num_seqs  sum_len  min_len  avg_len  max_len
80_85_10000/simulated_unaligned_reads.fastq  FASTQ   DNA        157  196,411      882    1,251    1,509

BTW,

I trained the error model by myself. Below is the training reads alignments rates

(base) [yanhui@g-12-l0002 nanosim]$ head model/training_reads_alignment_rate 
Aligned / Unaligned ratio:      62.77246376811594

The ref is a multi-sequence fasta file. The dna-type is linear. I sticked to default for the rest.

yanhui09 commented 1 year ago

nanosim is included in one snakemake workflow. The exact used command shall be as below:

MODEL=nanosim/model/training_model_profile
c=${MODEL//_model_profile/}
simulator.py genome -rg nanosim/silva/cls_ref/id_80_85/ref.fasta -c "$c" -o nanosim/silva/simulate/97_99_1000/simulated -n 100 -b guppy --fastq -t 20 -dna_type linear --seed 1234

yanhui09 commented 1 year ago

I have checked the log. It looks the same as https://github.com/bcgsc/NanoSim/issues/114#issue-839940442

If I met IndexError: list index out of range as above, the total counts are less than the set value. If it runs without errors, the output is as expected.

Some of my references are highly similar. It may not be able to simulate adequate reads if the value is set too high. Will this be a possible explanation?

Also, can I still trust the simulated reads from IndexError (although they are a bit less than expected)?

@kmnip

Thanks for the answers in advance.

SaberHQ commented 1 year ago

Hey @yanhui09 Sorry for late reply and thank you for providing comprehensive information regarding your run and also looking into the issue in detail.

As far as I remember, the "IndexError: list index out of range" was a bug related to setting -min and -max parameters when simulating (#118) and I believe it is fixed now and I am surprised you encountered the similar error using the latest version. Do you get the IndexError exactly at the same step reported in #114 (head_vs_ht_ratio step)?

One possible reason for not getting expected number of simulated reads might be that the reference genome you are using is smaller than the read lengths you are trying to simulate. I would love to hear @kmnip thoughts on this as well.

If I met IndexError: list index out of range as above, the total counts are not less than the set value. If it runs with errors, the output is as expected.

Just to confirm, you are saying that when you encounter IndexError, the output simulated read counts are equal to expected But if there is no error given, the output simulated counts are NOT equal? Is that correct? How many times did you run the simulator? I wonder if you experienced similar outcomes from like 10 runs?

yanhui09 commented 1 year ago

@SaberHQ Sorry I made some "wrong" sayings, it shall be

If I met IndexError: list index out of range as above, the total counts are less than the set value. The script can still finish and deliver the three outputs. And the indexError is reported in head_vs_ht_ratio step as below.
Traceback (most recent call last):
File "/home/projects/cu_10168/people/yanhui/database/Kamp/conda_envs/54c244341bed953ee141f3c4bbfdfbea_/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/projects/cu_10168/people/yanhui/database/Kamp/conda_envs/54c244341bed953ee141f3c4bbfdfbea_/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/projects/cu_10168/people/yanhui/database/Kamp/conda_envs/54c244341bed953ee141f3c4bbfdfbea_/bin/simulator.py", line 1294, in simulation_aligned_genome
head_vs_ht_ratio = head_vs_ht_ratio_list[each_read]
IndexError: list index out of range
If it runs without errors, the output is as expected. When I simulate reads with log-normal distribution using -med and -sd, there are no index errors for a relatively smaller -n .

Actually, I don't set values for -min and -max, and the reference is 1.5kb, which is longer than the default min read length of 50 bp. But I kept getting the indexError if I don't use -med and -sd.

And my reference is a fasta file of multiple sequences in the same length but with different divergence. Will this affect the simulation?

npdungca commented 7 months ago

Hi. I'm getting the same error. Were you able to resolve it @yanhui09 yanhui09?

I used this #read simulation python /home/apps/NanoSim/src/simulator.py genome \ -rg $ref/pAL1128.fasta \ -c $outdir/BC${i}_training \ -o $outdir/BC${i}_simulation \ -n 20000 \ -max 2500 \ -min 1000 \ --strandness 1 \ --seed 100 \ -b guppy \ --fastq \ -dna_type linear \ -t 16 and got this: Traceback (most recent call last): File "/home/nina/mambaforge/envs/nanosim/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap self.run() File "/home/nina/mambaforge/envs/nanosim/lib/python3.7/multiprocessing/process.py", line 99, in run self._target(*self._args, **self._kwargs) File "/home/apps/NanoSim/src/simulator.py", line 1309, in simulation_aligned_genome head_vs_ht_ratio = head_vs_ht_ratio_list[each_read] IndexError: list index out of range

Thank you in advance for your help.

SaberHQ commented 7 months ago

Dear @npdungca Would you please confirm which version are you using? (I suggest you use the latest committed version from Github by the way). I will publish a new release within the next couple of weeks, so you might also try that if you want. It will contain a bunch of bug fixes.

Would you please also confirm whether you get the same error when not setting the min and max arguments? That will help us to narrow down the error source.

Thanks. Saber.

npdungca commented 7 months ago

Hi, Saber.

I'm using the latest version from github. I just downloaded and installed nanosim last week. I'm getting this error message when running it without the min and max arguments: 2023-11-12 12:30:41: /home/apps/NanoSim/src/simulator.py genome -rg /data/Nina_Test/refs/pAL1128.fasta -c /data/Nina_Test/CMV/output/SSenr5_Altext/BAC_dilutions_duplex/BC10_training_no-model -o /data/Nina_Test/CMV/output/SSenr5_Altext/BAC_dilutions_duplex/BC10_simulation_nosizeparams -n 20000 --strandness 1 --seed 100 -b guppy --fastq -dna_type linear -t 8 2023-11-12 12:30:41: Read in reference 2023-11-12 12:30:41: Read error profile Traceback (most recent call last): File "/home/apps/NanoSim/src/simulator.py", line 2433, in <module> main() File "/home/apps/NanoSim/src/simulator.py", line 2194, in main read_profile(ref_g, number, model_prefix, perfect, args.mode, strandness, dna_type=dna_type, chimeric=chimeric) File "/home/apps/NanoSim/src/simulator.py", line 474, in read_profile with open(model_profile, 'r') as mod_profile: FileNotFoundError: [Errno 2] No such file or directory: '/data/Nina_Test/CMV/output/SSenr5_Altext/BAC_dilutions_duplex/BC10_training_no-model_model_profile' Command exited with non-zero status

Below are my commands: `#characterization of data set python /home/apps/NanoSim/src/read_analysis.py genome \ -i $outdir/BC${i}_duplex_CMVmapped.fastq \ -rg $ref/pAL1128.fasta \ -a minimap2 \ -o $outdir/BC${i}_training_no-model \ --no_model_fit \ -t 8

read simulation

python /home/apps/NanoSim/src/simulator.py genome \ -rg $ref/pAL1128.fasta \ -c $outdir/BC${i}_training_no-model \ -o $outdir/BC${i}_simulation_nosizeparams \ -n 20000 \ --strandness 1 \ --seed 100 \ -b guppy \ --fastq \ -dna_type linear \ -t 8 ` Thanks again for your help.

yanhui09 commented 7 months ago

@npdungca I couldn't resolve it. 😢 Instead I chose badread for my simulation.

npdungca commented 7 months ago

Got it. I'll try that. Thank you, @yanhui09. I'll also wait for and try the latest version of nanosim. It might resolve it :)

SaberHQ commented 4 months ago

Hi, Saber.

I'm using the latest version from github. I just downloaded and installed nanosim last week. I'm getting this error message when running it without the min and max arguments: 2023-11-12 12:30:41: /home/apps/NanoSim/src/simulator.py genome -rg /data/Nina_Test/refs/pAL1128.fasta -c /data/Nina_Test/CMV/output/SSenr5_Altext/BAC_dilutions_duplex/BC10_training_no-model -o /data/Nina_Test/CMV/output/SSenr5_Altext/BAC_dilutions_duplex/BC10_simulation_nosizeparams -n 20000 --strandness 1 --seed 100 -b guppy --fastq -dna_type linear -t 8 2023-11-12 12:30:41: Read in reference 2023-11-12 12:30:41: Read error profile Traceback (most recent call last): File "/home/apps/NanoSim/src/simulator.py", line 2433, in <module> main() File "/home/apps/NanoSim/src/simulator.py", line 2194, in main read_profile(ref_g, number, model_prefix, perfect, args.mode, strandness, dna_type=dna_type, chimeric=chimeric) File "/home/apps/NanoSim/src/simulator.py", line 474, in read_profile with open(model_profile, 'r') as mod_profile: FileNotFoundError: [Errno 2] No such file or directory: '/data/Nina_Test/CMV/output/SSenr5_Altext/BAC_dilutions_duplex/BC10_training_no-model_model_profile' Command exited with non-zero status

Below are my commands: `#characterization of data set python /home/apps/NanoSim/src/read_analysis.py genome -i outdir/BC{i}_duplex_CMVmapped.fastq -rg $ref/pAL1128.fasta -a minimap2 -o outdir/BC{i}_training_no-model --no_model_fit -t 8

read simulation python /home/apps/NanoSim/src/simulator.py genome -rg $ref/pAL1128.fasta -c outdir/BC{i}_training_no-model -o outdir/BC{i}_simulation_nosizeparams -n 20000 --strandness 1 --seed 100 -b guppy --fastq -dna_type linear -t 8 ` Thanks again for your help.

Dear @npdungca the error you reported here is not related to the IndexError mentioned earlier. It is FileNotFoundError meaning that the simulator couldn't read the required file. Please double-check the directory you are using.

The next release might happen within the next month or so. Meanwhile, try the latest committed version. Thanks.

bcgsc / NanoSim

The meaning of --number in simulator.py #170

read simulation

read simulation python /home/apps/NanoSim/src/simulator.py genome -rg $ref/pAL1128.fasta -c outdir/BC{i}_training_no-model -o outdir/BC{i}_simulation_nosizeparams -n 20000 --strandness 1 --seed 100 -b guppy --fastq -dna_type linear -t 8 ` Thanks again for your help.