HadrienG / InSilicoSeq

:rocket: A sequencing simulator
https://insilicoseq.readthedocs.io
MIT License
184 stars 32 forks source link

iss generate simulates fewer reads than requested #194

Closed sergioSEa closed 3 years ago

sergioSEa commented 3 years ago

Dear Hadrien,

Thank you very much for developing this tool. I had been using it for some time when I realised that the number of reads that the tool is generating does not correspond to the number of reads I specify with n_reads. I did realised about this issue using some complex designs, but I still see the same behaviour in the simpler cases. For instance, let's use this reference genome as an example: https://www.ncbi.nlm.nih.gov/assembly/GCF_005845145.1

Here we have multiple contigs, so in principle I would like to use the "--draft" command, but we can also assume that each of the contigs is in fact a complete assembly, and provide the "--genome" command. The result is the same.

Here there are some examples:

Using the bioconda installation (v1.5.1)

iss generate --n_reads 1000 --draft Bacteroides_ovatus.fa  --model hiseq --output TEST_sim --seed 10 --cpus 1
wc -l TEST_sim_R1.fastq
1936

(assuming that the 1,000 reads are divided between the two files, I would expect to get 2,000 lines)

If I repeat the process with the --genome command:

wc -l TEST_sim_R1.fastq
1916

I wondered if it was due to the conda version, so I tried directly with the Git repository.

python InSilicoSeq/iss/__main__.py -n_reads 1000 --draft Bacteroides_ovatus.fa  --model hiseq --output TEST_sim --seed 10 --cpus 1 
wc -l TEST_sim_R1.fastq
1936

This is an example in which there were not that many reads missing, but I have run this over multiple files and I find a bigger percentage of missing reads. Any clue of what is going on? Also tried with --gc_bias and there is no difference.

Here there is the list of installed python packages, which should be the only dependence. Python itself is v3.7.7

pip list:

Package      Version
------------ -------------------
biopython    1.77
brotlipy     0.7.0
certifi      2020.6.20
cffi         1.14.2
chardet      3.0.4
cryptography 3.0
future       0.18.2
idna         2.10
InSilicoSeq  1.5.1
joblib       0.16.0
mkl-fft      1.1.0
mkl-random   1.1.1
mkl-service  2.3.0
numpy        1.19.1
pip          20.2.2
pycparser    2.20
pyOpenSSL    19.1.0
pysam        0.15.3
PySocks      1.7.1
requests     2.24.0
scipy        1.5.2
setuptools   49.6.0.post20200814
six          1.15.0
urllib3      1.25.10
wheel        0.35.1

Thank you very much for your help,

Cheers,

Sergio.