HadrienG / InSilicoSeq

:rocket: A sequencing simulator
https://insilicoseq.readthedocs.io
MIT License

zlib.error #132

Closed ma-celik closed 4 years ago

ma-celik commented 4 years ago

Greetings Hadrien, I have been trying to install with pip and pip3. Although it says the install is complete, when I try to run it the iss command is not found. I have also installed it with conda, and I'm running this command:

iss generate --ncbi bacteria --n_genomes_ncbi 10 --model hiseq --output miseq_ncbi

It gives me the output below. The command works just fine on my other computer, which has Ubuntu 14.04, while this one has Ubuntu 18.04.

INFO:iss.app:Starting iss generate
INFO:iss.app:Using kde ErrorModel
INFO:iss.download:Searching for bacteria to download
INFO:iss.download:Downloading GCF_000204155.1
INFO:iss.download:Downloading GCF_008247605.1
INFO:iss.download:Downloading GCF_002591135.1
INFO:iss.download:Downloading GCF_008245065.1
INFO:iss.download:Downloading GCF_001735765.2
INFO:iss.download:Downloading GCF_002447735.1
INFO:iss.download:Downloading GCF_005221965.1
INFO:iss.download:Downloading GCF_900638625.1
INFO:iss.download:Downloading GCF_900322585.1
Traceback (most recent call last):
  File "/home/celik/anaconda3/envs/my_env/bin/iss", line 10, in <module>
    sys.exit(main())
  File "/home/celik/anaconda3/envs/my_env/lib/python3.7/site-packages/iss/app.py", line 542, in main
    args.func(args)
  File "/home/celik/anaconda3/envs/my_env/lib/python3.7/site-packages/iss/app.py", line 111, in generate_reads
    g, n, args.output + '_ncbi_genomes.fasta')
  File "/home/celik/anaconda3/envs/my_env/lib/python3.7/site-packages/iss/download.py", line 51, in ncbi
    assembly_to_fasta(url, output)
  File "/home/celik/anaconda3/envs/my_env/lib/python3.7/site-packages/iss/download.py", line 75, in assembly_to_fasta
    request.content, zlib.MAX_WBITS | 32).decode()
zlib.error: Error -3 while decompressing data: incorrect header check

Can you help me out with this?

HadrienG commented 4 years ago

Hi!

Firstly let's see what's up with the pip issue:

i have been trying to install with pip and pip3, although it says install is complete when i try to run it iss command not found

How did you install Python? From apt-get? If so, I think (I don't have Ubuntu, so you're going to have to check this yourself) pip does not install your packages' executables in /usr/bin but in /usr/local/lib/pythonX.x/bin/

Can you check if the directory above is in your PATH? If not, adding it should solve the issue.
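For anyone following along, here is a minimal sketch of one way to check this from Python itself. It is not part of InSilicoSeq. The function names are hypothetical, and the `bin` sub-directory of the user base is an assumption that holds for the usual POSIX layout of `pip install --user`.

```python
import os
import site
import sysconfig


def candidate_script_dirs():
    """Directories where pip usually places console scripts such as `iss`."""
    # Scripts directory of the interpreter running this snippet.
    dirs = [sysconfig.get_path("scripts")]
    # Assumption: POSIX layout, where `pip install --user` puts scripts
    # in <user base>/bin (the user base is typically ~/.local on Linux).
    dirs.append(os.path.join(site.getuserbase(), "bin"))
    return dirs


def missing_from_path(dirs, path_env):
    """Return the directories in `dirs` that are not listed in `path_env`."""
    on_path = {os.path.normpath(p) for p in path_env.split(os.pathsep) if p}
    return [d for d in dirs if os.path.normpath(d) not in on_path]


if __name__ == "__main__":
    path_env = os.environ.get("PATH", "")
    for directory in missing_from_path(candidate_script_dirs(), path_env):
        print(f"not on PATH: {directory}")
```

Run it with the same interpreter that pip used (for example `python3`), since each interpreter has its own scripts directory.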

Secondly, for the download issue:

request.content, zlib.MAX_WBITS | 32).decode() zlib.error: Error -3 while decompressing data: incorrect header check

This means that InSilicoSeq is trying to decompress data whose header (which is checked before decompressing) is not a valid gzip or zlib header. I'd say it's unlikely to be a bug in InSilicoSeq, since zlib.MAX_WBITS | 32 automatically detects which compression format is used.
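As an illustration of the header auto-detection, the sketch below uses only the standard library. The helper name and the uncompressed payload are made up for the example; this is not InSilicoSeq's download code.

```python
import gzip
import zlib


def decompress_auto(payload):
    """Decompress gzip- or zlib-wrapped bytes; the header type is auto-detected."""
    return zlib.decompress(payload, zlib.MAX_WBITS | 32)


fasta = b">seq1\nACGT\n"

# Both container formats are accepted by the same call.
assert decompress_auto(gzip.compress(fasta)) == fasta
assert decompress_auto(zlib.compress(fasta)) == fasta

# A body that is not compressed at all (hypothetical error page) is rejected
# during the header check, before any data is decompressed.
not_compressed = b"<html>Too Many Requests</html>"
try:
    decompress_auto(not_compressed)
    failed = False
except zlib.error as err:
    failed = True
    print(err)
```

So the error in the traceback is what one would expect whenever the response body is something other than a gzip or zlib stream.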

I could not reproduce the problem (on my Mac though), but I've encountered download issues in the past. I think this might happen if a file does not get downloaded properly?

Does this happen every time you run InSilicoSeq? If it consistently does not work, I'll test on ubuntu 18.x

Hope that helps, /Hadrien

ma-celik commented 4 years ago

Thanks for the quick reply Hadrien, I really appreciate it. Yeah, it was not in my PATH, sorry about that. Anyhow, I'm running into this after adding the pip directory to PATH:

Traceback (most recent call last):
  File "/home/celik/.local/bin/iss", line 7, in <module>
    from iss.app import main
SyntaxError: can not delete variable 'record' referenced in nested scope

But when I run it in the conda env, it gives me the same error as before. I have tried it on another computer with Ubuntu 18.04, no change.

HadrienG commented 4 years ago

"Traceback (most recent call last):
  File "/home/celik/.local/bin/iss", line 7, in <module>
    from iss.app import main
SyntaxError: can not delete variable 'record' referenced in nested scope"

It seems that a Python 3.2+ feature slipped past the tests (deleting a variable that is referenced in a nested scope), which means iss is being run with Python 2 here. I was officially going to drop Python 2 support in one month, but well... Make sure you are installing the Python 3 version. You might need to use pip3 instead of pip.
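To make the language difference concrete, here is a small sketch of the pattern. The function is invented for illustration and is not the code from InSilicoSeq.

```python
def first_record_length(records):
    """Return the length of the first record, then drop the local name."""
    record = records[0]

    def length():
        # `record` is referenced from a nested scope here.
        return len(record)

    result = length()
    # Accepted by Python 3.2 and later. Python 2 rejects this line at compile
    # time with "can not delete variable 'record' referenced in nested scope".
    del record
    return result


print(first_record_length(["ACGT", "AC"]))
```

Because the check happens at compile time, Python 2 fails as soon as the module is imported, which matches the traceback stopping at the import line.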

I'm installing ubuntu in a VM to debug the zlib issue, will report back here.

HadrienG commented 4 years ago

I could not reproduce your issue on a clean ubuntu 18.04 installation.

Can you share with me:

ma-celik commented 4 years ago

Sorry for troubling you Hadrien

Python version: 3.7.4

iss -v
iss version 1.4.4

python -c "import zlib; print(zlib.__version__)"
1.0

python -c "import requests; print(requests.__version__)"
2.22.0

If I may: the output of iss generate is R1 and R2. Is there any way to produce single-end reads instead? I have been trying to merge them with BBmerge/FLASH, but they only merge a very small fraction of them, around 0.04. Thanks a lot for your help.

HadrienG commented 4 years ago

There is currently no way to produce single-end data with InSilicoSeq. I guess you could keep only the forward fragments.

It makes sense that R1 and R2 did not merge well: mergers need the two reads to overlap, and since the insert sizes are usually >0, most pairs do not overlap.
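The arithmetic behind this can be sketched as follows. It assumes that "insert size" here means the unsequenced gap between the two reads, so the fragment is as long as both reads plus that gap. The read and fragment lengths in the usage lines are hypothetical, not values taken from an InSilicoSeq error model.

```python
def read_overlap(read_length, fragment_length):
    """Bases shared by R1 and R2 when both reads have `read_length` bases
    and start from opposite ends of a fragment of `fragment_length` bases."""
    overlap = 2 * read_length - fragment_length
    # The overlap cannot be negative or longer than the fragment itself.
    return max(0, min(fragment_length, overlap))


# Hypothetical numbers: 125 bp reads.
print(read_overlap(125, 200))  # fragment shorter than two reads: 50 bases overlap
print(read_overlap(125, 450))  # 200 bp gap between the reads: nothing to merge
```

Tools such as BBmerge and FLASH can only merge pairs of the first kind, which would explain the very low merge rate.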

As for the issue at hand, we have the same version numbers, so I have no idea why you are unable to download genomes with the --ncbi option.

ma-celik commented 4 years ago

Hadrien thanks a lot. I have been checking the read simulators. Yours is simple and fast. Just great. Thanks for this tool.

HadrienG commented 4 years ago

Thanks 😄 I'll close this for now, but don't hesitate to comment on the issue again if you find a solution to your --ncbi issue on Ubuntu 18.04

Best, Hadrien

arthurvinx commented 4 years ago

Hi @HadrienG, I faced the same problem today on ubuntu 19.10.

I installed iss, via conda, in a python 3.7.6 environment.

iss -v
iss version 1.4.5

python -c "import zlib; print(zlib.__version__)"
1.0

python -c "import requests; print(requests.__version__)"
2.22.0

I was trying to obtain 1,374 random bacteria sequences to generate the reads. The error occurred several times and I checked the number of sequences downloaded using grep "^>" -c <file>. The minimum number of sequences downloaded was 27, and the maximum was near 700.

I thought about removing the last sequence and merging the files, but the last RefSeq sequence identifier did not match any of those listed for the last RefSeq Assembly ID printed before the error. I tried to use --cpus 1 but the same thing occurred.

Is there a way to force sequential writes to the output?

And by the way, thanks for the software, it will help me a lot to do some benchmarks.

EDIT 1:

I checked the penultimate RefSeq Assembly ID printed before the error and it matched the last RefSeq sequence identifier found in the output. Thus, the download and write seem to be sequential, and there is no need to modify the end of the output. I will continue merging the downloaded files from each attempt.


EDIT 2:

@HadrienG , I ran iss generate --cpus 1 --model miseq -k bacteria -U 1374 --n_reads 27480 --output <file> --seed 1 twice.

Both outputs have 141 sequences, so the problem seems to be in the NCBI files. Handling this exception may be an easy fix.

HadrienG commented 4 years ago

This points to a rate limit of the NCBI Entrez API. I'll handle the exception in the next minor release, and I might also try to throttle the downloads above a certain number of genomes to prevent it from happening.
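A rough sketch of what handling the exception with a back-off could look like is given below. This is not the fix that shipped in InSilicoSeq. The function name, the `fetch` callable (any function returning the raw response bytes for a URL), and the retry and delay values are all assumptions made for illustration.

```python
import time
import zlib


def download_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Fetch `url` and decompress it, retrying when the body is not valid
    gzip/zlib data (for example when the server returned an error page)."""
    for attempt in range(1, retries + 1):
        payload = fetch(url)  # hypothetical callable returning raw bytes
        try:
            return zlib.decompress(payload, zlib.MAX_WBITS | 32).decode()
        except zlib.error:
            if attempt == retries:
                raise  # give up and surface the original error
            # Wait a little longer after each failed attempt.
            sleep(delay * attempt)
```

With the requests library, `fetch` could be something like `lambda u: requests.get(u).content`. Waiting between attempts also acts as a crude throttle when many genomes are requested.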

The NCBI querying and downloading is not multithreaded so --cpus should not change anything.

PS: if you want to download the same genomes every time, you should use --seed

Thanks a lot for your detailed bug report! /Hadrien

arthurvinx commented 4 years ago

Thanks @HadrienG.

I intended to download random sequences and I noticed that the abundance file, automatically created, keeps only unique IDs.

Thus, even without setting a seed, and after merging files that contain duplicated genomes, this file helped me keep track of the correct number of distinct genomes.

Also, thanks to the use of the sequence ID as header, it was easy to get the number of reads for each sequence.

iss is indeed a great, well-implemented piece of software.

HadrienG commented 4 years ago

Should be fixed (finally!) in 1.4.6