Gurdhhu / bioinf_scripts

10 stars 3 forks source link

input file from a local blast run #6

Open yeroslaviz opened 3 years ago

yeroslaviz commented 3 years ago

Hi,

I have run the magicblast tool to BLAST my fastq file. This gets me a tab-delimited output file with the fowllowing first three columns (and many more):

  1. Query/read sequence identifier
  2. Reference sequence identifier
  3. Percent identity of the alignment

and (for now) looks like that

# MAGICBLAST 1.5.0
# magicblast -query filter.fastq.gz -db BLASTdb/nt -infmt fastq -outfmt tabular -num_threads 5 -out magicblast.nt.txt 
# Fields: query acc.    reference acc.  % identity      not used        not used        not used        query start     query end       reference start reference end   not used        not used        score   query strand    reference strand        query length    BTOP  num placements   not used        compartment     left overhang   right overhang  mate reference  mate ref. start composite score
NB500982:283:HH3WJAFX2:1:11101:24257:1057       -       0       0       0       0       0       0       0       0       0       99      0       -       -       75      -       0       -       -       -       -       -       -       0
NB500982:283:HH3WJAFX2:1:11101:19494:1058       -       0       0       0       0       0       0       0       0       0       99      0       -       -       75      -       0       -       -       -       -       -       -       0
...
NB500982:283:HH3WJAFX2:1:11101:4399:1061        gi|31072033     97.2603 0       0       0       1       73      71005   71077   0       99      63      plus    plus    73      7CG5CG59        4       -       1:3     -       -       -       -       63
NB500982:283:HH3WJAFX2:1:11101:4399:1061        gi|37574168     97.2603 0       0       0       1       73      115081  115153  0       99      63      plus    plus    73      7CG5CG59        4       -       1:4     -       -       -       -       63
NB500982:283:HH3WJAFX2:1:11101:4399:1061        gi|239799573    97.2603 0       0       0       1       73      99045   98973   0       99      63      plus    minus   73      7CG5CG59        4       -       1:5     -       -       -       -       63
NB500982:283:HH3WJAFX2:1:11101:4399:1061        gi|636536526    97.2603 0       0       0       1       73      44595   44523   0       99      63      plus    minus   73      7CG5CG59        4       -       1:6     -       -       -       -       63
...

I can use awk to either reorder the columns or change it to a csv file format. I just need to understand better what the input should look like.

There is though a conflict between what is written in the description and your example input files, if I understand them correctly.

You are writing the accession number should be in the fourth column, but in you example input files, these numbers are in the second column. Where do I need to have them?

Another question is - do I need all the other columns? or can I just use the query and reference IDs?

I have run the executable with the output file i have now (after converting it to csv) and got the following error:

Fetching annotation for 42606806 accessions from GenBank...
Traceback (most recent call last):
  File "urllib/request.py", line 1319, in do_open
  File "http/client.py", line 1230, in request
  File "http/client.py", line 1276, in _send_request
  File "http/client.py", line 1225, in endheaders
  File "http/client.py", line 1004, in _send_output
  File "http/client.py", line 944, in send
  File "http/client.py", line 1399, in connect
  File "ssl.py", line 500, in wrap_socket
  File "ssl.py", line 1040, in _create
  File "ssl.py", line 1309, in do_handshake
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "annotate_blast_hits.py", line 112, in <module>
  File "annotate_blast_hits.py", line 72, in <module>
  File "annotate_blast_hits.py", line 52, in get
  File "site-packages/Bio/Entrez/__init__.py", line 199, in efetch
  File "site-packages/Bio/Entrez/__init__.py", line 569, in _open
  File "urllib/request.py", line 222, in urlopen
  File "urllib/request.py", line 525, in open
  File "urllib/request.py", line 542, in _open
  File "urllib/request.py", line 502, in _call_chain
  File "urllib/request.py", line 1362, in https_open
  File "urllib/request.py", line 1322, in do_open
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)>
[62293] Failed to execute script annotate_blast_hits

any Ideas?

thanks

Assa

Gurdhhu commented 3 years ago

Hi Assa! I have just updated the script, the executable, and their description in the readme, it should work with your input table now.

The script now accepts tables with different separators: comma, tabulation, semicolon.

Just remove all the lines where GenBank accession number is absent, and in the command, add an argument --column with the position of the column where GenBank numbers are stored (counting from zero). Default is --column 1 (that is, 2nd column).

About the SSL certificate error, I do not know what caused it, but after I updated the executable the error disappeared, try it now.

yeroslaviz commented 3 years ago

Thanks for the fast changes. I have downloaded the new annotate_blast_hits standalone and tried to run it. I have removed the three comment lines from the beginning of my tab-delimited file, which now look like that at the top:

$head  magicblast.nt.noComments.txt
NB500982:283:HH3WJAFX2:1:11101:24257:1057       -       0       0       0       0       0       0       0       0       0       99      0       -       -       75      -       0       -       -       -       -       -       -       0
NB500982:283:HH3WJAFX2:1:11101:19494:1058       -       0       0       0       0       0       0       0       0       0       99      0       -       -       75      -       0       -       -       -       -       -       -       0
...

But I get this error:

Error loading Python lib '/tmp/_MEIp2FtJz/libpython3.8.so.1.0': dlopen: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by /tmp/_MEIp2FtJz/libpython3.8.so.1.0)`

Do I need to install some packages?

When trying to explicitly running it with python3 I get this error

$ python3 annotate_blast_hits magicblast.nt.noComments.txt n my@mailaddress -c 1
SyntaxError: Non-UTF-8 code starting with '\xe8' in file /local/Assa/projects/Tachibana/Chad/P286/annotate_blast_hits on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Gurdhhu commented 3 years ago

The Pyinstaller documentation says:

The executable that PyInstaller builds is not fully static, in that it still depends on the system libc. Under Linux, the ABI of GLIBC is backward compatible, but not forward compatible. So if you link against a newer GLIBC, you can't run the resulting executable on an older system.

So it seems that your system has an older version of GLIBC than the environment where I compiled the standalone (it has GNU libc 2.31).

I guess the easiest solution for you would be to install package Biopython and run the script file annotate_blast_hits.py instead of the standalone. Don't forget to make it executable: chmod +x annotate_blast_hits.py

In your input table, you would also need to remove the lines where you have "-" instead of GenBank accession number: the script assumes that you have GB accession in every line.

I hope this will work for you. Best, Oleg

yeroslaviz commented 3 years ago

Thanks, I think I got one step closer to running the script :-)

Unfortunately my server have either Xenial or Bionic as with older GLIBC versions. I guess you are running Ubuntu 20?

I have done the changes, and cloned the repository to have the newest version of it. I have installed the Biopython in my conda environment and tried to run the command again.

It had connected to GeneBank, but when trying to get the annotations is throws a new error:

$ python bioinf_scripts/annotate_blast_hits.py  magicblast.nt.giIDfiltered.txt n my@mailaddress -c 1
Column separator: tabulation
Fetching annotation for 9729793 accessions from GenBank...
Traceback (most recent call last):
  File "bioinf_scripts/annotate_blast_hits.py", line 138, in <module>
    raise err
  File "bioinf_scripts/annotate_blast_hits.py", line 94, in <module>
    get(ids_to_fetch, step)
  File "bioinf_scripts/annotate_blast_hits.py", line 78, in get
    out.write(handle.read())
TypeError: write() argument must be str, not bytes

It is something in my file again?

thanks Assa

Gurdhhu commented 3 years ago

I'm using Manjaro Linux 19.0.2 (quite old already as well)

Wow, it's not getting easier :) I currently have no clue on what causes this error - it never occurred with any of my test data. Can you try to run it with the test input to see if it works? https://raw.githubusercontent.com/Gurdhhu/bioinf_scripts/master/blastn_example_input.csv

yeroslaviz commented 3 years ago

Good Morning, sorry for the late reply. Unfortunately it gives me the same error.

$ python bioinf_scripts/annotate_blast_hits.py  bioinf_scripts/blastn_example_input.csv n my@mailaddress -c 1
Column separator: comma
Fetching annotation for 150 accessions from GenBank...
Traceback (most recent call last):
  File "/local/Assa/projects/Tachibana/Chad/P286/bioinf_scripts/annotate_blast_hits.py", line 138, in <module>
    raise err
  File "/local/Assa/projects/Tachibana/Chad/P286/bioinf_scripts/annotate_blast_hits.py", line 100, in <module>
    get(ids_to_fetch, step)
  File "/local/Assa/projects/Tachibana/Chad/P286/bioinf_scripts/annotate_blast_hits.py", line 78, in get
    out.write(handle.read())
TypeError: write() argument must be str, not bytes
yeroslaviz commented 3 years ago

Hi again,

so I have found an indirect solution. I created a new virtual machine specifically for that reason with Ubuntu 20.04, this one has now GLIBC_2.31.

I have tested your newest github version. The python script throws still the same error as above, but the executable now runs you example data. I am now running my file, which is a bit bigger, but for now no errors.

So it seems there is something wrong with the python script itself. Which is strange, because I would have guessed that it is the same one within the executable.

I'll keep you up-to-date, if I get new errors.

thanks for all the help.

Assa

Gurdhhu commented 3 years ago

Hi Assa,

I'm glad that you were able to make it work in spite of having to apply indirect solutions. The standalone was compiled based on the latest script version, so it is really weird that the script and the standalone behave differently. I was not able to reproduce this TypeError on my machine, and I still have no good idea about what causes it.

I hope you will not have any more problems with the script, but if you will, don't hesitate to contact me.

Best, Oleg

yeroslaviz commented 3 years ago

Hi,

I don't know id it is the intended solution, but by adding the binary flag the python script is working now

in line 77

    with open(tmpfile, "wb") as out:  # Writing a temporary file that will be removed after parsing

and in line 146

        with open(tmp, "rb") as f:

now it works

Gurdhhu commented 3 years ago

That's interesting - I still do not understand why on your server handle.read() for the output of the function Entrez.efetch with retmode='xml' returns binary data instead of xml. Anyway, good that you have found a solution. I hope you finally were able to annotate your huge input table.

yeroslaviz commented 3 years ago

Good morning, unfortunately it still throws some errors,

I have splitted the BLAST results in 10 subsets to be able to do it in parallel, but it now throws the following error

$ python bioinf_scripts/annotate_blast_hits.py  NEB_adapter/WK56.R2.unmapped.magicblast.nt.giIDfiltered_subset08.txt n yeroslaviz@biochem.mpg.de -c 1
Column separator: tabulation
Fetching annotation for 1000000 accessions from GenBank...
Traceback (most recent call last):
  File "/home/yeroslaviz/miniconda3/lib/python3.6/http/client.py", line 556, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/home/yeroslaviz/miniconda3/lib/python3.6/http/client.py", line 523, in _read_next_chunk_size
    return int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/yeroslaviz/miniconda3/lib/python3.6/http/client.py", line 573, in _readall_chunked
    chunk_left = self._get_chunk_left()
  File "/home/yeroslaviz/miniconda3/lib/python3.6/http/client.py", line 558, in _get_chunk_left
    raise IncompleteRead(b'')
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bioinf_scripts/annotate_blast_hits.py", line 138, in <module>
    raise err
  File "bioinf_scripts/annotate_blast_hits.py", line 94, in <module>
    get(ids_to_fetch, step)
  File "bioinf_scripts/annotate_blast_hits.py", line 78, in get
    out.write(handle.read())
  File "/home/yeroslaviz/miniconda3/lib/python3.6/http/client.py", line 466, in read
    return self._readall_chunked()
  File "/home/yeroslaviz/miniconda3/lib/python3.6/http/client.py", line 580, in _readall_chunked
    raise IncompleteRead(b''.join(value))
http.client.IncompleteRead: IncompleteRead(1074161172 bytes read)

Any Ideas, what could have cause this?

thaks Assa