GDKO / AvP

Automatic evaluation of HGTs
GNU General Public License v3.0

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 2: invalid continuation byte #5

Closed alexvasilikop closed 1 year ago

alexvasilikop commented 1 year ago

Hello,

I am trying to run the prepare module, but I get the following error:

    $ ./avp prepare -a proteins.blastp.diamond.txt_ai.out -o results -f proteins.fa -b blastp.diamond.txt -x groups.yaml -c config.yaml

Output:

    [+] Setting up
    [!] Selected 6750 HGT candidates
    [+] Parsing Blast file and grouping similar queries
    [!] Formed 4056 groups
    [+] Extracting hits from DB
    Traceback (most recent call last):
      File "./avp", line 6, in <module>
        main()
      File "/media/urbe/MyCDrive1/Alex/AVP/AvP/depot/interface.py", line 29, in main
        prepare.main()
      File "/media/urbe/MyCDrive1/Alex/AVP/AvP/depot/prepare.py", line 229, in main
        for record in SeqIO.parse(handle,"fasta"):
      File "/home/urbe/anaconda3/envs/avp/lib/python3.7/site-packages/Bio/SeqIO/Interfaces.py", line 74, in __next__
        return next(self.records)
      File "/home/urbe/anaconda3/envs/avp/lib/python3.7/site-packages/Bio/SeqIO/FastaIO.py", line 198, in iterate
        for title, sequence in SimpleFastaParser(handle):
      File "/home/urbe/anaconda3/envs/avp/lib/python3.7/site-packages/Bio/SeqIO/FastaIO.py", line 47, in SimpleFastaParser
        for line in handle:
      File "/home/urbe/anaconda3/envs/avp/lib/python3.7/codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 2: invalid continuation byte

The content of the config.yaml file is the following:

    max_threads: 10

    # DB path
    sp_fasta_path: /media/urbe/MyCDrive1/Alex/AVP/uniref50/uniref50.fasta.dmnd
    nr_db_path: /media/urbe/MyCDrive1/Alex/09.BlobTools/blobtoolkit/nt

    ## Algorithm options
    # prepare
    ai_cutoff: 0
    percent_identity: 100
    cutoffextend: 20 # when toi hit is found, we take this hit + n hits
    trimal: false
    min_num_hits: 4 # select queries with at least that many blast hits
    percentage_similar_hits: 0.7 # group queries based on this
    mode: sp # use nr for nr database, use sp for swissprot database

    # detect, clasify, evaluate
    fastml: true # Use fasttree instead of IQTree
    node_support: 0 # nodes below that number will collapse
    complex_per_toi: 20 # if H/(H+T) smaller than this then node is considered T
    complex_per_hgt: 80 # if H/(H+T) greater than this then node is considered H
    complex_per_node: 90 # if node contains percent number of this category, it is assigned

    # Program specific options
    mafft_options: '--anysymbol --auto'
    trimal_options: '-automated1'

    #IQ-Tree
    iqmodel: '-mset WAG,LG,JTT,DCMUT,JTTDCMUT -AICc -mrate E,I,G,I+G,R -madd LG4X'
    ufbootstrap: 1000
    iq_threads: 10

What might be the issue? Thanks

GDKO commented 1 year ago

Hi,

When using a SwissProt, UniRef, or custom database, sp_fasta_path should point to the FASTA file that was used to create the database, not to the .dmnd database built from it.

As for the mode, see https://github.com/GDKO/AvP/wiki/Setting-up#Databases. If you used a custom-built database or ur90, please set the mode option in the config file to ur90.
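For anyone hitting the same traceback: the decode error is what you would expect when a binary file is opened in text mode and handed to a FASTA parser. Below is a minimal sketch of a pre-flight check you could run on the path before starting the prepare step. The function names `looks_like_fasta_bytes` and `looks_like_fasta` are hypothetical helpers, not part of AvP or Biopython, and the check only inspects the first few kilobytes, so it is a sanity test rather than a full validation.

```python
import codecs
import os
import tempfile


def looks_like_fasta_bytes(sample):
    """Return True if the bytes decode as UTF-8 and start with a '>' header."""
    # The incremental decoder tolerates a multi-byte character that is cut
    # off at the end of the sample, but still raises on truly invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        text = decoder.decode(sample)
    except UnicodeDecodeError:
        return False
    return text.lstrip().startswith(">")


def looks_like_fasta(path, sample_size=4096):
    """Check the first `sample_size` bytes of the file at `path`."""
    # Binary mode avoids the UnicodeDecodeError raised by text-mode reads.
    with open(path, "rb") as handle:
        return looks_like_fasta_bytes(handle.read(sample_size))


if __name__ == "__main__":
    # Made-up bytes standing in for a binary database file (assumption:
    # these are NOT the real contents of any DIAMOND file).
    fake_binary = b"ab\xee\x00\x00 not text"
    fake_fasta = b">seq1 example\nMKVLAA\n"

    with tempfile.TemporaryDirectory() as tmp:
        for name, data in (("db.dmnd", fake_binary), ("db.fasta", fake_fasta)):
            path = os.path.join(tmp, name)
            with open(path, "wb") as out:
                out.write(data)
            print(name, "looks like FASTA:", looks_like_fasta(path))
```

If the check returns False for the path in sp_fasta_path, the path most likely points at the built database rather than the FASTA file it was created from.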

Cheers, Georgios

alexvasilikop commented 1 year ago

Hi Georgie,

Thanks for the reply. I fixed the database path and the prepare run finished. However, I get the following warnings:

    [+] Setting up
    [!] Selected 6750 HGT candidates
    [+] Parsing Blast file and grouping similar queries
    [!] Formed 4056 groups
    [+] Extracting hits from DB
    [+] Writing fasta files
    /home/urbe/anaconda3/envs/avp/lib/python3.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py:243: UserWarning: taxid 1472165 was translated into 128442
      warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
    /home/urbe/anaconda3/envs/avp/lib/python3.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py:243: UserWarning: taxid 1535326 was translated into 5475
      warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
    /home/urbe/anaconda3/envs/avp/lib/python3.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py:243: UserWarning: taxid 206315 was translated into 2823264
      warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
    /home/urbe/anaconda3/envs/avp/lib/python3.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py:243: UserWarning: taxid 1391700 was translated into 84754
      warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
    /home/urbe/anaconda3/envs/avp/lib/python3.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py:243: UserWarning: taxid 441943 was translated into 2961670
      warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
    /home/urbe/anaconda3/envs/avp/lib/python3.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py:243: UserWarning: taxid 2829818 was translated into 2952571
      warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
    /home/urbe/anaconda3/envs/avp/lib/python3.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py:243: UserWarning: taxid 419596 was translated into 52021
      warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
    [!] Skipped 0 hits and 0 taxids.
    [+] Aligning fasta files
    [x] 100%
    [!] Finished with 6750 HGT candidates in 4056 groups

Is this something to just ignore?

Thanks again, Alex

GDKO commented 1 year ago

Hi Alex,

These warnings indicate that some taxids were changed in the NCBI taxonomy after the database you used was created. They are safe to ignore.
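If the repeated messages clutter the log, one option is to filter just this warning using Python's standard `warnings` module. The sketch below is an illustration only: `silence_taxid_translation_warnings` is a hypothetical helper, not something AvP provides, and it would have to be called in the same Python process before the taxonomy lookups run. The regular expression is an assumption based on the message text shown in the log.

```python
import warnings


def silence_taxid_translation_warnings():
    """Ignore only the 'taxid X was translated into Y' UserWarning."""
    # The message pattern is matched against the start of the warning text;
    # other UserWarnings are left untouched.
    warnings.filterwarnings(
        "ignore",
        message=r"taxid \d+ was translated into \d+",
        category=UserWarning,
    )


if __name__ == "__main__":
    silence_taxid_translation_warnings()
    # Filtered out: matches the pattern above.
    warnings.warn("taxid 1472165 was translated into 128442", UserWarning)
    # Still shown: a different UserWarning.
    warnings.warn("some unrelated warning", UserWarning)
```

Filtering by message rather than silencing all warnings keeps any other, possibly meaningful, warnings visible.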

Cheers, Georgios

alexvasilikop commented 1 year ago

Great, many thanks!