antismash / antismash

antiSMASH
https://antismash.secondarymetabolites.org
GNU Affero General Public License v3.0

antiSMASH UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff #501

Closed Gian77 closed 1 year ago

Gian77 commented 1 year ago

Hello,

Describe the bug: I am running antiSMASH over a set of assembled genomes and I am getting a strange error on some of them. It runs fine on most, but gives me this error on a few. Please see the error below.

========== antismash for file: /mnt/home/benucci/project_82_genomes/data/PvP119-Illumina_Bacillus_subtilis_52634.2.402975.CAACGGAT-CAACGGAT_results/PROKKA_10202022.faa ==========

version of antismash: antiSMASH 6.1.1
INFO     23/11 17:21:05   antiSMASH version: 6.1.1
INFO     23/11 17:21:05   diamond using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/diamond (0.9.24)
INFO     23/11 17:21:05   hmmpfam2 using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/hmmpfam2 (2.3.2)
INFO     23/11 17:21:05   fasttree using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/fasttree
INFO     23/11 17:21:05   hmmsearch using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/hmmsearch (3.1b2)
INFO     23/11 17:21:05   hmmpress using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/hmmpress (3.1b2)
INFO     23/11 17:21:05   hmmscan using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/hmmscan (3.1b2)
INFO     23/11 17:21:05   meme using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/meme (4.11.2)
INFO     23/11 17:21:05   fimo using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/fimo (4.11.2)
INFO     23/11 17:21:05   glimmerhmm using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/glimmerhmm
INFO     23/11 17:21:05   prodigal using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/prodigal (V2.6.3)
INFO     23/11 17:21:05   muscle using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/muscle (v3.8.1551)
INFO     23/11 17:21:05   java using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/java (11.0.13)
INFO     23/11 17:21:05   blastp using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/blastp (2.5.0+)
INFO     23/11 17:21:05   makeblastdb using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/makeblastdb (2.5.0+)
INFO     23/11 17:21:05   Parsing input sequence '/mnt/home/benucci/project_82_genomes/data/PvP119-Illumina_Bacillus_subtilis_52634.2.402975.CAACGGAT-CAACGGAT_results/PROKKA_10202022.fna'
WARNING  23/11 17:21:07   Fasta header too long: renamed "gnl|AIT|--prefix_1" to "c00001_gnl|AIT.."
WARNING  23/11 17:21:07   Fasta header too long: renamed "gnl|AIT|--prefix_2" to "c00002_gnl|AIT.."
WARNING  23/11 17:21:07   Fasta header too long: renamed "gnl|AIT|--prefix_4" to "c00003_gnl|AIT.."
WARNING  23/11 17:21:07   Fasta header too long: renamed "gnl|AIT|--prefix_5" to "c00004_gnl|AIT.."
WARNING  23/11 17:21:07   Fasta header too long: renamed "gnl|AIT|--prefix_6" to "c00005_gnl|AIT.."
WARNING  23/11 17:21:07   Fasta header too long: renamed "gnl|AIT|--prefix_7" to "c00006_gnl|AIT.."
WARNING  23/11 17:21:07   Fasta header too long: renamed "gnl|AIT|--prefix_8" to "c00007_gnl|AIT.."
WARNING  23/11 17:21:07   Fasta header too long: renamed "gnl|AIT|--prefix_9" to "c00008_gnl|AIT.."
WARNING  23/11 17:21:07   Fasta header too long: renamed "gnl|AIT|--prefix_10" to "c00009_gnl|AIT.."
WARNING  23/11 17:21:07   Fasta header too long: renamed "gnl|AIT|--prefix_11" to "c00010_gnl|AIT.."
WARNING  23/11 17:21:07   Fasta header too long: renamed "gnl|AIT|--prefix_12" to "c00011_gnl|AIT.."
INFO     23/11 17:21:09   No genes found, skipping record
INFO     23/11 17:21:09   No genes found, skipping record
INFO     23/11 17:21:09   No genes found, skipping record
INFO     23/11 17:21:10   Analysing record: c00003_gnlAIT..
INFO     23/11 17:21:10   Detecting secondary metabolite clusters
INFO     23/11 17:21:10   Running antismash.detection.hmm_detection
INFO     23/11 17:21:10   HMM detection using strictness: relaxed
INFO     23/11 17:21:18   7 region(s) detected in record
INFO     23/11 17:21:18   Running antismash.detection.genefunctions
INFO     23/11 17:21:31   Running antismash.detection.nrps_pks_domains
INFO     23/11 17:21:36   Running antismash.modules.lanthipeptides
Traceback (most recent call last):
  File "/mnt/home/benucci/anaconda2/envs/antismash/bin/antismash", line 10, in <module>
    sys.exit(entrypoint())
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/__main__.py", line 125, in entrypoint
    sys.exit(main(sys.argv[1:]))
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/__main__.py", line 113, in main
    antismash.run_antismash(sequence, options)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/main.py", line 674, in run_antismash
    result = _run_antismash(sequence_file, options)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/main.py", line 734, in _run_antismash
    analysis_timings = analyse_record(record, options, get_analysis_modules(), module_results)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/main.py", line 299, in analyse_record
    run_module(record, module, options, previous_result, timings)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/main.py", line 271, in run_module
    results = module.run_on_record(record, results, options)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/lanthipeptides/__init__.py", line 111, in run_on_record
    return run_specific_analysis(record)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/lanthipeptides/specific_analysis.py", line 757, in run_specific_analysis
    run_lanthi_on_genes(record, gene, cluster, neighbours, results)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/lanthipeptides/specific_analysis.py", line 705, in run_lanthi_on_genes
    result_vec = run_lanthipred(record, candidate, lant_class, domains)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/lanthipeptides/specific_analysis.py", line 559, in run_lanthipred
    cleavage_result = predict_cleavage_site(profile, lan_a_fasta)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/lanthipeptides/specific_analysis.py", line 435, in predict_cleavage_site
    hmmer_res = subprocessing.run_hmmpfam2(query_hmmfile, target_sequence)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/common/subprocessing/hmmpfam.py", line 39, in run_hmmpfam2
    result = execute(command, stdin=target_sequence)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/common/subprocessing/base.py", line 95, in execute
    stderr == PIPE)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/common/subprocessing/base.py", line 32, in __init__
    self.stdout = stdout.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 542: invalid start byte

System information: I am running antiSMASH on an HPC cluster with the following kernel and OS release:

[benucci@dev-amd20 ~]$ uname -r
3.10.0-1160.36.2.el7.x86_64

[benucci@dev-amd20 ~]$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

antiSMASH run:

antismash \
    --cpus $cores   \
    -v \
    --taxon bacteria \
    --genefinding-gff3 /mnt/home/benucci/project_82_genomes/data/${output_dir}/PROKKA_*.gff \
    --genefinding-tool none \
    --output-dir /mnt/home/benucci/project_82_genomes/data/${output_dir}/antismash/ \
    /mnt/home/benucci/project_82_genomes/data/${output_dir}/PROKKA_*.fna

How I tried to solve it, with no success: I saw there is a similar post about this error, but it relates to running antiSMASH from Docker under Singularity, whereas I installed it through conda and am on an HPC. I tried exporting the LANG variable with export LANG=C.UTF-8, but that did not work either. Any clue? I can send one of the genomes that failed if needed to reproduce this.

Thanks much in advance! -Gian

SJShaw commented 1 year ago

I suspect your input is using UTF-16 or some other character encoding that isn't compatible with UTF-8. If you try the file command on your inputs, that will hopefully tell you what encoding the file uses, e.g.

$ file README.md 
README.md: UTF-8 Unicode text

If it is UTF-8, we'll need the contig that causes the problem to see if we can work out what's going on. If it isn't UTF-8, then the easier solution is to convert the encoding with something like iconv before running antiSMASH.
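As a complement to file and iconv, a strict decode can pinpoint the first byte that is not valid UTF-8. The following is a minimal sketch and not part of antiSMASH; the helper name first_invalid_utf8 and the sample bytes are hypothetical.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) of the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on the first bad byte
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# Hypothetical sample data; for a real input, read the file with
# open(path, "rb").read() and pass the resulting bytes in.
clean = b">contig_1\nACGT\n"
broken = b">contig_1\nAC\xffGT\n"

print(first_invalid_utf8(clean))
print(first_invalid_utf8(broken))
```

A None result for every input file would suggest the offending byte comes from somewhere other than the inputs themselves.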

Gian77 commented 1 year ago

Hi @SJShaw,

thanks a lot for the fast answer. I checked all my genomes' PROKKA_*.fna files and they are encoded as ASCII:

[benucci@dev-amd20 code]$ for dir in ../data/*/; do bgc=$(find $dir -type f -name "PROKKA*fna"); file -bi $bgc; done
text/plain; charset=us-ascii
text/plain; charset=us-ascii
text/plain; charset=us-ascii
...

Isn't ASCII a subset of UTF-8? Also, the files all have the same encoding, so I am not sure why antiSMASH runs on only some of the contig files.
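ASCII is indeed a subset of UTF-8, so an ASCII input file cannot by itself produce this error. The traceback earlier in this thread fails at stdout.decode() in subprocessing/base.py, that is, while decoding what hmmpfam2 printed rather than while reading the input. The sketch below illustrates that distinction; the bytes standing in for the tool output and the two helper names are made up for illustration.

```python
# Every 7-bit ASCII byte is also valid UTF-8.
ascii_bytes = bytes(range(128))
assert ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")

# Hypothetical stand-in for tool output containing a stray 0xff byte.
tool_output = b"Query sequence: \xff\n"


def decode_strict(raw: bytes) -> str:
    """Behave like a bare bytes.decode(): raise on invalid UTF-8."""
    return raw.decode()


def decode_lenient(raw: bytes) -> str:
    """Replace undecodable bytes with U+FFFD instead of raising."""
    return raw.decode("utf-8", errors="replace")


try:
    decode_strict(tool_output)
except UnicodeDecodeError as err:
    print(f"strict decode failed at position {err.start}")

print(decode_lenient(tool_output))
```

Lenient decoding only hides the symptom; it can help with inspecting the raw output, but it does not explain why the tool emitted those bytes.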

I read that file is not always precise, so I also tried recode as follows, and it gave me the same result.

[benucci@dev-amd20 code]$ for dir in ../data/*/; do bgc=$(find $dir -type f -name "PROKKA*fna"); if recode utf8/..UCS < $bgc >/dev/null 2>&1; then echo "Valid utf8 : $bgc"; else echo "NOT valid utf8: $bgc"; fi; done
Valid utf8 : ../data/PvP001-Pacbio_Pseudomonas_coleopterorum_pbio-2432.22830.bc1021_BAK8B_OA--bc1021_BAK8B_OA.ccs_results/PROKKA_10122022.fna
Valid utf8 : ../data/PvP002-Illumina_Pseudomonas_oryzihabitans_D_52634.2.402975.ATCGATCG-ATCGATCG_results/PROKKA_10242022.fna
Valid utf8 : ../data/PvP003-Illumina_Pseudomonas_syringae_52616.3.395054.CGTTGCAA-CGTTGCAA_results/PROKKA_10242022.fna
...

I can send you a couple of genomes if you give me an email address. I am sorry I cannot attach them here, since they aren't public in JGI yet.

Thanks so much,

-Gian

Gian77 commented 1 year ago

Hi @SJShaw,

I tried converting from ASCII to UTF-8 for one of the genome contig files that was not working (please see above).

[benucci@dev-amd20 testing]$ iconv -f ASCII -t UTF-8 < PROKKA_10112022.fna > PROKKA_10112022_conv.fna 
[benucci@dev-amd20 testing]$ iconv -f ASCII -t UTF-8 < PROKKA_10112022.gff > PROKKA_10112022_conv.gff 

Then I ran antiSMASH again on those specific files as follows:

(antismash) [benucci@dev-amd20 testing]$ antismash --cpus 20 -v --taxon bacteria --genefinding-gff3 PROKKA_10112022_conv.gff --genefinding-tool none --output-dir antismash/ PROKKA_10112022_conv.fna 

And I inevitably got the same UnicodeDecodeError, please see below:

INFO     12/12 13:32:37   antiSMASH version: 6.1.1
INFO     12/12 13:32:37   diamond using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/diamond (0.9.24)
INFO     12/12 13:32:37   hmmpfam2 using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/hmmpfam2 (2.3.2)
INFO     12/12 13:32:37   fasttree using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/fasttree
INFO     12/12 13:32:37   hmmsearch using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/hmmsearch (3.1b2)
INFO     12/12 13:32:37   hmmpress using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/hmmpress (3.1b2)
INFO     12/12 13:32:37   hmmscan using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/hmmscan (3.1b2)
INFO     12/12 13:32:37   meme using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/meme (4.11.2)
INFO     12/12 13:32:37   fimo using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/fimo (4.11.2)
INFO     12/12 13:32:37   glimmerhmm using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/glimmerhmm
INFO     12/12 13:32:37   prodigal using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/prodigal (V2.6.3)
INFO     12/12 13:32:37   muscle using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/muscle (v3.8.1551)
INFO     12/12 13:32:38   java using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/java (11.0.13)
INFO     12/12 13:32:38   blastp using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/blastp (2.5.0+)
INFO     12/12 13:32:38   makeblastdb using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/makeblastdb (2.5.0+)
INFO     12/12 13:32:38   Parsing input sequence 'PROKKA_10112022.fna'
INFO     12/12 13:32:39   GFF3 and sequence have only one record. Assuming is the same as long as coordinates are compatible.
WARNING  12/12 13:32:41   Fasta header too long: renamed "gnl|AIT|--prefix_1" to "c00001_gnl|AIT.."
INFO     12/12 13:32:46   Analysing record: c00001_gnlAIT..
INFO     12/12 13:32:46   Detecting secondary metabolite clusters
INFO     12/12 13:32:46   Running antismash.detection.hmm_detection
INFO     12/12 13:32:46   HMM detection using strictness: relaxed
INFO     12/12 13:33:02   16 region(s) detected in record
INFO     12/12 13:33:02   Running antismash.detection.genefunctions
INFO     12/12 13:33:38   Running antismash.detection.nrps_pks_domains
INFO     12/12 13:33:49   Running antismash.modules.lanthipeptides
Traceback (most recent call last):
  File "/mnt/home/benucci/anaconda2/envs/antismash/bin/antismash", line 10, in <module>
    sys.exit(entrypoint())
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/__main__.py", line 125, in entrypoint
    sys.exit(main(sys.argv[1:]))
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/__main__.py", line 113, in main
    antismash.run_antismash(sequence, options)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/main.py", line 674, in run_antismash
    result = _run_antismash(sequence_file, options)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/main.py", line 734, in _run_antismash
    analysis_timings = analyse_record(record, options, get_analysis_modules(), module_results)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/main.py", line 299, in analyse_record
    run_module(record, module, options, previous_result, timings)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/main.py", line 271, in run_module
    results = module.run_on_record(record, results, options)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/lanthipeptides/__init__.py", line 111, in run_on_record
    return run_specific_analysis(record)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/lanthipeptides/specific_analysis.py", line 757, in run_specific_analysis
    run_lanthi_on_genes(record, gene, cluster, neighbours, results)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/lanthipeptides/specific_analysis.py", line 705, in run_lanthi_on_genes
    result_vec = run_lanthipred(record, candidate, lant_class, domains)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/lanthipeptides/specific_analysis.py", line 577, in run_lanthipred
    hmmer_profiles[lant_class], lant_class)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/lanthipeptides/specific_analysis.py", line 513, in determine_precursor_peptide_candidate
    cleavage_result = run_cleavage_site_phmm(lan_a_fasta, hmmer_profile, THRESH_DICT[lant_class])
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/lanthipeptides/specific_analysis.py", line 478, in run_cleavage_site_phmm
    return predict_cleavage_site(profile, fasta, threshold)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/lanthipeptides/specific_analysis.py", line 435, in predict_cleavage_site
    hmmer_res = subprocessing.run_hmmpfam2(query_hmmfile, target_sequence)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/common/subprocessing/hmmpfam.py", line 39, in run_hmmpfam2
    result = execute(command, stdin=target_sequence)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/common/subprocessing/base.py", line 95, in execute
    stderr == PIPE)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/common/subprocessing/base.py", line 32, in __init__
    self.stdout = stdout.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 542: invalid start byte

What should I do? I can send this contigs file to you if that helps. Thanks so much! -Gian

Gian77 commented 1 year ago

Hello again, can I have an email address to send the contigs to? Thanks so much! Gian

SJShaw commented 1 year ago

If you head here and start a ticket, you can then respond to the email you get, attaching the (first?) offending contig.

Gian77 commented 1 year ago

@SJShaw Thanks a lot, will do it right now.

Gian77 commented 1 year ago

Hello @SJShaw, did you get the contig that was giving me trouble? I sent it twice. Please let me know, otherwise I will attach it here; even though it is unpublished, that should not be a big issue. Thanks much! G.

SJShaw commented 1 year ago

No, nothing's come through. I suspect it might be over the file size limit for the email system.

If you'd prefer a slightly more private version of sharing it, you could submit it to the antiSMASH webservice and then forward the job ID along to that ticket you created.

Gian77 commented 1 year ago

Hello @SJShaw ,

No worries, I think sharing it here is fine for now. I am sending contig files from 2 genomes. One worked fine with antiSMASH (in case you need it as a comparison), the other did not. They are in separate folders. In total, I have about 20 contig files (out of about 110) from genomes for which antiSMASH gave me the same issue. If needed, I can send them all (in that case maybe through email or FTP).

Thanks a lot, Gian

SJShaw commented 1 year ago

Well, the bad news is that both genomes work fine for me, which means it's likely an environment problem.

Neither of these have a lanthipeptide protocluster, which is the module where your example logs above have the error. For the non-working variant in your upload, could you give the stack trace that you get for it?

Edit: The non-working one does have a thiopeptide, which would use very similar logic as the lanthipeptide, where the working one doesn't.

Digging into the GFF file, I can see some URL encoded elements, like product=3'%2C5'-cyclic adenosine, which the non-working variant has more of. The general CDS naming scheme starting with --prefix also seems a little risky. It's possible that your particular hmmpfam2 doesn't handle those very well at all.
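To see how many such elements an annotation contains, the attribute column of a GFF3 file can be scanned for percent-encoded values and for values that begin with a dash. This is only a sketch under stated assumptions: the function name risky_attributes, the sample line, and the ID value are hypothetical, and only the product value is taken from the text above.

```python
import re
from urllib.parse import unquote

PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def risky_attributes(gff_line: str) -> list:
    """Return (key, decoded value, reason) tuples for suspicious attributes."""
    columns = gff_line.rstrip("\n").split("\t")
    if gff_line.startswith("#") or len(columns) != 9:
        return []  # comments, directives, and malformed lines are skipped
    findings = []
    for pair in columns[8].split(";"):
        key, sep, value = pair.partition("=")
        if not sep:
            continue
        if PERCENT_ESCAPE.search(value):
            findings.append((key, unquote(value), "percent-encoded"))
        if value.startswith("-"):
            findings.append((key, value, "starts with a dash"))
    return findings


# Hypothetical GFF3 line modelled on the values quoted in this thread.
sample = "\t".join([
    "gnl|AIT|--prefix_1", "example", "CDS", "1", "300", ".", "+", "0",
    "ID=--prefix_00001;product=3'%2C5'-cyclic adenosine",
])
for finding in risky_attributes(sample):
    print(finding)
```

Percent-encoding is legitimate GFF3, so a hit is not an error in itself; the count is only a hint about where a downstream tool might trip.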

Gian77 commented 1 year ago

@SJShaw

interesting... Here is the trace...

(antismash) [benucci@dev-amd20 antismash_test]$ antismash --cpus 20 -v --taxon bacteria --genefinding-gff3 PROKKA_10242022.gff --genefinding-tool none --output-dir output/ PROKKA_10242022.fna 
INFO     12/01 12:15:52   antiSMASH version: 6.1.1
INFO     12/01 12:15:52   diamond using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/diamond (0.9.24)
INFO     12/01 12:15:52   hmmpfam2 using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/hmmpfam2 (2.3.2)
INFO     12/01 12:15:52   fasttree using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/fasttree
INFO     12/01 12:15:52   hmmsearch using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/hmmsearch (3.1b2)
INFO     12/01 12:15:52   hmmpress using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/hmmpress (3.1b2)
INFO     12/01 12:15:52   hmmscan using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/hmmscan (3.1b2)
INFO     12/01 12:15:52   meme using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/meme (4.11.2)
INFO     12/01 12:15:52   fimo using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/fimo (4.11.2)
INFO     12/01 12:15:52   glimmerhmm using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/glimmerhmm
INFO     12/01 12:15:52   prodigal using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/prodigal (V2.6.3)
INFO     12/01 12:15:52   muscle using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/muscle (v3.8.1551)
INFO     12/01 12:15:52   java using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/java (11.0.13)
INFO     12/01 12:15:53   blastp using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/blastp (2.5.0+)
INFO     12/01 12:15:53   makeblastdb using executable: /mnt/home/benucci/anaconda2/envs/antismash/bin/makeblastdb (2.5.0+)
INFO     12/01 12:15:53   Parsing input sequence 'PROKKA_10242022.fna'
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_1" to "c00001_gnl|AIT.."
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_2" to "c00002_gnl|AIT.."
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_3" to "c00003_gnl|AIT.."
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_4" to "c00004_gnl|AIT.."
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_5" to "c00005_gnl|AIT.."
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_6" to "c00006_gnl|AIT.."
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_7" to "c00007_gnl|AIT.."
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_8" to "c00008_gnl|AIT.."
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_9" to "c00009_gnl|AIT.."
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_16" to "c00010_gnl|AIT.."
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_17" to "c00011_gnl|AIT.."
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_18" to "c00012_gnl|AIT.."
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_19" to "c00013_gnl|AIT.."
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_20" to "c00014_gnl|AIT.."
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_21" to "c00015_gnl|AIT.."
WARNING  12/01 12:15:55   Fasta header too long: renamed "gnl|AIT|--prefix_22" to "c00016_gnl|AIT.."
INFO     12/01 12:15:57   No genes found, skipping record
INFO     12/01 12:15:57   No genes found, skipping record
INFO     12/01 12:15:58   Analysing record: c00001_gnlAIT..
INFO     12/01 12:15:58   Detecting secondary metabolite clusters
INFO     12/01 12:15:58   Running antismash.detection.hmm_detection
INFO     12/01 12:15:58   HMM detection using strictness: relaxed
INFO     12/01 12:15:59   No regions detected, skipping record
INFO     12/01 12:15:59   Analysing record: c00002_gnlAIT..
INFO     12/01 12:15:59   Detecting secondary metabolite clusters
INFO     12/01 12:15:59   Running antismash.detection.hmm_detection
INFO     12/01 12:15:59   HMM detection using strictness: relaxed
INFO     12/01 12:16:00   No regions detected, skipping record
INFO     12/01 12:16:00   Analysing record: c00003_gnlAIT..
INFO     12/01 12:16:00   Detecting secondary metabolite clusters
INFO     12/01 12:16:00   Running antismash.detection.hmm_detection
INFO     12/01 12:16:00   HMM detection using strictness: relaxed
INFO     12/01 12:16:01   No regions detected, skipping record
INFO     12/01 12:16:01   Analysing record: c00004_gnlAIT..
INFO     12/01 12:16:01   Detecting secondary metabolite clusters
INFO     12/01 12:16:01   Running antismash.detection.hmm_detection
INFO     12/01 12:16:01   HMM detection using strictness: relaxed
INFO     12/01 12:16:02   1 region(s) detected in record
INFO     12/01 12:16:02   Running antismash.detection.genefunctions
INFO     12/01 12:16:03   Running antismash.detection.nrps_pks_domains
INFO     12/01 12:16:04   Running antismash.modules.lanthipeptides
INFO     12/01 12:16:04   Running antismash.modules.lassopeptides
INFO     12/01 12:16:04   Running antismash.modules.nrps_pks
INFO     12/01 12:16:04   Running antismash.modules.sactipeptides
INFO     12/01 12:16:04   Running antismash.modules.t2pks
INFO     12/01 12:16:04   Running antismash.modules.thiopeptides
INFO     12/01 12:16:04   Running antismash.modules.tta
INFO     12/01 12:16:04   Skipping TTA codon detection, GC content too low: 49%
INFO     12/01 12:16:04   Analysing record: c00005_gnlAIT..
INFO     12/01 12:16:04   Detecting secondary metabolite clusters
INFO     12/01 12:16:04   Running antismash.detection.hmm_detection
INFO     12/01 12:16:04   HMM detection using strictness: relaxed
INFO     12/01 12:16:04   No regions detected, skipping record
INFO     12/01 12:16:04   Analysing record: c00006_gnlAIT..
INFO     12/01 12:16:04   Detecting secondary metabolite clusters
INFO     12/01 12:16:04   Running antismash.detection.hmm_detection
INFO     12/01 12:16:04   HMM detection using strictness: relaxed
INFO     12/01 12:16:05   No regions detected, skipping record
INFO     12/01 12:16:05   Analysing record: c00009_gnlAIT..
INFO     12/01 12:16:05   Detecting secondary metabolite clusters
INFO     12/01 12:16:05   Running antismash.detection.hmm_detection
INFO     12/01 12:16:05   HMM detection using strictness: relaxed
INFO     12/01 12:16:10   3 region(s) detected in record
INFO     12/01 12:16:10   Running antismash.detection.genefunctions
INFO     12/01 12:16:18   Running antismash.detection.nrps_pks_domains
INFO     12/01 12:16:20   Running antismash.modules.lanthipeptides
INFO     12/01 12:16:20   Running antismash.modules.lassopeptides
INFO     12/01 12:16:20   Running antismash.modules.nrps_pks
INFO     12/01 12:16:20   Predicting A domain substrate specificities with NRPSPredictor2
INFO     12/01 12:16:22   Predicting CAL domain substrate specificities by Minowa et al. method
INFO     12/01 12:16:22   Predicting PKS KR activity and stereochemistry using KR fingerprints from Starcevic et al.
INFO     12/01 12:16:22   Running antismash.modules.sactipeptides
INFO     12/01 12:16:22   Running antismash.modules.t2pks
INFO     12/01 12:16:22   Running antismash.modules.thiopeptides
Traceback (most recent call last):
  File "/mnt/home/benucci/anaconda2/envs/antismash/bin/antismash", line 10, in <module>
    sys.exit(entrypoint())
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/__main__.py", line 125, in entrypoint
    sys.exit(main(sys.argv[1:]))
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/__main__.py", line 113, in main
    antismash.run_antismash(sequence, options)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/main.py", line 674, in run_antismash
    result = _run_antismash(sequence_file, options)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/main.py", line 734, in _run_antismash
    analysis_timings = analyse_record(record, options, get_analysis_modules(), module_results)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/main.py", line 299, in analyse_record
    run_module(record, module, options, previous_result, timings)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/main.py", line 271, in run_module
    results = module.run_on_record(record, results, options)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/thiopeptides/__init__.py", line 88, in run_on_record
    return specific_analysis(record)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/thiopeptides/specific_analysis.py", line 617, in specific_analysis
    result_vec = run_thiopred(thio_feature, thio_type, domains)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/thiopeptides/specific_analysis.py", line 528, in run_thiopred
    result = determine_precursor_peptide_candidate(query, domains)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/thiopeptides/specific_analysis.py", line 459, in determine_precursor_peptide_candidate
    end, score = run_cleavage_site_phmm(thio_a_fasta, 'thio_cleave.hmm', -3.00)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/thiopeptides/specific_analysis.py", line 418, in run_cleavage_site_phmm
    return predict_cleavage_site(profile, input_fasta, threshold)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/modules/thiopeptides/specific_analysis.py", line 334, in predict_cleavage_site
    hmmer_res = subprocessing.run_hmmpfam2(query_hmmfile, target_sequence)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/common/subprocessing/hmmpfam.py", line 39, in run_hmmpfam2
    result = execute(command, stdin=target_sequence)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/common/subprocessing/base.py", line 95, in execute
    stderr == PIPE)
  File "/mnt/home/benucci/anaconda2/envs/antismash/lib/python3.7/site-packages/antismash/common/subprocessing/base.py", line 32, in __init__
    self.stdout = stdout.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 545: invalid start byte
(antismash) [benucci@dev-amd20 antismash_test]$ 

Edit: I believe the >--prefix naming is standard for PROKKA... hmm. What should I try? Any ideas?

Thanks much, G.

SJShaw commented 1 year ago

--prefix is part of the argument you give prokka, it should be something like --prefix something, but maybe in the runs that generated this it ended up --prefix something--prefix? It's certainly not something I've seen in quite a lot of PROKKA outputs.

That it's falling over again in hmmpfam2 doesn't surprise me. Try running the nisin cluster that's shipped with antiSMASH as a test case, in the source tree it's antismash/test/integration/data/nisin.gbk. If that still fails in hmmpfam2, it's definitely your environment at fault, possibly a bad/strange build of the HMMer binary.

If it works, test out the theory that it's gene naming/annotations by not using your GFF annotations, just use prodigal to do the gene finding. Run with --genefinding-tool prodigal instead of --genefinding-gff3 ... and it will still find that particular thiopeptide cluster with exactly the same genes. If it works after that, then it's your annotations.

Gian77 commented 1 year ago

Hello @SJShaw ,

Well, thanks so much. The Prokka output is exactly the problem, in particular the --prefix. In my original annotation script I was assigning a variable, the strain of the genome, to --prefix. The fact is that for some genomes I have no strain, so Prokka was weirdly adding --prefix to the CDS labels in the files of the genomes with no strain. So weird. I removed --prefix from the script and now it seems to work OK; I tested on the same offending contig file I sent you.
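One way to guard against this kind of mistake is to build the command as an argument list and add a flag only when its value is non-empty. The original annotation script is not shown in this thread, so the sketch below is an assumption about its general shape; add_option, build_prokka_args, and the file names are hypothetical, while --outdir and --prefix are existing Prokka options.

```python
from typing import List, Optional


def add_option(args: List[str], flag: str, value: Optional[str]) -> None:
    """Append `flag value` only when value is a usable, non-empty string."""
    if value is None or not value.strip():
        return  # skip the flag entirely instead of passing it without a value
    if value.startswith("-"):
        raise ValueError(f"suspicious value for {flag}: {value!r}")
    args.extend([flag, value])


def build_prokka_args(fasta: str, outdir: str, strain: Optional[str]) -> List[str]:
    """Assemble a Prokka command line as a list, one argument per element."""
    args = ["prokka", "--outdir", outdir]
    add_option(args, "--prefix", strain)
    args.append(fasta)
    return args


print(build_prokka_args("contigs.fasta", "annotation", "PvP119"))
print(build_prokka_args("contigs.fasta", "annotation", ""))
```

Passing the list to subprocess.run without shell=True keeps each value as a single argument, so an empty strain cannot cause a neighbouring flag to be consumed as a value.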

I guess, I am going to reannotate those contigs that did not work.

Really, thanks so much for helping me out with this!

Gian

SJShaw commented 1 year ago

No problem, I'm glad it's resolved. We might still make some changes to protect against bad names like this.