VIB-PSB / ksrates

ksrates is a tool to position whole-genome duplications relative to speciation events using substitution-rate-adjusted mixed paralog-ortholog Ks distributions.
https://ksrates.readthedocs.io
GNU General Public License v3.0

ERROR Unexpected internal error during analysis of gene family GF_000001 #48

Closed jwang147 closed 1 year ago

jwang147 commented 1 year ago

Hello, when I run ksrates on both the test data and my own data, I get the error below at the paralogs-ks step. Could you give some advice? Thank you very much.

ksrates paralogs-ks config_elaeis.txt --n-threads=20

INFO    - - - - - - - - - - - - - - - - - - - - - 
INFO    Paralog wgd analysis for species [elaeis]
INFO    Tue Apr 11 16:51:53 2023
INFO    - - - - - - - - - - - - - - - - - - - - - 
INFO    Checking if sequence data files exist and if sequence IDs are compatible with wgd pipeline...
INFO    Completed
INFO    Running wgd paralog Ks pipeline...
INFO    ---
INFO    Checking external software...
INFO    makeblastdb: 2.12.0+
INFO    blastp: 2.12.0+
INFO    mcl 14-137
INFO    muscle 5.1.linux64 []
INFO    AAML in paml version 4.9, March 2015
INFO    Usage for FastTree version 2.1.11 Double precision (No SSE3):
INFO    Creating output directory /home/jwang/my_apps/ksrates/test/paralog_distributions/wgd_elaeis
INFO    Translating CDS file elaeis.fasta...
INFO    ---
INFO    Running all versus all Blastp
INFO    Writing protein Blastdb sequences to /home/jwang/my_apps/ksrates/test/paralog_distributions/wgd_elaeis/elaeis.blast_tmp/...
INFO    Writing protein query sequences to /home/jwang/my_apps/ksrates/test/paralog_distributions/wgd_elaeis/elaeis.blast_tmp/...
INFO    Performing all versus all Blastp (this might take a while)...
INFO    Making Blastdb
INFO    makeblastdb -in /home/jwang/my_apps/ksrates/test/paralog_distributions/wgd_elaeis/elaeis.blast_tmp/elaeis.db.fasta -dbtype prot
INFO    makeblastdb output:
Building a new DB, current time: 04/11/2023 16:51:56
New DB name:   /home/jwang/my_apps/ksrates/test/paralog_distributions/wgd_elaeis/elaeis.blast_tmp/elaeis.db.fasta
New DB title:  /home/jwang/my_apps/ksrates/test/paralog_distributions/wgd_elaeis/elaeis.blast_tmp/elaeis.db.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 500 sequences in 0.0105121 seconds.
INFO    Running Blastp
INFO    blastp -db /home/jwang/my_apps/ksrates/test/paralog_distributions/wgd_elaeis/elaeis.blast_tmp/elaeis.db.fasta -query /home/jwang/my_apps/ksrates/test/paralog_distributions/wgd_elaeis/elaeis.blast_tmp/elaeis.query.fasta -evalue 1e-10 -outfmt 6 -num_threads 20 -out /home/jwang/my_apps/ksrates/test/paralog_distributions/wgd_elaeis/elaeis.blast.tsv
INFO    All versus all Blastp done
INFO    Removing tmp directory
INFO    ---
INFO    Running gene family construction (MCL clustering with inflation factor = 2.0)
INFO    Started MCL clustering (mcl)
INFO    ---
INFO    Running whole paranome Ks analysis...
INFO    Started analysis of 66 gene families in parallel using 20 threads
INFO    Performing analysis on gene family GF_000001 (size 13)
INFO    Performing analysis on gene family GF_000002 (size 9)
INFO    Performing analysis on gene family GF_000003 (size 6)
INFO    Performing analysis on gene family GF_000004 (size 6)
ERROR   Unexpected internal error during analysis of gene family GF_000001:
Traceback (most recent call last):
  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/ks_distribution.py", line 280, in analyse_family_try_except
    n_families, is_last_family)
  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/ks_distribution.py", line 371, in analyse_family
    msa_path, stats, successful = prepare_aln(msa_path_protein, nucleotide)
  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/alignment.py", line 43, in prepare_aln
    with open(msa_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/jwang/my_apps/ksrates/test/paralog_distributions/wgd_elaeis/elaeis.ks_tmp/GF_000001.fasta.msa'
ERROR   Skipping gene family
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Unexpected internal error during analysis of gene family GF_000003:
Traceback (most recent call last):
  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/ks_distribution.py", line 280, in analyse_family_try_except
    n_families, is_last_family)
  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/ks_distribution.py", line 371, in analyse_family
    msa_path, stats, successful = prepare_aln(msa_path_protein, nucleotide)
  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/alignment.py", line 43, in prepare_aln
    with open(msa_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/jwang/my_apps/ksrates/test/paralog_distributions/wgd_elaeis/elaeis.ks_tmp/GF_000003.fasta.msa'
ERROR   Skipping gene family
ERROR   Unexpected internal error during analysis of gene family GF_000002:
Traceback (most recent call last):
  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/ks_distribution.py", line 280, in analyse_family_try_except
    n_families, is_last_family)
  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/ks_distribution.py", line 371, in analyse_family
    msa_path, stats, successful = prepare_aln(msa_path_protein, nucleotide)
  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/alignment.py", line 43, in prepare_aln
    with open(msa_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/jwang/my_apps/ksrates/test/paralog_distributions/wgd_elaeis/elaeis.ks_tmp/GF_000002.fasta.msa'
ERROR   Skipping gene family
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Unexpected internal error during analysis of gene family GF_000004:
Traceback (most recent call last):
  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/ks_distribution.py", line 280, in analyse_family_try_except
    n_families, is_last_family)
  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/ks_distribution.py", line 371, in analyse_family
    msa_path, stats, successful = prepare_aln(msa_path_protein, nucleotide)
  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/alignment.py", line 43, in prepare_aln
    with open(msa_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/jwang/my_apps/ksrates/test/paralog_distributions/wgd_elaeis/elaeis.ks_tmp/GF_000004.fasta.msa'
ERROR   Skipping gene family
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   Too many gene family analyses failed, terminating threads...
ERROR   --
ERROR   The analyses of more than 1% of gene families [4/66] have failed due to unexpected internal errors
ERROR   Please check the nature of the error(s), remove the tmp directory [/home/jwang/my_apps/ksrates/test/paralog_distributions/wgd_elaeis/elaeis.ks_tmp] and rerun the Ks analysis
ERROR   See the tracebacks above for the following gene family IDs:
ERROR   GF_000001, GF_000002, GF_000003, GF_000004
ERROR   Exiting

jwang147 commented 1 year ago

$ ksrates paralogs-ks config_Dini.txt --n-threads=20

INFO    Paralog wgd analysis for species [Dini]
INFO    Tue Apr 11 16:00:41 2023
INFO    - - - - - - - - - - - - - - - - - - - - 
INFO    Checking if sequence data files exist and if sequence IDs are compatible with wgd pipeline...
INFO    Completed
INFO    Creating directory [paralog_distributions/]
INFO    Running wgd paralog Ks pipeline...
INFO    ---
INFO    Checking external software...
INFO    makeblastdb: 2.12.0+
INFO    blastp: 2.12.0+
INFO    mcl 14-137
INFO    muscle 5.1.linux64 []
INFO    AAML in paml version 4.9, March 2015
INFO    FastTree Version 2.1.11 Double precision (No SSE3)
INFO    Creating output directory /data/jwang/1.2.ksrate/paralog_distributions/wgd_Dini
INFO    Translating CDS file Dini.cds...
WARNING Sequence length != multiple of 3 for Chr01G001412.mRNA1!
WARNING Invalid codon   G in Chr01G001412.mRNA1
WARNING Sequence length != multiple of 3 for Chr02G004296.mRNA1!
WARNING Invalid codon   G in Chr02G004296.mRNA1
WARNING Sequence length != multiple of 3 for Chr03G004056.mRNA1!
WARNING Invalid codon   G in Chr03G004056.mRNA1
WARNING Sequence length != multiple of 3 for Chr05G004423.mRNA1!
WARNING Invalid codon  TT in Chr05G004423.mRNA1
WARNING Sequence length != multiple of 3 for Chr05G004423.mRNA2!
WARNING Invalid codon  TT in Chr05G004423.mRNA2
WARNING Sequence length != multiple of 3 for Chr07G000260.mRNA1!
WARNING Invalid codon   C in Chr07G000260.mRNA1
WARNING There were 12 warnings during translation
INFO    ---
INFO    Running all versus all Blastp
INFO    Writing protein Blastdb sequences to /data/jwang/1.2.ksrate/paralog_distributions/wgd_Dini/Dini.blast_tmp/...
INFO    Writing protein query sequences to /data/jwang/1.2.ksrate/paralog_distributions/wgd_Dini/Dini.blast_tmp/...
INFO    Performing all versus all Blastp (this might take a while)...
INFO    Making Blastdb
INFO    makeblastdb -in /data/jwang/1.2.ksrate/paralog_distributions/wgd_Dini/Dini.blast_tmp/Dini.db.fasta -dbtype prot
INFO    makeblastdb output:
Building a new DB, current time: 04/11/2023 16:00:53
New DB name:   /data/jwang/1.2.ksrate/paralog_distributions/wgd_Dini/Dini.blast_tmp/Dini.db.fasta
New DB title:  /data/jwang/6-dini_genome/1.2.ksrate/paralog_distributions/wgd_Dini/Dini.blast_tmp/Dini.db.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 46825 sequences in 0.830938 seconds.
INFO    Running Blastp
INFO    blastp -db /data/jwang/1.2.ksrate/paralog_distributions/wgd_Dini/Dini.blast_tmp/Dini.db.fasta -query /data/jwang/6-dini_genome/1.2.ksrate/paralog_distributions/wgd_Dini/Dini.blast_tmp/Dini.query.fasta -evalue 1e-10 -outfmt 6 -num_threads 20 -out /data/jwang/1.2.ksrate/paralog_distributions/wgd_Dini/Dini.blast.tsv
INFO    All versus all Blastp done
INFO    Removing tmp directory
INFO    ---
INFO    Running gene family construction (MCL clustering with inflation factor = 2.0)
INFO    Started MCL clustering (mcl)
INFO    ---
INFO    Running whole paranome Ks analysis...
WARNING Filtered out the largest gene family because its size is > 200
WARNING If you want to analyse this large family anyhow, please raise the `max_gene_family_size` parameter
INFO    Started analysis of 7449 gene families in parallel using 20 threads
INFO    Performing analysis on gene family GF_000002 (size 194)
ERROR   Unexpected internal error during analysis of gene family GF_000002:
Traceback (most recent call last):
  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/ks_distribution.py", line 280, in analyse_family_try_except
    n_families, is_last_family)
  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/ks_distribution.py", line 371, in analyse_family
    msa_path, stats, successful = prepare_aln(msa_path_protein, nucleotide)
  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/alignment.py", line 43, in prepare_aln
    with open(msa_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/data/jwang/1.2.ksrate/paralog_distributions/wgd_Dini/Dini.ks_tmp/GF_000002.fasta.msa'
ERROR   Skipping gene family
INFO    Performing analysis on gene family GF_000003 (size 120)
ERROR   Unexpected internal error during analysis of gene family GF_000003:
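
Separately from the FileNotFoundError, the translation warnings earlier in this log (`Sequence length != multiple of 3`) can be traced back to individual CDS entries. Below is a minimal sketch in plain Python, assuming a nucleotide FASTA file such as `Dini.cds`. The helper names `parse_fasta` and `ids_with_partial_codon` are hypothetical and are not part of ksrates or wgd.

```python
import sys
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {sequence_id: sequence}.

    The ID is taken as the first whitespace-delimited token of the header.
    """
    records: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def ids_with_partial_codon(records: Dict[str, str]) -> List[str]:
    """Return IDs of sequences whose length is not a multiple of 3."""
    return [seq_id for seq_id, seq in records.items() if len(seq) % 3 != 0]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (file name is an example): python check_cds.py Dini.cds
    with open(sys.argv[1]) as handle:
        for seq_id in ids_with_partial_codon(parse_fasta(handle)):
            print(seq_id)
```

This only reports the affected IDs. Whether to trim, fix, or drop those sequences depends on the annotation.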

Cecilia-Sensalari commented 1 year ago

Hi! Thanks for reaching out. It seems that there is a problem with running the multiple sequence alignment step, as the output files (e.g. "GF_000001.fasta.msa") don't exist:

  File "/home/jwang/miniconda3/envs/wgd/lib/python3.7/site-packages/wgd_ksrates/alignment.py", line 43, in prepare_aln
    with open(msa_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/jwang/my_apps/ksrates/test/paralog_distributions/wgd_elaeis/elaeis.ks_tmp/GF_000001.fasta.msa'
ERROR   Skipping gene family

I don't know what causes it exactly, but it might be a version issue: you're using MUSCLE version 5.1.linux64, while I tested with MUSCLE v3.8.31. Would it be possible to try using this other version? Note that in case installing the required software is a problem, you can make use of a Docker or Singularity container.
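
A minimal sketch for checking which MUSCLE major version is on your PATH before rerunning. It assumes the executable is called `muscle`, that it accepts a `-version` flag, and that its output resembles either `muscle 5.1.linux64 []` (as in the log above) or `MUSCLE v3.8.31 ...`. The function name `muscle_major_version` is hypothetical and is not part of ksrates.

```python
import re
import shutil
import subprocess
from typing import Optional


def muscle_major_version(version_output: str) -> Optional[int]:
    """Extract the major version number from MUSCLE's version text.

    Returns None if the text does not look like a MUSCLE version string.
    """
    match = re.search(r"muscle\s+v?(\d+)\.", version_output, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    exe = shutil.which("muscle")
    if exe is None:
        print("muscle not found on PATH")
    else:
        # Assumption: `muscle -version` prints the version string and exits.
        result = subprocess.run([exe, "-version"], capture_output=True, text=True)
        major = muscle_major_version(result.stdout + result.stderr)
        if major is None:
            print("Could not parse MUSCLE version output")
        elif major != 3:
            print(f"Found MUSCLE major version {major}; ksrates was tested with v3.8.31")
        else:
            print("Found MUSCLE 3.x")
```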

To get ahead of other versioning issues, here are other version differences that might cause trouble:

jwang147 commented 1 year ago

@Cecilia-Sensalari Thank you both for your reply! I reinstalled the ksrates software and replaced BLAST with version 2.5.0. Now it's working.

Cecilia-Sensalari commented 1 year ago

No problem! Good to know :)