etetoolkit / ete

Python package for building, comparing, annotating, manipulating and visualising trees. It provides a comprehensive API and a collection of command line tools, including utilities to work with the NCBI taxonomy tree.
http://etetoolkit.org
GNU General Public License v3.0

FATAL: Sequence file contains no data #431

Open Ahmed-Shibl opened 4 years ago

Ahmed-Shibl commented 4 years ago

I've been getting an error recently and I'm not sure how to fix it.

The command I run is:

ete3 build -w standard_trimmed_raxml_bootstrap -a ~/miniconda3/envs/ete3/azer_tree/ncbinr_azer_tree_v1/AzeR_ncbinr_alignment_1_200aa_headers_NEW_4.fa -o ~/miniconda3/envs/ete3/azer_tree/ncbinr_azer_tree_v1 --cpu 60 --noimg

And the error I get is:

Traceback (most recent call last):
  File "/home/as11798/miniconda3/envs/ete3/bin/ete3", line 11, in <module>
    load_entry_point('ete3==3.1.1', 'console_scripts', 'ete3')()
  File "/home/as11798/miniconda3/envs/ete3/lib/python2.7/site-packages/ete3/tools/ete.py", line 95, in main
    _main(sys.argv)
  File "/home/as11798/miniconda3/envs/ete3/lib/python2.7/site-packages/ete3/tools/ete.py", line 146, in _main
    ete_build._main(arguments, builtin_apps_path)
  File "/home/as11798/miniconda3/envs/ete3/lib/python2.7/site-packages/ete3/tools/ete_build.py", line 1102, in _main
    app_wrapper(main, args)
  File "/home/as11798/miniconda3/envs/ete3/lib/python2.7/site-packages/ete3/tools/ete_build_lib/interface.py", line 372, in app_wrapper
    main(None, func, args)
  File "/home/as11798/miniconda3/envs/ete3/lib/python2.7/site-packages/ete3/tools/ete_build_lib/interface.py", line 494, in main
    func(args)
  File "/home/as11798/miniconda3/envs/ete3/lib/python2.7/site-packages/ete3/tools/ete_build.py", line 528, in main
    seqname2seqid = seqio.load_sequences(args, "aa", target_seqs, target_species, seqname2seqid)
  File "/home/as11798/miniconda3/envs/ete3/lib/python2.7/site-packages/ete3/tools/ete_build_lib/seqio.py", line 84, in load_sequences
    estimated_time = ((len(target_seqs)-len(loaded_seqs)) * (time.time()-start_time)) / float(c1)
TypeError: object of type 'NoneType' has no len()

Here's where it gets interesting: when I run it again with the --resume flag, like this:

ete3 build --resume -w standard_trimmed_raxml_bootstrap -a /home/as11798/miniconda3/envs/ete3/azer_tree/ncbinr_azer_tree_v1/AzeR_ncbinr_alignment_1_200aa_headers_NEW_4.fa -o /home/as11798/miniconda3/envs/ete3/azer_tree/ncbinr_azer_tree_v1 --cpu 60 --noimg

I get this:

INFO -  Starting ETE-build execution at Sun Dec 15 12:19:03 2019
INFO -  Output directory /home/as11798/miniconda3/envs/ete3/azer_tree/ncbinr_azer_tree_v1
WRNG -  Using existing dir: /home/as11798/miniconda3/envs/ete3/azer_tree/ncbinr_azer_tree_v1/tmp
WRNG -  Using existing dir: /home/as11798/miniconda3/envs/ete3/azer_tree/ncbinr_azer_tree_v1/tasks
WRNG -  Using existing dir: /home/as11798/miniconda3/envs/ete3/azer_tree/ncbinr_azer_tree_v1/input
WRNG -  Using existing dir: /home/as11798/miniconda3/envs/ete3/azer_tree/ncbinr_azer_tree_v1/db
WRNG -  Reusing sequences from existing database!
WRNG -  0 target sequences
INFO -  ETE build starts now!
INFO -   Updating tasks status: (Sun Dec 15 12:19:03 2019)
INFO -  Thread clustalo_default-trimal01-none-raxml_default_bootstrap: pending tasks: 1 of sizes: 0
INFO -   (W) MultiSeqTask ( aa seqs, MSF, /clustalo_d..._bootstrap)
INFO -  Launched 0 jobs. 0(R), 0(W). Cores usage: 0/60
INFO -   (D) MultiSeqTask ( aa seqs, MSF, /clustalo_d..._bootstrap)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Sun Dec 15 12:19:05 2019)
INFO -  Thread clustalo_default-trimal01-none-raxml_default_bootstrap: pending tasks: 1 of sizes: 0
INFO -   (W) AlgTask ( aa seqs, Clustal-Omega, /clustalo_d..._bootstrap)
INFO -  Waiting 2 seconds
INFO -  Launched 1 jobs. 1(R), 0(W). Cores usage: 60/60
INFO -   Updating tasks status: (Sun Dec 15 12:19:07 2019)
INFO -  Thread clustalo_default-trimal01-none-raxml_default_bootstrap: pending tasks: 1 of sizes: 0
INFO -   (W) AlgTask ( aa seqs, Clustal-Omega, /clustalo_d..._bootstrap)
ERR  -        Job error reported: Job (clustalo---threads-60, 777dad)
ERR  -        Errors found in AlgTask ( aa seqs, Clustal-Omega, /clustalo_d..._bootstrap)
Traceback (most recent call last):
  File "/home/as11798/miniconda3/envs/ete3/lib/python2.7/site-packages/ete3/tools/ete_build_lib/scheduler.py", line 257, in schedule
    task.status = task.get_status(qstat_jobs)
  File "/home/as11798/miniconda3/envs/ete3/lib/python2.7/site-packages/ete3/tools/ete_build_lib/master_task.py", line 198, in get_status
    self.job_status = self.get_jobs_status(sge_jobs)
  File "/home/as11798/miniconda3/envs/ete3/lib/python2.7/site-packages/ete3/tools/ete_build_lib/master_task.py", line 306, in get_jobs_status
    raise TaskError(j, "Job execution error %s" %errorpath)
TaskError: Job execution error /home/as11798/miniconda3/envs/ete3/azer_tree/ncbinr_azer_tree_v1/tasks/777dad948a1155b2ad6cd4b6e3d9b3dd
INFO -  Waiting 2 seconds
INFO -  Launched 0 jobs. 0(R), 0(W). Cores usage: 0/60
ERR  -  Thread clustalo_default-trimal01-none-raxml_default_bootstrap contains errors:
ERR  -   ** AlgTask ( aa seqs, Clustal-Omega, /clustalo_d..._bootstrap)
ERR  -        -> Job (clustalo---threads-60, 777dad)
ERR  -        -> /home/as11798/miniconda3/envs/ete3/azer_tree/ncbinr_azer_tree_v1/tasks/777dad948a1155b2ad6cd4b6e3d9b3dd
ERR  -          -> Job execution error /home/as11798/miniconda3/envs/ete3/azer_tree/ncbinr_azer_tree_v1/tasks/777dad948a1155b2ad6cd4b6e3d9b3dd
ERR  -  Done with ERRORS

Data Error: Errors found in some tasks

The stderr file in /tasks contains: FATAL: Sequence file contains no data

The contents of the input file, as shown by head, more, or less, are:

AzeR|Gene
MSDEVEDSSNPKDRKYVEALARGLDVLRAFTHGSVVLGNQEISRITGLPKATVSRMTYTL
TQLGYLCYSQQHEKYQLDSGVLALGYAYVSNLRVRQLAKPYMDAFARRTNTTVGLTCRDW
LSMIYVENCRPPEATSLRMDAGVRLPLATTAAGRAYLAATPEQEREHLLSALQERHEGDW
SVMRASLEASFEEFRQHGFCLSLGDWDRNVRAAGVPLRLADGGLMALTCGAPSFQLSEET
LRGSLAHELEILARDIESLGA
F5_fig|1402135.21.peg.1764
MDKAFIKGLRLIEALAHSEKPRGVTELAAELGLTKSNVHRLLATLVAQGYVHQDPQYSTY
ALGTKIWELGSHVIRRLDLTKVARPAMERLAALTGETVHLSVLDDMDVVYLDKIESSHHI
RAHTHVGQRAPAYTMATGKAMLARMPDAYLERYHNRFQSFTPTTITTMDQLHRAIEEVRA

So clearly the file isn't empty; it's just not being read by ete3. If anyone has an idea of what this could be, I would really appreciate it. Thanks.

Ahmed

jhcepas commented 4 years ago

Is the sequence file in FASTA format?

>AzeR|Gene
MSDEVEDSSNPKDRKYVEALARGLDVLRAFTHGSVVLGNQEISRITGLPKATVSRMTYTL
TQLGYLCYSQQHEKYQLDSGVLALGYAYVSNLRVRQLAKPYMDAFARRTNTTVGLTCRDW
LSMIYVENCRPPEATSLRMDAGVRLPLATTAAGRAYLAATPEQEREHLLSALQERHEGDW
SVMRASLEASFEEFRQHGFCLSLGDWDRNVRAAGVPLRLADGGLMALTCGAPSFQLSEET
LRGSLAHELEILARDIESLGA
>F5_fig|1402135.21.peg.1764
MDKAFIKGLRLIEALAHSEKPRGVTELAAELGLTKSNVHRLLATLVAQGYVHQDPQYSTY
ALGTKIWELGSHVIRRLDLTKVARPAMERLAALTGETVHLSVLDDMDVVYLDKIESSHHI
RAHTHVGQR
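One way to answer the format question without eyeballing a file of this size is a small standalone check. The sketch below is not part of ete3 and does not reproduce how ete3 parses its input; the function name `check_fasta` is made up for illustration. It only verifies the basics: every record starts with a `>` header, no sequence data precedes the first header, no record is empty, and no header is repeated.

```python
from collections import Counter


def check_fasta(lines):
    """Return (record_count, problems) for an iterable of text lines.

    Hypothetical helper for illustration; it is not an ete3 function.
    """
    problems = []
    names = Counter()
    current = None             # header of the record being read
    residues = 0               # sequence characters seen for that record
    headerless_reported = False

    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if current is not None and residues == 0:
                problems.append("record %r has no sequence" % current)
            current = line[1:].strip()
            names[current] += 1
            residues = 0
        elif current is None:
            # Report only the first offending line to keep the output short.
            if not headerless_reported:
                problems.append(
                    "line %d: sequence data before any '>' header" % lineno
                )
                headerless_reported = True
        else:
            residues += len(line)

    if current is not None and residues == 0:
        problems.append("record %r has no sequence" % current)
    for name, count in names.items():
        if count > 1:
            problems.append("header %r appears %d times" % (name, count))
    return sum(names.values()), problems
```

To use it, pass an open file handle, for example `check_fasta(open(path))`, and print the returned count and problem list. A record count of zero would be consistent with the "0 target sequences" line in the log above, but a clean result here does not rule out other causes.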

Ahmed-Shibl commented 4 years ago

Yes. I even ran it through SeqKit earlier to remove header names.

jhcepas commented 4 years ago

Can you upload the input file?

Ahmed-Shibl commented 4 years ago

Could the problem be that it's massive, as in >100,000 sequences?

Ahmed-Shibl commented 4 years ago

AzeR_ncbinr_alignment_1_200aa_headers_NEW_4.fa.zip

AaronBlare commented 4 years ago

I get a similar error when I try to load ~12,000 sequences. When I reduce the number of sequences to 10,000, everything works.

Is there a limit on the number of sequences that can be loaded? @jhcepas @Ahmed-Shibl Is there a solution to this problem?
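To narrow down whether the failure really depends on the number of sequences, one option is to cut the input down to its first N records and rerun the same build on subsets of different sizes. The sketch below is a plain-Python helper, not an ete3 feature; the name `first_n_records` and the file names in the usage note are hypothetical. Nothing in this thread confirms that a hard limit exists.

```python
def first_n_records(lines, n):
    """Yield the lines that belong to the first n FASTA records.

    Hypothetical helper for illustration; it is not an ete3 function.
    """
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                return
        if seen:
            # Lines before the first '>' header are skipped.
            yield line
```

A possible usage, assuming an input file called `input.fa`: open it for reading, open `subset_10000.fa` for writing, and call `out.writelines(first_n_records(src, 10000))`. Repeating this with different values of N shows roughly where the build starts to fail, which would be useful detail to add to this report.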