AlexanderLabWHOI / EUKulele

Automatic eukaryotic taxonomic classification
MIT License

EUKulele Fails on "Performing taxonomic visualization steps..." #41

Open tuck82er opened 2 years ago

tuck82er commented 2 years ago

Running EUKulele on the HPC results in a failed run, likely at the taxonomic estimation step. Here is the tail of the batch run output:

Diamond process exited for sample 72.
Diamond process exited for sample 10.
Performing taxonomic estimation steps...
Performing taxonomic visualization steps...

The output for each sample, taxest<sample #>.out, gives:

Taxonomic estimation did not complete successfully. Check log file for details.

and tax_vis.out gives:

One of the files, 19.faa, in the sample directory did not complete successfully.

The tax_vis.out output leads me to believe that the real problem is in the estimation step: visualization fails because no sequences were annotated.

Also, all of the taxest<sample #>.err files are empty.

My input parameters are as follows:

jobname: eukulele
mets_or_mags: MAGs # answer METs or MAGs
nucleotide_extension: .fna # .fasta
protein_extension: .faa
scratch: /home/etucker5/miniconda3/envs/S-niv-MAGs/data/scratch/eukulele/
database: phylodb
reference: /home/etucker5/miniconda3/envs/eukulele/phylodb/
output: /home/etucker5/miniconda3/envs/S-niv-MAGs/data/output/eukulele/euk_MAGs/
samples: /home/etucker5/miniconda3/envs/S-niv-MAGs/data/output/prodigal/NA-all-M/euk/protein_bin/  
ref_fasta: reference.pep.fa 

# Path for reference taxonomy table and protein JSON file should be relative to reference entry above.
# You can have the script create these automatically using the input FASTA file(s) for your database and a provided
# original taxonomy table. 
tax_table: taxonomy-table.txt  #../tax-table-formatted.txt #tax-table-phylodb.txt #tax-table-phylodb.txt
protein_map: prot-map.json #../protein-map.json #protein-map-phylodb.json #protein-species-map.json 

cutoff: tax-cutoffs.yaml 
consensus_cutoff: 0.75
alignment_choice: diamond # diamond or blast
choose_parallel: series # parallel or series; whether to run estimate taxonomy in parallel mode (Requires joblib & multiprocessing)

# Options for BUSCO assessment
individual_or_summary: individual
organisms:
    - Chlamydomonas reinhardtii
taxonomy_organisms:
    - genus

and the tax-cutoffs.yaml as follows:

species: 95
genus: 80
family: 65
order: 50
class: 30

I'm unsure why this run is failing or how to diagnose the issue, as I've run out of logs to search (I think). Any thoughts or suggestions would be welcome!
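For anyone else who has run out of logs to check, here is a minimal sketch of a log sweep. It walks a directory, lists every .out, .err, and .log file with its size, and flags lines that contain the failure wording quoted above. The function name, the file patterns, and the idea of pointing it at the output directory are all assumptions for illustration; this is not part of EUKulele.

```python
from pathlib import Path

# Assumed marker text, copied from the messages quoted in this thread.
FAILURE_MARKERS = ("did not complete successfully",)


def summarize_logs(log_dir, patterns=("*.out", "*.err", "*.log")):
    """Return (file name, size in bytes, flagged lines) for each matching log file."""
    report = []
    for pattern in patterns:
        # rglob searches subdirectories too, since the log location is not known here.
        for path in sorted(Path(log_dir).rglob(pattern)):
            text = path.read_text(errors="replace")
            flagged = [
                line.strip()
                for line in text.splitlines()
                if any(marker in line for marker in FAILURE_MARKERS)
            ]
            report.append((path.name, path.stat().st_size, flagged))
    return report


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, size, flagged in summarize_logs(target):
        status = "EMPTY" if size == 0 else f"{size} bytes"
        print(f"{name}: {status}")
        for line in flagged:
            print(f"    {line}")
```

Saved under a hypothetical name such as sweep_logs.py, it could be run as "python sweep_logs.py <output directory>". A non-empty file that the sweep turns up and that has not been read yet is the next place to look.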

nvpatin commented 2 years ago

Jumping in on this thread because I am also running into visualization errors! In my case, the tax_vis.err output is as follows:

Traceback (most recent call last):
  File "/work/hpc/users/nvp29/miniconda3/envs/eukulele/bin/EUKulele", line 4, in <module>
    __import__('pkg_resources').run_script('EUKulele==1.0.1', 'EUKulele')
  File "/work/hpc/users/nvp29/miniconda3/envs/eukulele/lib/python3.6/site-packages/pkg_resources/__init__.py", line 651, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/work/hpc/users/nvp29/miniconda3/envs/eukulele/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1448, in run_script
    exec(code, namespace, namespace)
  File "/work/hpc/users/nvp29/miniconda3/envs/eukulele/lib/python3.6/site-packages/EUKulele-1.0.1-py3.6.egg-info/scripts/EUKulele", line 8, in <module>
    EUKulele.eukulele(string_arguments=' '.join(sys.argv[1:]))
  File "/work/hpc/users/nvp29/miniconda3/envs/eukulele/lib/python3.6/site-packages/EUKulele/EUKulele_config.py", line 32, in eukulele
    EUKulele.EUKulele_main.main(str(string_arguments))
  File "/work/hpc/users/nvp29/miniconda3/envs/eukulele/lib/python3.6/site-packages/EUKulele/EUKulele_main.py", line 313, in main
    level_hierarchy = levels_file)
  File "/work/hpc/users/nvp29/miniconda3/envs/eukulele/lib/python3.6/site-packages/EUKulele/manage_steps.py", line 88, in manageEukulele
    use_salmon_counts, rerun_rules, level_hierarchy)
  File "/work/hpc/users/nvp29/miniconda3/envs/eukulele/lib/python3.6/site-packages/EUKulele/manage_steps.py", line 643, in manageTaxVisualization
    use_salmon_counts, rerun_rules, level_hierarchy)
  File "/work/hpc/users/nvp29/miniconda3/envs/eukulele/lib/python3.6/site-packages/EUKulele/visualize_results.py", line 306, in visualize_all_results
    curr_df_summed = curr_df_start.groupby("Sample")["NumTranscripts"].agg(AllCts='sum')
TypeError: aggregate() missing 1 required positional argument: 'func_or_funcs'

I am running EUKulele in a conda environment on an HPC cluster, using MAG protein-coding genes (.faa extension) and a pre-downloaded MMETSP database, with the following command:

EUKulele -m mags -s MAG-SCG-faas --reference_dir /work/hpc/users/nvp29/databases/mmetsp --CPUs 20

The program seems to be looking for count data, which doesn't exist because the input is MAG .faa files.
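A possible reading of this TypeError, offered as a guess rather than anything confirmed in this thread: the keyword form agg(AllCts='sum') is pandas "named aggregation", which was added in pandas 0.25.0. On older pandas the same call fails with exactly this "missing 1 required positional argument: 'func_or_funcs'" message, so the installed pandas version may matter more than the missing count data. The sketch below checks a version string and shows a spelling of the same aggregation that does not rely on named aggregation. All function names are hypothetical and none of this is EUKulele code.

```python
def parse_version(version):
    """Turn '0.24.2' into (0, 24, 2); non-numeric suffixes such as 'rc0' are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def supports_named_aggregation(pandas_version):
    """Named aggregation, e.g. agg(AllCts='sum'), was added in pandas 0.25.0."""
    return parse_version(pandas_version) >= (0, 25, 0)


def sum_counts_per_sample(df):
    """Same result as df.groupby("Sample")["NumTranscripts"].agg(AllCts='sum'),
    written without named aggregation. Expects a pandas DataFrame with
    'Sample' and 'NumTranscripts' columns."""
    return df.groupby("Sample")["NumTranscripts"].sum().to_frame("AllCts")
```

To check the environment, print pandas.__version__ from a Python session inside the same conda environment and pass the string to supports_named_aggregation. If it returns False, upgrading pandas in that environment is the first thing I would try.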

akrinos commented 2 years ago

Hi both, and thanks @nvpatin for bumping this thread, since I missed responding before! @nvpatin, could you give me the output of 'EUKulele --version' on your system? I dealt with a similar error recently. @tuck82er, what are the approximate sizes of your files? Thanks again!

alephreish commented 1 year ago

Hey, I'm experiencing the same issue as @tuck82er: diamond finishes successfully, but no output is produced, and the only message is a single line in tax_vis.out: "One of the files, final.contigs.fa, in the sample directory did not complete successfully." The outcome is the same for phylodb and eukprot.

The (metatranscriptomic) assembly contains 1,985,223 sequences with a total length of 1,208,756,176 nt, which is 1.3 GB of non-interleaved FASTA on disk. The FASTA headers are of the form '>k141_668320 flag=1 multi=4.0000 len=304'.

I have a total of 250 GB of RAM on the server and run EUKulele with --CPUs 20 or 40.

I do manage to get the results for smaller subsets of the input file.
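For reference, a minimal sketch of one way to produce such subsets: it splits a FASTA file into consecutive chunks with a fixed number of records each. The function name, the chunk file naming, and the default chunk size are arbitrary choices for illustration, and whether per-chunk results can later be combined in a meaningful way is not something this sketch addresses.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir, records_per_chunk=100_000):
    """Split a FASTA file into consecutive chunks; return the chunk paths in order."""
    fasta_path = Path(fasta_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    chunk_paths = []
    out = None
    n_records = 0
    with fasta_path.open() as handle:
        for line in handle:
            if line.startswith(">"):
                # Start a new chunk file every records_per_chunk headers.
                if n_records % records_per_chunk == 0:
                    if out is not None:
                        out.close()
                    chunk_name = (
                        f"{fasta_path.stem}.part{len(chunk_paths) + 1:04d}"
                        f"{fasta_path.suffix}"
                    )
                    chunk_paths.append(out_dir / chunk_name)
                    out = chunk_paths[-1].open("w")
                n_records += 1
            # Lines that precede the first header are skipped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Because every sequence line is written to the same chunk as its header, this works for both interleaved and non-interleaved FASTA.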

I use EUKulele v. 2.0.0 from PyPI (installation via conda proved to be difficult, if not impossible, with strict repo priority).

akrinos commented 1 year ago

@alephreish I will get back to your concerns on memory usage very soon! As for installation: we have found that it is a lot easier with mamba. Have you used mamba before?

alephreish commented 1 year ago

@akrinos Yes, I'm using mamba here as well. I noticed inconsistent behavior between different conda installations, so I suspect that it might be a problem on my side - will post an update if I find a solution.

akrinos commented 1 year ago

Hi @alephreish, that's too bad, sorry for telling you something you already know! Looking forward to hearing more about what you find and digging more into the other performance issues soon. Thanks for trying the tool!

alephreish commented 1 year ago

@akrinos The problem with the input size resolved itself after switching to eukulele v. 2.0.3 and diamond v. 0.9.24, not sure what it was.