AlexanderLabWHOI / EUKulele

Automatic eukaryotic taxonomic classification
MIT License

EUKulele not recognizing downloaded db? #46

Closed nvpatin closed 2 years ago

nvpatin commented 2 years ago

I am trying to run EUKulele on an HPC using a batch script. The compute nodes don't have internet connectivity, so I downloaded the PhyloDB database to the following location: /work/nvp29/databases/phylodb. The PhyloDB files (unzipped) are phylodb_1.076.annotations.txt, phylodb_1.076.pep.fa, and phylodb_1.076.taxonomy.txt.

Similar to another user, EUKulele is not recognizing the downloaded database and is trying to download the database.

I installed EUKulele in a conda environment, and in my batch script I activate the environment then run EUKulele --config /work/nvp29/databases/phylodb/config-hybrid.yaml. The config file is attached here.

This job gives me the following errors in the log .out file:

All reference files for PhyloDB downloaded to /work/nvp29/sbatch-scripts/phylodb
Running EUKulele with entries from the provided configuration file.
No BUSCO file specified/found; using argument-specified organisms and taxonomy for BUSCO analysis.
Setting things up...
Could not successfully install all external dependent software. Check DIAMOND, BLAST, BUSCO, and TransDecoder installation.
['1903c119_11m_orfs', '1903c117_50m_orfs', '1903c124_15m_orfs', '1903c111_10m_orfs', '1903c144_13m_orfs', '1903c126_45m_orfs', '1903c122$
Specified reference directory, reference FASTA, and protein map/taxonomy table not found. Using database in location: ./phylodb.
Automatically downloading database phylodb .
If you intended to use an existing database folder, be sure a reference FASTA, protein map, and taxonomy table are provided. Check the documentation for details.

And the following errors in the log .err file:

wget: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
gzip: ./phylodb/reference.pep.fa.gz: No such file or directory
wget: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
gzip: ./phylodb/taxonomy-table.txt.gz: No such file or directory

It seems like EUKulele is still trying to automatically download the PhyloDB database and failing because it has no internet connectivity. However, in the config file I provide the database directory and the reference fasta file. Is there something wrong in the format of the config file?

Additionally, there might be issues with the DIAMOND, BLAST, BUSCO, and TransDecoder software. Are these not all included in the conda installation? I ran 'conda update eukulele' to get the most recent version.
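For readers hitting the same "Could not successfully install all external dependent software" message, here is a minimal sketch of a check that can be run inside the activated conda environment. It only reports which executables are visible on PATH; the executable names in `TOOLS` are assumptions about a typical install, not something EUKulele itself defines, so adjust them to match your environment.

```python
import shutil

# Assumed executable names for the tools named in the log message;
# these may differ between installations.
TOOLS = ["diamond", "blastp", "busco", "TransDecoder.LongOrfs"]


def missing_tools(tools=TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

A tool being on PATH does not guarantee it runs (the BUSCO import problem mentioned later in this thread is an example), so this is only a first-pass check.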

config-hybrid.yaml.txt

nvpatin commented 2 years ago

Update: I had to install BUSCO and transdecoder separately despite running the conda install for EUKulele. That issue has been resolved; however, I still can't get EUKulele to recognize the pre-downloaded phylodb. I tried using the command without a configuration file as follows and got the below error output:

EUKulele -m mets --sample_dir /work/nvp29/Lasker_2019/PacBio/07.hybridSPAdes/prodigal/faas --scratch tmp -o eukulele_output-hybrid --reference_dir /work/nvp29/databases/phylodb/ --ref_fasta phylodb_1.076.pep.fa --alignment_choice diamond --tax_table phylodb_1.076.taxonomy.txt

All reference files for MarRef-MMETSP downloaded to /work/nvp29/databases/phylodb/marmmetsp
Running EUKulele with command line arguments, as no valid configuration file was provided.
Setting things up...
['Las19c107_10m_orfs', '1903c124_15m_orfs', '1903c117_50m_orfs', '1903c111_10m_orfs', 'Las19c135_5m_orfs', '1903c126_45m_orfs', '1903c144_13m_orfs', '1903c119_11m_orfs', '1903c118_23m_orfs', '1903c129_26m_orfs', '1903c123_10m_orfs', '1903c122_28m_orfs', 'Las19c138_27m-1_orfs', '1903c138_7m_orfs']
Specified reference directory, reference FASTA, and protein map/taxonomy table not found. Using database in location: /work/nvp29/databases/phylodb/marmmetsp.
Automatically downloading database marmmetsp .
If you intended to use an existing database folder, be sure a reference FASTA, protein map, and taxonomy table are provided. Check the documentation for details.

akrinos commented 2 years ago

Hi @nvpatin ! Thanks for trying out EUKulele, and for the updates. Just to make totally sure, could you run EUKulele --version so I can make sure we're both looking at the same code? It looks like what's happening is that it's looking in the reference directory for the database called marmmetsp (as in looking for the default)...so also a quick check is:

EUKulele -m mets --sample_dir /work/nvp29/Lasker_2019/PacBio/07.hybridSPAdes/prodigal/faas --scratch tmp -o eukulele_output-hybrid --reference_dir /work/nvp29/databases --database phylodb --ref_fasta phylodb_1.076.pep.fa --alignment_choice diamond --tax_table phylodb_1.076.taxonomy.txt

Based on the output it's giving you, this is to see whether it works if you specify PhyloDB as the database and go up a level for the reference directory. But we'll check the version first, since the way you have it written should be no problem!

nvpatin commented 2 years ago

Thank you for the quick response! The current version of EUKulele is 2.0.3.

akrinos commented 2 years ago

Hi @nvpatin; I just realized that you don't have a protein map file specified for the run? I stepped through the code and then checked back to the files you originally listed. The protein map file is specific to EUKulele. There are two options for getting that generated. You could run the create_protein_table.py script also packaged within EUKulele, which can be invoked with:

create_protein_table.py \
    --infile_peptide <path/to/reference.pep.fa> \
    --infile_taxonomy <path/to/tax-table.txt> \
    --outfile_json <path/to/new/prot-map.json> \
    --output <path/to/new/taxonomy-table.txt> \
    --delim "\t" --strain_col_id strain_name --taxonomy_col_id taxonomy --column 2

The new protein map and new taxonomy table files are written to the paths you specify there; the above format should work okay for PhyloDB. Or, you could download PhyloDB within EUKulele by specifying that as the database keyword. Let me know if that's not it and you already have a protein map!

nvpatin commented 2 years ago

Hi @akrinos,

Ah, I definitely missed this part of the work flow. I generated the protein map file as you suggested with the above command, but I had to change the '--strain_col_id' parameter to '--col_source_id' before it worked.

Then I tried the below command with the new files as inputs, but still got an error saying the database wasn't found.

EUKulele -m mets --sample_dir /work/nvp29/Lasker_2019/PacBio/07.hybridSPAdes/prodigal/faas --scratch tmp -o eukulele_output-hybrid --reference_dir /work/nvp29/databases --database phylodb --ref_fasta phylodb_1.076.pep.fa --alignment_choice diamond --tax_table tax-table-phylodb.txt --protein_map protein-map-phylodb.json

All reference files for PhyloDB downloaded to /work/nvp29/sbatch-scripts/phylodb
Running EUKulele with command line arguments, as no valid configuration file was provided.
Setting things up...
['1903c123_10m_orfs', 'Las19c135_5m_orfs', '1903c117_50m_orfs', '1903c124_15m_orfs', '1903c127_7m_orfs', '1903c111_10m_orfs', '1903c122_28m_orfs', '1903c144_13m_orfs', 'Las19c138_27m-1_orfs', '1903c118_23m_orfs', 'Las19c107_10m_orfs', ...]
Specified reference directory, reference FASTA, and protein map/taxonomy table not found. Using database in location: /work/nvp29/databases/phylodb.
Automatically downloading database phylodb .
If you intended to use an existing database folder, be sure a reference FASTA, protein map, and taxonomy table are provided. Check the documentation for details.

I also tried this with the config file after adding in the new table names, but got basically the same error. :(

nvpatin commented 2 years ago

Hi @akrinos,

I got EUKulele to work with the pre-downloaded PhyloDB using the following command:

EUKulele -m mets --sample_dir /work/nvp29/Lasker_2019/PacBio/07.hybridSPAdes/prodigal/faas --scratch tmp -o /work/nvp29/Lasker_2019/PacBio/09.EUKulele/eukulele_output-hybrid --reference_dir /work/nvp29/databases/phylodb --ref_fasta phylodb_1.076.pep.fa --alignment_choice diamond --tax_table tax-table-phylodb.txt --protein_map protein-map-phylodb.json

It seems like the --database parameter should not be used unless you want the program to download the database of choice.
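A minimal pre-flight sketch for anyone reproducing this setup: it checks that the three files named in the "not found" message actually sit inside the reference directory before the job is submitted. That the file names are resolved relative to `--reference_dir` is inferred from the working command above, not from EUKulele's source, and `missing_reference_files` is a hypothetical helper, not part of EUKulele.

```python
from pathlib import Path


def missing_reference_files(reference_dir, ref_fasta, tax_table, protein_map):
    """Return the given file names that do not exist inside reference_dir."""
    ref = Path(reference_dir)
    return [name for name in (ref_fasta, tax_table, protein_map)
            if not (ref / name).is_file()]


if __name__ == "__main__":
    # Paths and file names taken from the working command in this thread.
    missing = missing_reference_files(
        "/work/nvp29/databases/phylodb",
        "phylodb_1.076.pep.fa",
        "tax-table-phylodb.txt",
        "protein-map-phylodb.json",
    )
    print("Missing:", missing if missing else "none")
```

If anything is listed as missing, the automatic download is likely to be attempted again, which fails on nodes without internet access.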

I did get an error saying one of my amino acid files wasn't valid, but it looks like I still got a good DIAMOND output file for it. Here are the first few lines of that file, which looks the same as all of the other .faa files I provided:

(base) nvp29@Shadow-login-1:/work/nvp29/Lasker_2019/PacBio/07.hybridSPAdes/prodigal/faas$ head Las19c107_10m_orfs.faa

NODE_1_length_30165_cov_19.413584_1 # 3 # 188 # -1 # ID=1_1;partial=10;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.473
MNDGEPLSITVLTFGPLAEKLGWKRKNYSVRQHASVSEVVESIGLTSIQQKGLLFAVNGL
QC
NODE_1_length_30165_cov_19.413584_2 # 240 # 935 # 1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.490
MGPVMKATGVWKIYPSGESTVQAVRGVDVSIDAGEMVAIMGASGCGKTTLLNILSGIDEP
NSGDVHVNGEPLFGISDNKRTRMRAEYLGFIFQDFNLLPVLSAVENVELPLLLLGKSASE
ARKGALEALQSVGLSQRSEHLPSELSGGQQQRVAVARALVHNPTVILCDEPTGNLDSVTS
AEVLELLHKLNRERNTTFLIVTHDAMIAKRCTRTLQMLDGTIVEDRRNEEE*
NODE_1_length_30165_cov_19.413584_3 # 950 # 5542 # 1 # ID=1_3;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.478
MLSLFVMVGLLSAVSTFGIGRWKQLGALRVLLLIVGPIIDAYIAYYFLHWISVSGATLWA

More importantly, the 'taxonomy_estimation' folder is empty, and the output log files all say "Taxonomic estimation did not complete successfully. Check log file for details." The "tax_est" .err files are empty. Where should I look to find the details of the failure?

Thanks, Nastassia

The full batch script log file:

Running EUKulele with command line arguments, as no valid configuration file was provided.
Setting things up...
['1903c127_7m_orfs', '1903c118_23m_orfs', '1903c144_13m_orfs', '1903c123_10m_orfs', '1903c119_11m_orfs', '1903c111_10m_orfs', '1903c122_28m_orfs', '1903c129_26m_orfs', 'Las19c138_27m-1_orfs', ...]
Found database folder for /work/nvp29/databases/phylodb in current directory; will not re-download.
Creating a diamond reference from database files...
Aligning to reference database...
Peptide extension used, but this file, /work/nvp29/Lasker_2019/PacBio/07.hybridSPAdes/prodigal/faas/Las19c107_10m_orfs.faa, does not appear to be a peptide file.
Aligning sample 1903c127_7m_orfs...
Aligning sample 1903c118_23m_orfs...
Diamond process exited for sample 1903c118_23m_orfs.
Aligning sample 1903c144_13m_orfs...
Diamond process exited for sample 1903c127_7m_orfs.
Aligning sample 1903c123_10m_orfs...
Diamond process exited for sample 1903c144_13m_orfs.
Aligning sample 1903c119_11m_orfs...
Diamond process exited for sample 1903c123_10m_orfs.
Aligning sample 1903c111_10m_orfs...
Diamond process exited for sample 1903c119_11m_orfs.
Aligning sample 1903c122_28m_orfs...
Diamond process exited for sample 1903c111_10m_orfs.
Aligning sample 1903c129_26m_orfs...
Diamond process exited for sample 1903c122_28m_orfs.
Aligning sample Las19c138_27m-1_orfs...
Diamond process exited for sample 1903c129_26m_orfs.
Aligning sample 1903c117_50m_orfs...
Diamond process exited for sample 1903c117_50m_orfs.
Aligning sample Las19c107_10m_orfs...
Diamond process exited for sample Las19c138_27m-1_orfs.
Aligning sample Las19c135_5m_orfs...
Diamond process exited for sample Las19c107_10m_orfs.
Aligning sample 1903c124_15m_orfs...
Diamond process exited for sample Las19c135_5m_orfs.
Aligning sample 1903c126_45m_orfs...
Diamond process exited for sample 1903c124_15m_orfs.
Diamond process exited for sample 1903c126_45m_orfs.
Performing taxonomic estimation steps...
Performing taxonomic visualization steps...

akrinos commented 2 years ago

Hi @nvpatin thanks for the follow-up! Yes, --database will create a subfolder within the database folder if specified; apologies for not getting back to you sooner on that!

Logging for taxonomy estimation currently works for me locally but not in the version on bioconda; if we aren't able to troubleshoot this, I can deploy a new version, and hopefully it will give you a more helpful error message. The reason the proper error message is sometimes not returned is a problem with message relay inside the parallelism.

The first things I would check with a taxonomy estimation error are memory consumption and problems with the protein map and taxonomy table. Since it looks like there was a return to the main method, memory issues seem unlikely. So, I would check that your tax table is in tab-separated format and has the Source_ID column you expect given the protein map. If you could copy a piece of the JSON protein map file here, I could also troubleshoot from that for PhyloDB.

If it's none of these things, we can try to figure out why the error reporting isn't working!
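The two table checks suggested above can be sketched in a few lines of Python. This is not EUKulele code: `check_tables` is a hypothetical helper, and the assumption that the values in the protein map JSON should appear in the Source_ID column is my reading of "the Source_ID column you expect given the protein map", so treat a mismatch report as a hint rather than a verdict.

```python
import csv
import json


def check_tables(tax_table_path, protein_map_path, id_column="Source_ID"):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    with open(tax_table_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        if len(header) < 2:
            problems.append("header has a single field; table may not be tab-separated")
        if id_column not in header:
            problems.append(f"no {id_column} column in header: {header}")
            return problems
        source_ids = {row[id_column] for row in reader}
    with open(protein_map_path) as handle:
        protein_map = json.load(handle)
    # Assumption: each protein-map value names an entry in the ID column.
    unmatched = set(protein_map.values()) - source_ids
    if unmatched:
        problems.append(
            f"{len(unmatched)} protein-map values are absent from {id_column}"
        )
    return problems
```

Calling `check_tables("tax-table-phylodb.txt", "protein-map-phylodb.json")` from the reference directory would print nothing alarming only if the header parses as tab-separated and contains the ID column.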

nvpatin commented 2 years ago

Here is the first part of the protein map JSON file: {"Aazo_0002-NC_014248": "'Nostoc azollae' 0708", "Aazo_0003-NC_014248": "'Nostoc azollae' 0708", "Aazo_0004-NC_014248": "'Nostoc azollae' 0708", ....

And here are the first few lines of the taxonomy table, which is tab-separated. I don't see a Source_ID column in there so maybe it's not formatted correctly? [Screenshot: first lines of the taxonomy table, 2022-08-03]

akrinos commented 2 years ago

Hi @nvpatin, sorry for the delay! Yes, it looks like the problem here is twofold: the taxonomic labels in the tax table should be split into separate columns rather than ";"-separated, and the Source_ID column needed for parsing is missing. When you ran the script for creating the protein map, was another taxonomy table generated with that? Otherwise I am happy to just send over the PhyloDB tax table, which should be complete and work with everything else you're using, such that you don't need to reformat. Sorry for all the headaches!
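To illustrate what "split rather than ';'-separated" means, here is a small sketch that turns one semicolon-joined label into one value per taxonomic level. The level names and the example label are hypothetical placeholders chosen for illustration; the column names EUKulele actually expects should be taken from its documentation or from a known-good taxonomy table.

```python
def split_taxonomy(label, levels, sep=";"):
    """Map a sep-joined taxonomy label onto named levels, padding with ''."""
    parts = [part.strip() for part in label.split(sep)]
    # Truncate extra ranks and pad missing ones so every level has a value.
    parts = parts[:len(levels)] + [""] * (len(levels) - len(parts))
    return dict(zip(levels, parts))


if __name__ == "__main__":
    # Hypothetical level names and label, for illustration only.
    levels = ["Supergroup", "Division", "Class", "Order"]
    print(split_taxonomy("Eukaryota; Stramenopiles; Bacillariophyta", levels))
```

Each resulting dictionary would become one row of the reformatted table, written out tab-separated alongside the Source_ID column.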

nvpatin commented 2 years ago

This was the only taxonomy table generated with create_protein_table.py. PhyloDB came with its own taxonomy table (phylodb_1.076.taxonomy.txt) and those first few lines are below; I don't see a Source_ID column in there either, though. [Screenshot: first lines of phylodb_1.076.taxonomy.txt, 2022-08-05]

akrinos commented 2 years ago

Hi @nvpatin - I'll have to check on the PhyloDB download for EUKulele! In the meantime, perhaps you can try the attached taxonomy table?

tax-table-phylodb.txt

nvpatin commented 2 years ago

I ran EUKulele with the tax table you provided and unfortunately I got all the same errors as I originally mentioned. I attached the log files here in case they are helpful. log.zip

akrinos commented 2 years ago

Hi @nvpatin - so sorry, I did this same thing a few days ago for someone else - I sent you the comma-separated table instead of tab-separated - could you try this one?

tax-table-phylodb-tab.txt
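If you only have a comma-separated copy of a taxonomy table, a conversion along these lines should produce a tab-separated one. It is a generic sketch using Python's standard `csv` module, with `csv_to_tsv` as a hypothetical helper name; it assumes no field contains a literal tab character.

```python
import csv


def csv_to_tsv(csv_path, tsv_path):
    """Rewrite a comma-separated file as a tab-separated one."""
    with open(csv_path, newline="") as src, \
            open(tsv_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in csv.reader(src):
            writer.writerow(row)
```

Reading with `csv.reader` rather than replacing commas directly keeps quoted fields that contain commas (for example, a strain description) intact as single columns.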

nvpatin commented 2 years ago

Thank you @akrinos, this solved the taxonomic estimation and visualization problem!

Now it seems that BUSCO did not successfully run. When I checked the version in the EUKulele conda environment, I saw the following: "There was a problem installing BUSCO or importing one of its dependencies. See the user guide and the GitLab issue board (https://gitlab.com/ezlab/busco/issues) if you need further assistance."

I'll dig into this and let you know what I find, hopefully it's an easy fix. Thanks so much for getting me through these first few hurdles! I can open another issue regarding BUSCO if necessary.

akrinos commented 2 years ago

Hi @nvpatin - sorry I totally missed this! For the BUSCO issues, that's kind of a whole other set of problems people run into, so I'm going to close this thread and you are welcome to open another if needed. Also do note that we now have a --no_busco flag if you aren't actually looking for the BUSCO results. Essentially, the BUSCO step just looks at completeness at different taxonomic levels, so if you have taxonomy estimation results, those will suffice for taxonomic annotation/LCA. Sorry again for the late response!