AlexanderLabWHOI / EUKulele

Automatic eukaryotic taxonomic classification
MIT License

EUKulele is making a Diamond database even though it already exists #44

Open jolespin opened 2 years ago

jolespin commented 2 years ago

I already downloaded the eukprot database using EUKulele download --database eukprot, and the download includes the diamond database:

(eukulele_env) -bash-4.2$ ls -lh /usr/local/scratch/CORE/jespinoz/db/eukulele/eukprot/
total 14G
drwxr-xr-x 2 jespinoz tigr   36 Mar 10 16:18 diamond
drwxr-xr-x 2 jespinoz tigr    0 Mar 10 19:28 proteins
-rw-r--r-- 1 jespinoz tigr 3.2G Mar 10 21:17 prot-map.json
-rw-r--r-- 1 jespinoz tigr 9.0G Mar 10 21:17 reference.pep.fa
-rw-r--r-- 1 jespinoz tigr 294K Mar 10 21:17 tax-table.txt
(eukulele_env) -bash-4.2$ ls -lh /usr/local/scratch/CORE/jespinoz/db/eukulele/eukprot/diamond/
total 4.3G
-rw-r--r-- 1 jespinoz tigr 3.7G Mar 10 22:06 reference.pep.dmnd

but when I run EUKulele, it wants to recreate the database:

(eukulele_env) -bash-4.2$ EUKulele -m mags --p_ext faa -f --scratch tmp -o eukulele_output --sample_dir S005_R2_POST-PE-N728-S516-1_S23/output/genomes/ --reference_dir /usr/local/scratch/CORE/jespinoz/db/eukulele/eukprot/ --alignment_choice diamond
Running EUKulele with command line arguments, as no valid configuration file was provided.
Setting things up...
Could not successfully install all external dependent software.
Check DIAMOND, BLAST, BUSCO, and TransDecoder installation.
Found database folder for /usr/local/scratch/CORE/jespinoz/db/eukulele/eukprot/ in current directory; will not re-download.
Creating a diamond reference from database files...
Aligning to reference database...
Aligning sample S005_R2_POST-PE-N728-S516-1_S23__METABAT2__E.1__bin.5...
Aligning sample S005_R2_POST-PE-N728-S516-1_S23__METABAT2__E.1__bin.12...
Aligning sample S005_R2_POST-PE-N728-S516-1_S23__METABAT2__E.1__bin.1...
Aligning sample S005_R2_POST-PE-N728-S516-1_S23__METABAT2__E.1__bin.9...
Aligning sample S005_R2_POST-PE-N728-S516-1_S23__METABAT2__E.1__bin.10...
Diamond process exited for sample S005_R2_POST-PE-N728-S516-1_S23__METABAT2__E.1__bin.9.
Diamond process exited for sample S005_R2_POST-PE-N728-S516-1_S23__METABAT2__E.1__bin.12.
Diamond process exited for sample S005_R2_POST-PE-N728-S516-1_S23__METABAT2__E.1__bin.5.
Diamond process exited for sample S005_R2_POST-PE-N728-S516-1_S23__METABAT2__E.1__bin.1.
Diamond process exited for sample S005_R2_POST-PE-N728-S516-1_S23__METABAT2__E.1__bin.10.
Performing taxonomic estimation steps...
Performing taxonomic visualization steps...

How can I avoid this redundancy?
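One way to avoid rebuilding is to check for an existing index before creating one. The sketch below is not EUKulele's actual logic. It assumes the layout shown in the listing above (reference.pep.fa next to diamond/reference.pep.dmnd), and the helper name needs_diamond_build is hypothetical.

```python
from pathlib import Path


def needs_diamond_build(reference_dir, fasta_name="reference.pep.fa",
                        index_name="reference.pep.dmnd"):
    """Return True if the DIAMOND index is missing, empty, or older than the FASTA.

    Hypothetical helper: assumes the index lives in <reference_dir>/diamond/.
    """
    ref = Path(reference_dir)
    fasta = ref / fasta_name
    index = ref / "diamond" / index_name
    # A missing or zero-byte index always needs to be (re)built.
    if not index.is_file() or index.stat().st_size == 0:
        return True
    # Rebuild only if the FASTA was modified after the index was written.
    if fasta.is_file() and fasta.stat().st_mtime > index.stat().st_mtime:
        return True
    return False
```

With the timestamps in the listing above (FASTA written at 21:17, index at 22:06), a check like this would return False and the existing index could be reused.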

Also, a few more things I noticed. It looks like the database folder name is forced into lowercase.

Lastly, it is looking for marmmetsp for some reason:

Traceback (most recent call last):
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-classify_env/bin/EUKulele", line 8, in <module>
    EUKulele.eukulele(string_arguments=' '.join(sys.argv[1:]))
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-classify_env/lib/python3.8/site-packages/EUKulele/EUKulele_config.py", line 32, in eukulele
    EUKulele.EUKulele_main.main(str(string_arguments))
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-classify_env/lib/python3.8/site-packages/EUKulele/EUKulele_main.py", line 242, in main
    ref_fasta, tax_tab, prot_tab = downloadDatabase(args.database.lower(),
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-classify_env/lib/python3.8/site-packages/EUKulele/download_database.py", line 97, in downloadDatabase
    rc1 = createProteinTable(create_protein_table_args)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-classify_env/lib/python3.8/site-packages/scripts/create_protein_table.py", line 84, in createProteinTable
    for record in SeqIO.parse(pepfile, "fasta"):
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-classify_env/lib/python3.8/site-packages/Bio/SeqIO/__init__.py", line 607, in parse
    return iterator_generator(handle)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-classify_env/lib/python3.8/site-packages/Bio/SeqIO/FastaIO.py", line 183, in __init__
    super().__init__(source, mode="t", fmt="Fasta")
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-classify_env/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 47, in __init__
    self.stream = open(source, "r" + mode)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/scratch/core/jespinoz/db/eukulele/marmmetsp/reference.pep.fa'

Because of this, it never creates the final files.
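Note that the path in the FileNotFoundError contains core in lowercase, while the directory listed earlier is CORE; on a case-sensitive filesystem these are different paths. The sketch below only illustrates the difference between lowercasing a whole path and lowercasing just the database name. It is not EUKulele's code, the cause of the lowercased path is not confirmed here, and build_reference_fasta_path is a hypothetical helper.

```python
import posixpath


def build_reference_fasta_path(base_dir, database, fasta_name="reference.pep.fa"):
    """Join a base directory and a database name, lowercasing only the name.

    Hypothetical helper for illustration: directory components keep their case.
    """
    return posixpath.join(base_dir, database.lower(), fasta_name)


base = "/usr/local/scratch/CORE/jespinoz/db/eukulele"

# Only the database name is lowercased; CORE is preserved.
good = build_reference_fasta_path(base, "EukProt")

# Lowercasing the joined path also rewrites CORE to core.
bad = posixpath.join(base, "EukProt", "reference.pep.fa").lower()
```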

akrinos commented 2 years ago

Hi Josh @jolespin,

Thanks so much for using EUKulele!

As for why it's looking for marmmetsp: it thinks that one of your database files is missing, so it falls back to the --database flag (which defaults to marmmetsp) instead of using eukprot. Why that's happening is harder to figure out, since you have the required files in the database folder. Also, yes, the name of the database is forced into lowercase (for download purposes), but if you provide a reference directory (as you have), it shouldn't be.

(This is probably a bug that I should think about how to get around, because if you're trying to use a reference directory and don't specify the --database flag, the software shouldn't just go ahead with the default database).
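A minimal sketch of the stricter behavior described above, assuming the three files in the earlier directory listing are the required ones. The function names and return values are hypothetical and do not reflect EUKulele's internals.

```python
from pathlib import Path

# Assumed required files, taken from the directory listing earlier in this thread.
REQUIRED_FILES = ("reference.pep.fa", "tax-table.txt", "prot-map.json")


def missing_reference_files(reference_dir):
    """Return the assumed-required files that are absent from reference_dir."""
    ref = Path(reference_dir)
    return [name for name in REQUIRED_FILES if not (ref / name).is_file()]


def choose_database(reference_dir=None, database=None, default="marmmetsp"):
    """Decide which database to use without silently falling back.

    Hypothetical logic: returns ("local", path) or ("download", name).
    """
    if reference_dir is not None:
        missing = missing_reference_files(reference_dir)
        if not missing:
            return ("local", str(reference_dir))
        if database is None:
            # Refuse to substitute the default when the user pointed at a directory.
            raise FileNotFoundError(
                f"{reference_dir} is missing {missing}; pass --database explicitly "
                "to download a replacement."
            )
    # The database name is lowercased only for download purposes.
    return ("download", (database or default).lower())
```

The design choice here is that the default database is used only when no reference directory was given at all; an incomplete reference directory produces an error naming the missing files unless --database was passed explicitly.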

What's most confusing, though, is that in your output you get the message:

Found database folder for /usr/local/scratch/CORE/jespinoz/db/eukulele/eukprot/ in current directory; will not re-download.

That message indicates that the downloadDatabase command should never have run. Was the other error message you got in one of the log folders?

Thanks again!

jolespin commented 2 years ago

What's the difference between MARMMETSP and MMETSP?

akrinos commented 2 years ago

MarMMETSP is the new default; it contains all MMETSP references plus the MarRef database.

jolespin commented 2 years ago

Oh nice, that makes sense. Are there both microbes and eukaryotes in MarRef?

akrinos commented 2 years ago

The MarRef database is mainly (if not entirely) prokaryotic. The idea behind including the two was that in many of our tests we had bacterial contamination, and in some cases MMETSP contamination or a poor-quality match would lead to the contaminated sequence being called a eukaryote. See https://mmp2.sfb.uit.no/marref/