AlexanderLabWHOI / EUKulele

Automatic eukaryotic taxonomic classification
MIT License

Can you give EUKulele a custom diamond database? (e.g. RefSeq nr) #38

Closed jolespin closed 1 year ago

jolespin commented 3 years ago

I'm getting all of my ducks in a row for when I'm able to revisit a project that has a lot of eukaryotes.

I had a few questions regarding databases:

  1. Is it possible to use a custom diamond database with EUKulele? For example, RefSeq's nr or a prebuilt PhyloDB?
  2. Does EUKulele create a database for every run? For example, if you ran 100 samples (and didn't combine the MAGs into a single directory), would it create a MMETSP/PHYLODB/eukprot database in each scratch directory specified?
  3. Is there a way to know which database version is downloaded?
  4. How do you download a database? I tried this: https://eukulele.readthedocs.io/en/latest/databaseandconfig.html?highlight=download but got the following error:
    (eukulele_env) -bash-4.2$ EUKulele setup --database eukprot
    Running EUKulele with command line arguments, as no valid configuration file was provided.
    METs or MAGs argument (-m/--mets_or_mags) is required with one of 'mets' or 'mags'.
    (eukulele_env) -bash-4.2$ EUKulele --version
    Running EUKulele with command line arguments, as no valid configuration file was provided.
    The current EUKulele version is 1.0.6

akrinos commented 3 years ago

Hi @jolespin ! Thanks so much for trying out EUKulele! I'll try to answer your questions in order:

  1. Definitely possible to use a custom database, but it might be difficult to coerce the software (a few extra steps) to use a DIAMOND database rather than the protein sequences and the JSON file and taxonomy table that we require for database setup. If the DIAMOND step has already been completed, you would just have to sort of trick EUKulele into thinking that the steps up to database DIAMOND run had finished. If you are starting from a DIAMOND file rather than wanting to start from constituent sequences, you would also still need the taxonomy table for labels and the JSON file for the link back to the IDs in the DIAMOND file to the taxonomy table. But it would be super interesting to know if that's a use case I should add!
  2. The idea with EUKulele is that the database is downloaded into the folder you execute the program from, such that the same database will be used for all runs executed from that folder, no matter where the output is redirected to. You can also explicitly specify the location of the database if it is named non-standardly (according to how the default databases are downloaded) or it is in a different location than where you ran the software from.
  3. Each EUKulele run returns a README file in the output directory that tells you the link to where the database was downloaded from and when. Right now we just have a single version of all the databases, but if multiple versions are added/new databases replace the old one, we'll have to also assign our own number and list it!
  4. For the database download, you can just specify -m mets (or mags) to get it to download - that's a bug where I still have that flag required even if you're not running samples, but it has no bearing on the actual download. If you just run your samples directly, the database will also be auto-downloaded.

Hopefully this helps to start to clarify things! Please let me know if anything is still confusing or there are other questions.

jolespin commented 3 years ago

Thanks for getting back to me so quickly!

> Definitely possible to use a custom database, but it might be difficult to coerce the software (a few extra steps) to use a DIAMOND database rather than the protein sequences and the JSON file and taxonomy table that we require for database setup. If the DIAMOND step has already been completed, you would just have to sort of trick EUKulele into thinking that the steps up to database DIAMOND run had finished. If you are starting from a DIAMOND file rather than wanting to start from constituent sequences, you would also still need the taxonomy table for labels and the JSON file for the link back to the IDs in the DIAMOND file to the taxonomy table. But it would be super interesting to know if that's a use case I should add!

This would be extremely useful, especially for datasets that are not marine, such as human gut, oral, or random built-environment surfaces. Luckily, the dataset I need this for is marine, so I think the default databases should be fine, but it would be cool if we could use a custom database or NR for this as well.

It looks like the limiting factor is the JSON and taxonomy files. If you ever have time and want to implement this feature, it could be useful to have a short tutorial on the structure/assumptions of the taxonomy and JSON files. Maybe with a helper script to create the JSON from the taxonomy file?

I might be able to help with this (Python parts not Bash, I'm not very good at bash) in the near-ish future (working on a few papers right now).

> The idea with EUKulele is that the database is downloaded into the folder you execute the program from, such that the same database will be used for all runs executed from that folder, no matter where the output is redirected to. You can also explicitly specify the location of the database if it is named non-standardly (according to how the default databases are downloaded) or it is in a different location than where you ran the software from.

This makes sense for single use cases, but I think it becomes redundant at scale with multiple users. Not to mention, some ISPs have a download quota and start charging you if you go over it (this happened to me when working remotely and downloading/uploading large FASTQ files to/from my work server).

IMO the ideal scenario for doing this consistently and at scale would be for the database to already be downloaded and compiled somewhere, so that different people at the institute can all use it. I usually have a central location for all my databases b/c most programs like Kraken2, GTDB-Tk, CheckM, CheckV, VirSorter, KofamScan, MetaPhlAn2, etc. just have an argument (or an environment variable) that asks where the databases are located.

Is there any way it can support this in the near-ish future? I'm at the J. Craig Venter Institute trying to establish some eukaryotic pipelines for a few labs, and EUKulele would certainly be in the lineup.

Apologies if I'm misinterpreting what you said here and it actually is a feature! If so, do you have any docs or examples on how I could use this feature?

Glad to help where I can.

akrinos commented 3 years ago

Hi @jolespin ! There is a helper script available here for generating the JSON and taxonomy files in the required format. What is sort of non-negotiable for each database is the FASTA file containing the peptide sequences, plus some kind of file that contains information on the taxonomy of each sequence. For the current databases we're using, this is usually how it's done: the proteins have some ID, and then the taxonomy has to be stored elsewhere (i.e., it isn't in the FASTA header). If this is often unavailable for the databases you mention, let me know and we can see how to parse the information from somewhere external or from within the FASTA headers.
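
As a rough illustration of the two pieces described above (protein IDs in the FASTA, taxonomy kept elsewhere), the sketch below pulls IDs from FASTA headers and pairs them with a separately stored lookup. It is not the helper script mentioned above: the function names, the `protein ID -> source ID` lookup, and the shape of the resulting JSON are all assumptions made for the example, so the EUKulele documentation remains the place to check the format that is actually expected.

```python
import json


def fasta_ids(fasta_text):
    """Return the ID (first whitespace-delimited token) of each FASTA header."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.append(tokens[0])
    return ids


def build_protein_map(fasta_text, id_to_source):
    """Map each protein ID in the FASTA to its source organism ID.

    id_to_source is a hypothetical lookup (protein ID -> source ID) loaded
    from wherever the taxonomy information lives outside the FASTA.
    Proteins without an entry are reported rather than silently dropped.
    """
    mapping, missing = {}, []
    for pid in fasta_ids(fasta_text):
        if pid in id_to_source:
            mapping[pid] = id_to_source[pid]
        else:
            missing.append(pid)
    return mapping, missing


# Made-up example data, purely for illustration.
example_fasta = ">prot1 some description\nMKV\n>prot2\nMLL\n>prot3\nMAA\n"
example_lookup = {"prot1": "orgA", "prot2": "orgB"}

protein_map, unmapped = build_protein_map(example_fasta, example_lookup)
print(json.dumps(protein_map, indent=2))
print("unmapped:", unmapped)
```

Reporting unmapped proteins separately makes it easier to spot a database whose taxonomy file does not cover every sequence before any alignment is run.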

For the database download, it would be totally possible to just store the EUKulele database in a shared folder somewhere where everyone could use it. To prevent people writing into the folder, that shared folder would contain the original database entries as well as the DIAMOND file. The idea would be to use the --reference_dir flag listed here, and then EUKulele would look in that folder for a suitable reference database and DIAMOND file before downloading anything else. It might be useful to have another parameter that specifies that the directory should explicitly be locked, though, or that the program should not continue with downloading if a reference is not found in the specified location!
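
The lookup order described here, including the suggested "locked" behaviour, could look roughly like the following. This is a sketch under stated assumptions and not EUKulele's code: `resolve_reference`, the `locked` parameter, and the file names used in the demonstration are hypothetical stand-ins for whatever a given database actually contains.

```python
from pathlib import Path
import tempfile


class ReferenceNotFound(Exception):
    """Raised when a locked reference directory lacks the expected files."""


def resolve_reference(reference_dir, required_files, locked=False):
    """Decide whether a shared reference directory can be used as-is.

    Returns "use_existing" when every required file is present,
    "download" when something is missing and downloading is allowed,
    and raises ReferenceNotFound when something is missing but the
    directory is locked (a hypothetical option, not an EUKulele flag).
    """
    ref = Path(reference_dir)
    missing = [name for name in required_files if not (ref / name).is_file()]
    if not missing:
        return "use_existing"
    if locked:
        raise ReferenceNotFound(f"{ref} is missing: {', '.join(missing)}")
    return "download"


# Demonstration with a throwaway directory and made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    needed = ["reference.pep.fa", "reference.dmnd"]
    first = resolve_reference(tmp, needed)  # directory is still empty
    for name in needed:
        (Path(tmp) / name).touch()
    second = resolve_reference(tmp, needed, locked=True)  # now complete
    print(first, second)
```

Raising an error in the locked case, instead of falling back to a download, is what would keep a shared read-only directory from triggering repeated downloads by individual users.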

I am happy to implement additional features to make this easier or more straightforward! For the databases you're using for those other software, do you have a taxonomy listing and a DIAMOND file stored centrally, then? Thanks!

jolespin commented 1 year ago

Apologies, this must have slipped through the cracks. IIRC, the issue I came across when trying to implement this was that my taxonomy goes from class down to species (no supergroup field). It could also be useful to have MMseqs2 as an alternative to DIAMOND, since this database could then also be used with MetaEuk for gene calls.
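
One possible workaround for a taxonomy that starts at class is to pad the missing higher ranks with a placeholder before building the taxonomy table. The sketch below assumes a rank order of supergroup, division, class, order, family, genus, species and a placeholder of `Unassigned`; both are assumptions for illustration, the lineage values are made up, and whether EUKulele's downstream steps accept placeholder values would need checking.

```python
# Assumed rank order; adjust to whatever the target tool expects.
RANKS = ["supergroup", "division", "class", "order", "family", "genus", "species"]


def pad_lineage(lineage, ranks=RANKS, placeholder="Unassigned"):
    """Return a full-length lineage, filling absent ranks with a placeholder.

    lineage is a dict keyed by rank name; unknown keys raise an error
    so that typos in rank names are caught early.
    """
    unknown = set(lineage) - set(ranks)
    if unknown:
        raise ValueError(f"unexpected ranks: {sorted(unknown)}")
    return [lineage.get(rank) or placeholder for rank in ranks]


# Made-up lineage that only goes from class down to species.
partial = {
    "class": "ExampleClass",
    "order": "ExampleOrder",
    "family": "ExampleFamily",
    "genus": "Examplus",
    "species": "Examplus demo",
}
padded = pad_lineage(partial)
print("\t".join(padded))
```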