WrightonLabCSU / DRAM

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
GNU General Public License v3.0

Dram1.4rc #207

Closed rmFlynn closed 1 year ago

rmFlynn commented 1 year ago

Here is the release note for this release candidate. It is still a candidate and will not automatically update.

This is the first release candidate of DRAM1.4.0. The 1.4.0 release has significant changes that could impact your research. Please review these changes and help us validate this release!

Install / upgrade:

In a few weeks DRAM will be upgraded in Bioconda and can then be upgraded like any Conda package. You will still be able to install DRAM1.3.5 with the traditional Conda method outlined in the README, but for early adoption you will need to use the installation method below. This method has also been added to the README under Install Release Candidate.

To install a potentially unstable release candidate of DRAM, use the set of commands below that suits your situation. Note the comments within the code sections: some commands must be run in a specific context.

If you already have a DRAM environment and want to upgrade:

# Activate your old DRAM environment first!
# Save your old config
DRAM-setup.py export_config > my_old_config.txt
# If you want to install in a new environment follow the instructions below and import your config with the last command in this block
# Clone the git repository
git clone https://github.com/WrightonLabCSU/DRAM.git
# You may need to install pip (the Conda package is named pip)
conda install pip
# Make sure the pip path is in your Conda environment path
which pip3
# Install DRAM
pip3 install ./DRAM
# Import your old databases
DRAM-setup.py import_config --config_loc my_old_config.txt

To install the DRAM release candidate in a new Conda environment:

git clone https://github.com/WrightonLabCSU/DRAM.git
cd DRAM
# Install dependencies, this will also install a stable version of DRAM that will then be replaced.
conda env create --name my_dram_env -f environment.yaml
conda activate my_dram_env
# Install pip (the Conda package is named pip)
conda install pip
pip3 install ./

Change log:

  1. DRAM distill now includes a new metabolism for methylation. Although it was planned for DRAM2, you can already include this tool in annotation and distillation, provided you follow the instructions below.

    To distill with methyl, you only need to download the new FASTA file and point to it with the DRAM custom database options that were introduced in DRAM1.3. Note that to distill correctly, you must use the exact name ‘methyl’ and must use DRAM 1.4.

    To annotate with methyl, do something like:

    wget https://raw.githubusercontent.com/shafferm/DRAM/master/data/methylotrophy/methylotrophy.faa
    DRAM.py annotate -i '/some/path/*.fasta' -o dram_output --threads 30 --custom_db_name methyl --custom_fasta_loc methylotrophy.faa

    To distill with methyl:

    wget https://raw.githubusercontent.com/shafferm/DRAM/master/data/methylotrophy/methylotrophy_distillate.tsv
    DRAM.py distill -i dram_output/annotations.tsv -o dram_output/distillate --custom_distillate methylotrophy_distillate.tsv

    Learn more about custom databases in the Wiki.

  2. Glycoside hydrolase subfamily calls: subfamily calls are now being incorporated into annotations through changes in databases and code. This affects what gets pulled into the distillate and product, because these look for family-level IDs (e.g. AA1), not subfamily-level IDs (e.g. AA1_1, AA2_2).

    In response, DRAM is changing the output of the dbCAN database in DRAM1.4. Raw CAZyme subfamilies will be output into the cazy_id column, and the corresponding description for the CAZyme family will be put into the cazy_hit column.

    The distillation in DRAM1.4 will count CAZymes marked at the subfamily level at the family level. For CAZyme family AA1, this means there will be four entries in the distillate (AA1, AA1_1, AA1_2, and AA1_3), and the sum of these four will be the total number of AA1 CAZymes. In DRAM1.3 and earlier, the distillate entry for AA1 in this example (with no underscore) would include CAZymes that can be assigned to family AA1 but do not have a subfamily designation.

    The DRAM product will also count CAZymes at the family level. For the AA1 example, AA1_1, AA1_2, and AA1_3 will be counted as AA1 under the current rules for assigning CAZymes to compounds.
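The family-level counting described in item 2 can be illustrated with a short sketch. This is not DRAM's implementation: the helper names `cazy_family` and `family_totals` are hypothetical, and the sketch assumes a subfamily ID is always a family ID followed by an underscore and a suffix, as in the AA1 example.

```python
from collections import Counter


def cazy_family(cazy_id: str) -> str:
    """Return the family part of a CAZy ID, e.g. 'AA1_2' -> 'AA1'.

    Assumption: everything before the first underscore is the family.
    """
    return cazy_id.split("_", 1)[0]


def family_totals(cazy_ids):
    """Count CAZymes per family, folding subfamily IDs into their family."""
    return Counter(cazy_family(cazy_id) for cazy_id in cazy_ids)


# Hypothetical cazy_id values for a handful of genes.
hits = ["AA1", "AA1_1", "AA1_2", "AA1_3", "AA1_1", "GH5_4"]
totals = family_totals(hits)
print(totals["AA1"])  # 5
print(totals["GH5"])  # 1
```

Every AA1 subfamily call adds to the AA1 total, which is how the product treats AA1_1, AA1_2, and AA1_3 in the example above.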

  3. More changes are also being made that will affect CAZy IDs in DRAM1.4. The cutoff e-value is being changed to 1e-18 to conform to best practices for the database.

    DRAM1.4 also introduces a new column for the best hit per gene from the dbCAN database, named cazy_best_hit. This column will hold the match to the gene that has the highest coverage and lowest full-sequence e-value as calculated by mmseqs, with priority on e-value. cazy_best_hit will be the only column considered downstream in the distillate and product. DRAM1.3, by contrast, pulls and counts all dbCAN hits that pass the e-value 1e-15 cutoff, rather than profiling best hits.

    A new column named cazy_subfamily_ec, which holds EC number information from subfamilies, has been added in DRAM1.4. These EC numbers will also be used in the distillate along with those from KEGG, as part of pathways and other tools. For now, incomplete EC numbers will be included but not considered for the distillate. The subfamilies will be excluded from the product, in keeping with its goal of being a larger overview.
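The best-hit rule in item 3 (lowest e-value first, then highest coverage) can be sketched as follows. The `Hit` type and `best_hit` function are hypothetical names used only for illustration, and treating the 1e-18 cutoff as inclusive is an assumption; DRAM's own code may differ.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    """One dbCAN hit for a gene (hypothetical structure)."""
    cazy_id: str
    evalue: float
    coverage: float


def best_hit(hits: List[Hit], max_evalue: float = 1e-18) -> Optional[Hit]:
    """Pick the hit with the lowest e-value, breaking ties by highest coverage.

    Hits with an e-value above max_evalue are discarded first
    (assumption: the cutoff is inclusive).
    """
    passing = [hit for hit in hits if hit.evalue <= max_evalue]
    if not passing:
        return None
    # Sort key: e-value ascending has priority, then coverage descending.
    return min(passing, key=lambda hit: (hit.evalue, -hit.coverage))


gene_hits = [
    Hit("GH5_4", 1e-40, 0.80),
    Hit("GH5", 1e-40, 0.95),    # same e-value, higher coverage
    Hit("AA1_1", 1e-20, 0.99),  # passes the cutoff but has a worse e-value
    Hit("GH13", 1e-10, 1.00),   # fails the 1e-18 cutoff
]
print(best_hit(gene_hits).cazy_id)  # GH5
```

Because e-value takes priority, the high-coverage AA1_1 hit loses to both GH5 hits; coverage only decides between hits with equal e-values.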

  4. Logging is now fully implemented in DRAM1.4. Log files will be created for almost all DRAM functions. By default, the log file for annotations will appear in the annotations folder, and the log file for DRAM distillation will appear in the distillation folder. You can also use the --log_file_path argument to set the log path. The log file for database processing is set by the config file, and by default it will be in the databases directory. All content that DRAM prints to the command line will also appear in the log file.

    1. The DRAM config now stores when databases were downloaded, along with citation and version information where applicable. This information is printed to the log at the beginning of each run. The old format can still be imported if you want to keep your DRAM1.3 databases.

  5. Significant bug fixes are also included in this release.

    • When the input FASTAs contain duplicate header names, the DRAM annotate step should fail with an error immediately rather than at the end of the annotation process. This will save some people a lot of time. This may only be a problem when annotating genomes, but in any case the check must be in place across workflows.
    • Some users have firewalls on their HPC environments that prevent downloads via FTP; in some cases, switching to HTTP can solve download problems. In DRAM1.4, if FTP links fail, a backup HTTP link will be attempted before an error is thrown. See issue #206.
    • DRAM1.4 will ensure that DRAM setup still works if no databases are downloaded. Previously, some databases depended on data being downloaded and could not be set up with a provided data set.
    • Reduced unnecessary warnings in various repetitive tasks in DRAM distillation by refactoring pandas code.
    • BIO-RELATED: This bug fix could affect biology. In the past, the counting of EC numbers was inconsistent. When counting EC numbers in a single row of the annotations file, duplicates were not counted; when counting EC numbers across the full set of data, however, the count included such duplicates. This is now corrected, but it could have some small unexpected downstream effects.
    • Glycoside hydrolase subfamily calls (see item 2 above).
    • In response to issue #122, you can now pass a config file at run time or set one through the environment variable DRAM_CONFIG_LOCATION. Read more in the Wiki.
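The duplicate-header check from the bug-fix list can be sketched as a fail-fast pass over the FASTA headers before any annotation work starts. This is an illustration, not DRAM's code: `duplicate_headers` and `check_headers` are hypothetical names, and the sketch assumes a sequence ID is the first whitespace-delimited token after `>`.

```python
from collections import Counter


def duplicate_headers(fasta_lines):
    """Return the sequence IDs that appear more than once, sorted.

    Assumption: the ID is the first whitespace-delimited token of a header.
    """
    ids = [
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    ]
    counts = Counter(ids)
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


def check_headers(fasta_lines):
    """Raise immediately if any header is duplicated."""
    dups = duplicate_headers(fasta_lines)
    if dups:
        raise ValueError("Duplicate FASTA headers: " + ", ".join(dups))


# Hypothetical input: scaffold_1 appears twice.
records = [
    ">scaffold_1 len=4", "ACGT",
    ">scaffold_2", "GGCC",
    ">scaffold_1", "TTAA",
]
print(duplicate_headers(records))  # ['scaffold_1']
```

Calling `check_headers(records)` before annotation raises a `ValueError` right away, which is the time-saving behavior the bullet describes.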