WrightonLabCSU / DRAM

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
GNU General Public License v3.0

update database version of dbcan in dram #62

Closed francis29029 closed 3 years ago

francis29029 commented 3 years ago

Hello, I would like to update the dbCAN database in DRAM. From what I understand, the installed version is version 8: DRAM_data/dbCAN-HMMdb-V8.txt (there are also h3f, h3p, h3i and h3m files).

According to the dbCAN page (http://bcb.unl.edu/dbCAN2/download/), the latest version is version 9, from August 2020.

Any suggestions on how to update the dbCAN version in DRAM? (I am guessing it's harder than just copying the version 9 txt file into the DRAM installation folder.)

Thanks a lot

Francis

shafferm commented 3 years ago

Thanks for catching this! I am updating the database setup code to download the new version for new setups.

Yes, if you want to upgrade in place, it is a bit more work than replacing the dbCAN-HMMdb-V8.txt file, but not too much.

You need to download the files dbCAN-HMMdb-V9.txt and http://bcb.unl.edu/dbCAN2/download/Databases/CAZyDB.07302020.fam-activities.txt into your folder with the rest of the DRAM databases. Then you need to run these commands in your DRAM conda environment:

hmmpress -f dbCAN-HMMdb-V9.txt
DRAM-setup.py set_database_locations --dbcan_db_loc dbCAN-HMMdb-V9.txt --dbcan_fam_activities CAZyDB.07302020.fam-activities.txt --update_description_db

This has to rebuild the full descriptions database so it will take a while but you will only need to use one processor. After this you will be up and running with dbCAN V9.
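As an illustration only, the in-place upgrade described above could be wrapped in a small Python helper. This is a minimal sketch, not part of DRAM: the function names (`build_upgrade_commands`, `missing_files`, `run_upgrade`) are hypothetical, and it assumes both downloaded files already sit in the DRAM database folder and that it is run inside the DRAM conda environment. The two commands are exactly the ones quoted in the comment.

```python
import subprocess
from pathlib import Path


def build_upgrade_commands(hmm_file="dbCAN-HMMdb-V9.txt",
                           fam_file="CAZyDB.07302020.fam-activities.txt"):
    """Return the two commands from the comment above as argument lists."""
    return [
        ["hmmpress", "-f", hmm_file],
        ["DRAM-setup.py", "set_database_locations",
         "--dbcan_db_loc", hmm_file,
         "--dbcan_fam_activities", fam_file,
         "--update_description_db"],
    ]


def missing_files(db_dir, names):
    """List the expected files that are not present in db_dir."""
    return [name for name in names if not (Path(db_dir) / name).is_file()]


def run_upgrade(db_dir, dry_run=True):
    """Print the commands, and run them in db_dir unless dry_run is set."""
    commands = build_upgrade_commands()
    # The file names are the second and fourth-from-last arguments built above.
    needed = [commands[0][2], commands[1][5]]
    absent = missing_files(db_dir, needed)
    if absent:
        raise FileNotFoundError(f"download these first: {absent}")
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(cmd, cwd=db_dir, check=True)
```

Calling `run_upgrade("/path/to/DRAM_data")` only prints the commands; passing `dry_run=False` executes them, and the second command is the slow one because it rebuilds the descriptions database.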

francis29029 commented 3 years ago

Ah, great to see you back! OK for the procedure to update to V9.

Nevertheless, I have to admit that I did not manage to use V8 either, using DRAM.py annotate with the custom_db_name options:

I tried DRAM.py annotate -i 'MAG001.fa' -o test_dbcan --custom_db_name dbCAN --custom_fasta_loc dbCAN --custom_fasta_loc dbCAN-HMMdb-V8.txt

subprocess.CalledProcessError: Command '['mmseqs', 'createdb', 'dbCAN-HMMdb-V8.txt', 'test_dbcan_re_rest1/working_dir/custom_dbs/dbCAN-HMMdb-V8.txt.custom.mmsdb']' returned non-zero exit status 1.

I am not sure whether --custom_fasta_loc dbCAN-HMMdb-V8.txt is correct or whether I should use another file.

To make it work, I downloaded the V9 fasta file from the dbCAN website (CAZyDB.07312020.fa) and ran DRAM.py annotate -i 'MAG001.fa' -o test_dbcan6 --custom_db_name dbCAN --custom_fasta_loc CAZyDB.07312020.fa. The annotation.tsv file then had the "dbcan columns" I was looking for.

Question for you: how do I run DRAM using dbCAN? Is it with DRAM.py annotate using --custom_db_name and --custom_fasta_loc?

Thanks a lot

Francis

francis29029 commented 3 years ago

Just to complement my previous message: by "dbcan columns" I mean columns such as dbCAN_id, dbCAN_hit, dbCAN_RBH, dbCAN_identity, dbCAN_bitScore and dbCAN_eVal.

Another question: do you think you could add DIAMOND annotation to DRAM (to complement dbCAN)?

Thanks a lot

Francis

shafferm commented 3 years ago

Hi Francis,

DRAM annotates with dbCAN2 using their provided HMMs by default. The columns with RBH, identity, bitScore and eVal are only provided by the "BLAST-style" searches, which are done using mmseqs2 (the results are nearly identical to DIAMOND's). As long as dbCAN2 is set up in DRAM (which you can check by seeing whether the file path to the dbCAN2 HMMs is listed when you run DRAM-setup.py print_config), you should see a column labelled cazy_hits, which lists all the dbCAN2 HMMs that had significant hits (according to the dbCAN2 suggestions for thresholds). If you want more detail on the dbCAN2 searches, you need to run your annotation with the --keep_tmp_dir flag. The full detail for all hits to dbCAN2 is then stored for each input fasta, and you can dig deeper into those.
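A rough way to automate the configuration check mentioned above is to scan the text printed by DRAM-setup.py print_config for the dbCAN entry. The sketch below is hypothetical and not part of DRAM: it assumes the report uses one "label: path" pair per line with the label "dbCAN db" (as in the config output quoted later in this thread), and that an unset entry is printed as empty or as "None". Both assumptions should be checked against your own output.

```python
def parse_config_report(text):
    """Map each 'label: value' line of a config report to its value."""
    entries = {}
    for line in text.splitlines():
        label, sep, value = line.partition(": ")
        value = value.strip()
        # Assumption: an unset path shows up as an empty value or as 'None'.
        if sep and value not in ("", "None"):
            entries[label.strip()] = value
    return entries


def dbcan_is_configured(text):
    """True when a dbCAN HMM path is listed in the report."""
    return "dbCAN db" in parse_config_report(text)


# Example report with placeholder paths (not real locations).
report = (
    "dbCAN db: /path/to/DRAM_data/dbCAN-HMMdb-V8.txt\n"
    "dbCAN family activities: /path/to/DRAM_data/fam-activities.txt\n"
)
print(dbcan_is_configured(report))
```

In practice the `text` argument would be the captured standard output of DRAM-setup.py print_config.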

The custom database search in DRAM only works for BLAST-style searches, so it needs to be given a fasta file and not an HMM file, as you saw. We plan to allow HMMs as custom databases in the future. Also, when you annotate using the dbCAN2 fasta as a custom database, we can't pull CAZy IDs from those hits for distilling.
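Since passing an HMM file to --custom_fasta_loc only fails once mmseqs createdb runs, a quick pre-check of the file type can save a failed run. The helper below is a hypothetical sketch, not a DRAM feature. It relies on two format conventions: FASTA records begin with ">", and HMMER3 profile files begin with a line starting with "HMMER3".

```python
def sniff_db_format(path):
    """Guess 'fasta', 'hmm' or 'unknown' from the first non-blank line."""
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("HMMER3"):
                return "hmm"
            return "unknown"
    return "unknown"  # empty file


def usable_as_custom_db(path):
    """Only fasta files suit a BLAST-style (mmseqs2) custom database."""
    return sniff_db_format(path) == "fasta"
```

With the files from this thread, `usable_as_custom_db("dbCAN-HMMdb-V8.txt")` would be expected to return False and `usable_as_custom_db("CAZyDB.07312020.fa")` True, which matches the behaviour reported above.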

We don't plan to add DIAMOND searching to DRAM as we chose to use mmseqs2 which provides nearly identical results.

Also DRAM now supports annotating with dbCAN2 v9 in the new release v1.2.0.

Hope this helps!

Mike

francis29029 commented 3 years ago

Hello Mike, yes, it definitely helps to understand the system!

Yes, we do have dbCAN installed (it says dbCAN, not dbCAN2, but I guess it's the same, right?):
dbCAN db: /nihs/Software/python/Anaconda3-2020.11-DRAM/DRAM_data_1/dbCAN-HMMdb-V8.txt
dbCAN family activities: /nihs/Software/python/Anaconda3-2020.11-DRAM/DRAM_data_1/CAZyDB.07312019.fam-activities.txt

Regarding the "dbCAN2 suggestions for thresholds", can you confirm that you are using the ones suggested by dbCAN2 (see http://bcb.unl.edu/dbCAN2/blast.php): E-value < 1e-15, coverage > 0.35?

Francis

shafferm commented 3 years ago

Yes and yes. In the CONFIG file it just says dbCAN and we are using those thresholds for considering a hit to dbCAN. DRAM reports all dbCAN2 HMMs which hit that threshold for each gene.
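To make the thresholds concrete, here is a minimal sketch of the filtering rule confirmed above (E-value < 1e-15 and coverage > 0.35). It is not DRAM's actual code. The `Hit` record and function names are hypothetical, the example hits are invented, and coverage is assumed to mean the fraction of the HMM's length spanned by the alignment.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    gene: str        # query gene identifier
    family: str      # dbCAN HMM (CAZy family) name
    evalue: float
    hmm_from: int    # alignment start on the HMM
    hmm_to: int      # alignment end on the HMM
    hmm_length: int  # total HMM length

    @property
    def coverage(self):
        # Assumption: coverage = aligned span of the HMM / HMM length.
        return (self.hmm_to - self.hmm_from) / self.hmm_length


def significant_hits(hits, max_evalue=1e-15, min_coverage=0.35):
    """Keep hits that pass both thresholds quoted in the thread."""
    return [h for h in hits
            if h.evalue < max_evalue and h.coverage > min_coverage]


def families_per_gene(hits):
    """Collect every passing family for each gene, in input order."""
    result = {}
    for hit in significant_hits(hits):
        result.setdefault(hit.gene, []).append(hit.family)
    return result


# Invented example hits, for illustration only.
example = [
    Hit("gene_1", "GH13", 1e-30, 10, 200, 300),   # passes both thresholds
    Hit("gene_1", "CBM48", 1e-10, 1, 80, 90),     # E-value too high
    Hit("gene_2", "GT2", 1e-20, 1, 30, 170),      # coverage too low
]
print(families_per_gene(example))
```

A gene can keep several families, which mirrors the statement that all dbCAN2 HMMs passing the threshold are reported for each gene.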

francis29029 commented 3 years ago

Thanks for the quick answer. Very reassuring. We are currently updating to V9 using the commands you suggested:

hmmpress -f dbCAN-HMMdb-V9.txt
DRAM-setup.py set_database_locations --dbcan_db_loc dbCAN-HMMdb-V9.txt --dbcan_fam_activities CAZyDB.07302020.fam-activities.txt --update_description_db

It has been running for 6 hours so far. Any idea how long it takes? Does this job need to be submitted to a machine with lots of memory (as for the initial DRAM-setup.py prepare_databases)?

shafferm commented 3 years ago

The job doesn't need a ton of memory but it does take a long time because it is rebuilding the entire database of all annotations from all databases that DRAM uses. Hopefully it is done by now!

francis29029 commented 3 years ago

It's up and running! We finally decided to rebuild from scratch using the DRAM 1.2 version that you updated (and yes, it now includes v9 of dbCAN). Thanks a lot for that!