WrightonLabCSU / DRAM

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
GNU General Public License v3.0

Error in Getting hits from dbCAN #245

Closed dgittins closed 1 year ago

dgittins commented 1 year ago

Hello

I am running DRAM.py annotate in DRAM v1.4.3 with 1 TB of memory, but I get the following error:

2022-12-20 13:44:47,875 - The log file is created at dram_annotation/annotate.log.
2022-12-20 13:44:47,876 - 1 FASTAs found
2022-12-20 13:44:47,886 - Starting the Annotation of Bins with database configuration: 

2022-12-20 13:44:47,887 - Retrieved database locations and descriptions
2022-12-20 13:44:47,887 - Annotating SRR4293331_maxbin.001
2022-12-20 13:45:18,913 - Turning genes from prodigal to mmseqs2 db
2022-12-20 13:45:21,973 - Getting hits from kofam
2022-12-20 13:59:16,733 - Getting forward best hits from peptidase
2022-12-20 13:59:23,735 - Getting reverse best hits from peptidase
2022-12-20 13:59:28,914 - Getting descriptions of hits from peptidase
2022-12-20 13:59:29,067 - Getting hits from pfam
2022-12-20 13:59:37,476 - Getting hits from dbCAN
Traceback (most recent call last):
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: no such column: dbcan_description.ec

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/daniel.gittins/miniconda3/envs/dram/bin/DRAM.py", line 207, in <module>
    args.func(**args_dict)
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 1513, in annotate_bins
    all_annotations = annotate_fastas(
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 1394, in annotate_fastas
    annotate_fasta(
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 1239, in annotate_fasta
    annotations = annotate_orfs(
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 1107, in annotate_orfs
    run_hmmscan(
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/mag_annotator/utils.py", line 210, in run_hmmscan
    return formater(hits)
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 320, in dbcan_hmmscan_formater
    hits_df[f"{db_name}_hits"] = hits_df[f"{db_name}_ids"].apply(description_pull)
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/pandas/core/series.py", line 4771, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/pandas/core/apply.py", line 1105, in apply
    return self.apply_standard()
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/pandas/core/apply.py", line 1156, in apply_standard
    mapped = lib.map_infer(
  File "pandas/_libs/lib.pyx", line 2918, in pandas._libs.lib.map_infer
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 313, in description_pull
    description_list = db_handler.get_descriptions(
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/mag_annotator/database_handler.py", line 206, in get_descriptions
    descriptions = [
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/mag_annotator/database_handler.py", line 211, in <listcomp>
    .all()
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2772, in all
    return self._iter().all()
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2915, in _iter
    result = self.session.execute(
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1714, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1572, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1943, in _execute_context
    self._handle_dbapi_exception(
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2124, in _handle_dbapi_exception
    util.raise_(
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 210, in raise_
    raise exception
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/home/daniel.gittins/miniconda3/envs/dram/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such column: dbcan_description.ec
[SQL: SELECT dbcan_description.id AS dbcan_description_id, dbcan_description.description AS dbcan_description_description, dbcan_description.ec AS dbcan_description_ec 
FROM dbcan_description 
WHERE dbcan_description.id IN (?, ?, ?)]
[parameters: ('AA1', 'AA1', 'AA1')]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

Do you know what is causing this?

Thank you

rmFlynn commented 1 year ago

Yes, it looks like your dbCAN database is too old, so its descriptions do not contain the necessary sub-family EC numbers. If this is a new database, then we will need to look deeper. The official advice is to rebuild the database, but I will let you in on a secret: you may be able to use DRAM-setup.py prepare_databases --select_db dbcan to update just dbcan. It is worth a try at least. This eventuality was not covered in the release notes, and I will fix that now. Sorry for the frustration!
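One way to confirm this diagnosis before rebuilding anything is to inspect the description database directly. The sketch below is not part of DRAM; it only assumes what the traceback shows, namely that the description database is a SQLite file containing a table named dbcan_description, and that the failing query expects a column named ec. The helper names table_columns and has_ec_column are made up for illustration.

```python
import os
import sqlite3
from contextlib import closing


def table_columns(db_path: str, table: str) -> list[str]:
    """Return the column names of `table` in the SQLite file at `db_path`."""
    # sqlite3.connect() silently creates a missing file, so check first.
    if not os.path.isfile(db_path):
        raise FileNotFoundError(db_path)
    with closing(sqlite3.connect(db_path)) as conn:
        # PRAGMA table_info returns one row per column; index 1 is the name.
        # The table name is interpolated as-is, so pass trusted names only.
        rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in rows]


def has_ec_column(db_path: str) -> bool:
    """True if the dbcan_description table already has an `ec` column."""
    return "ec" in table_columns(db_path, "dbcan_description")
```

Calling has_ec_column() on the file that the description_db entry of your DRAM config points to should return False for a database built before the EC column was introduced, which would match the "no such column: dbcan_description.ec" error above.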

ganiatgithub commented 1 year ago

Hi, thanks for the secret. Could you take a look at the following:

DRAM-setup.py prepare_databases --select_db dbcan --output_dir /fs03/rp24/Database/DRAM --threads 2 --dbcan_fam_activities /fs03/rp24/Database/DRAM/CAZyDB.08062022.fam-activities.txt
2023-01-07 16:31:27,352 - Starting the process of downloading data
2023-01-07 16:31:27,353 - The kegg_loc argument was not used to specify a downloaded kegg file, and dram can not download it its self. So it is assumed that the user wants to set up DRAM without it
2023-01-07 16:31:27,353 - The gene_ko_link_loc argument was not used to specify a downloaded gene_ko_link file, and dram can not download it its self. So it is assumed that the user wants to set up DRAM without it
2023-01-07 16:31:27,354 - Database preparation started
2023-01-07 16:31:27,354 - Downloading dbcan
2023-01-07 16:31:36,143 - All raw data files were downloaded successfully
2023-01-07 16:31:36,144 - Processing dbcan
2023-01-07 16:31:38,503 - dbCAN database processed
2023-01-07 16:31:38,513 - Moved dbcan to final destination, configuration updated
2023-01-07 16:31:38,513 - Populating the description db, this may take some time
Traceback (most recent call last):
  File "/home/gnii0001/rp24/gaofeng/tools/Miniconda3/envs/DRAM/bin/DRAM-setup.py", line 184, in <module>
    args.func(**args_dict)
  File "/home/gnii0001/rp24/gaofeng/tools/Miniconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_processing.py", line 578, in prepare_databases
    db_handler.populate_description_db(db_handler.config['description_db'], select_db, update_config=False)
  File "/home/gnii0001/rp24/gaofeng/tools/Miniconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_handler.py", line 505, in populate_description_db
    check_db(i, k)
  File "/home/gnii0001/rp24/gaofeng/tools/Miniconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_handler.py", line 465, in check_db
    db_function(), f"{db_name}_description", clear_table=True
  File "/home/gnii0001/rp24/gaofeng/tools/Miniconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_handler.py", line 400, in process_dbcan_descriptions
    with open(dbcan_fam_activities) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/CAZyDB.08062022.fam-activities.txt'

PS: I first tried without --dbcan_fam_activities, and it returned the same error.

rmFlynn commented 1 year ago

So sorry, I let my test environment get committed, so some of its paths got into your config. Simply edit your config to replace them with null, or post the output of DRAM-setup.py export_config here and I will fix it really fast. I already did this once before, but I have now fixed it so that it cannot happen again. Sorry.
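The "replace them with null" step can be scripted rather than done by hand. The following is a minimal sketch, not a DRAM feature: it assumes the exported config is JSON and that file paths live in the three sections search_databases, database_descriptions, and dram_sheets, as in the export posted further down in this thread. The function names and the file names in the usage note are hypothetical.

```python
import json
import os

# Sections that map a database name to a file path (assumed from the
# exported config posted in this thread).
PATH_SECTIONS = ("search_databases", "database_descriptions", "dram_sheets")


def null_missing_paths(config: dict) -> list[str]:
    """Set path entries that do not exist on disk to None; return their keys."""
    cleared = []
    for section in PATH_SECTIONS:
        for name, path in config.get(section, {}).items():
            if path is not None and not os.path.exists(path):
                config[section][name] = None  # written out as JSON null
                cleared.append(f"{section}.{name}")
    return cleared


def clean_config_file(src: str, dst: str) -> list[str]:
    """Read a JSON config from `src`, null stale paths, and write it to `dst`."""
    with open(src) as f:
        config = json.load(f)
    cleared = null_missing_paths(config)
    with open(dst, "w") as f:
        json.dump(config, f, indent=2)
    return cleared
```

For example, after saving the output of DRAM-setup.py export_config to a file, clean_config_file("exported_config.json", "cleaned_config.json") returns the list of entries it cleared, and the cleaned file can then be loaded with DRAM-setup.py import_config --config_loc. Entries that point at files which really exist on your machine are left untouched.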

ganiatgithub commented 1 year ago

What I did was import an existing CONFIG using DRAM-setup.py import_config --config_loc /fs03/rp24/Database/DRAM/CONFIG. DRAM (version 1.4.4) was installed through git clone followed by pip3 install.

Now I see the issue, here it is:

{
  "search_databases": {
    "kegg": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/kegg.20221012.mmsdb",
    "kofam_hmm": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/kofam_profiles.hmm",
    "kofam_ko_list": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/kofam_ko_list.tsv",
    "uniref": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/uniref90.20220928.mmsdb",
    "pfam": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/pfam.mmspro",
    "dbcan": "/fs03/rp24/Database/DRAM/dbCAN-HMMdb-V11.txt",
    "viral": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/refseq_viral.20220928.mmsdb",
    "peptidase": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/peptidases.20220928.mmsdb",
    "vogdb": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/vog_latest_hmms.txt"
  },
  "database_descriptions": {
    "pfam_hmm": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/Pfam-A.hmm.dat.gz",
    "dbcan_fam_activities": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/CAZyDB.08062022.fam-activities.txt",
    "dbcan_subfam_ec": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/CAZyDB.08062022.fam.subfam.ec.txt",
    "vog_annotations": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/vog_annotations_latest.tsv.gz"
  },
  "dram_sheets": {
    "genome_summary_form": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/genome_summary_form.20220928.tsv",
    "module_step_form": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/module_step_form.20220928.tsv",
    "etc_module_database": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/etc_mdoule_database.20220928.tsv",
    "function_heatmap_form": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/function_heatmap_form.20220928.tsv",
    "amg_database": "/home/projects-wrighton-2/DRAM/development_flynn/public_DRAM/sep_12_22_dram1.4_rc_setup_test/testoutput/DRAM1_4_pycallgraph_3/amg_database.20220928.tsv"
  },
  "dram_version": "1.4.0rc1",
  "description_db": "/fs03/rp24/Database/DRAM/description_db.sqlite",
  "setup_info": {
    "kegg": {
      "name": "KEGG db",
      "description_db_updated": "10/12/2022, 18:52:36",
      "citation": " M. Kanehisa, M. Furumichi, Y. Sato, M. Ishiguro-Watanabe, and M. Tanabe, \"Kegg: integrating viruses and cellular organisms,\" Nucleic acids research, vol. 49, no. D1, pp. D545\u2013D551, 2021."
    },
    "kofam_hmm": {
      "name": "KOfam db",
      "citation": "T. Aramaki, R. Blanc-Mathieu, H. Endo, K. Ohkubo, M. Kanehisa, S. Goto, and H. Ogata, \"Kofamkoala: Kegg ortholog assignment based on profile hmm and adaptive score threshold,\" Bioinformatics, vol. 36, no. 7, pp. 2251\u20132252, 2020.",
      "Download time": "09/28/2022, 11:00:09",
      "Origin": "Downloaded by DRAM"
    },
    "kofam_ko_list": {
      "name": "KOfam KO list",
      "citation": "T. Aramaki, R. Blanc-Mathieu, H. Endo, K. Ohkubo, M. Kanehisa, S. Goto, and H. Ogata, \"Kofamkoala: Kegg ortholog assignment based on profile hmm and adaptive score threshold,\" Bioinformatics, vol. 36, no. 7, pp. 2251\u20132252, 2020.",
      "Download time": "09/28/2022, 11:00:11",
      "Origin": "Downloaded by DRAM"
    },
    "uniref": {
      "name": "UniRef db",
      "description_db_updated": "09/29/2022, 13:14:40",
      "citation": "Y. Wang, Q. Wang, H. Huang, W. Huang, Y. Chen, P. B. McGarvey, C. H. Wu, C. N. Arighi, and U. Consortium, \"A crowdsourcing open platform for literature curation in uniprot,\" PLoS Biology, vol. 19, no. 12, p. e3001464, 2021.",
      "version": "90",
      "Download time": "09/28/2022, 11:15:01",
      "Origin": "Downloaded by DRAM"
    },
    "pfam": {
      "name": "Pfam db",
      "citation": "J. Mistry, S. Chuguransky, L. Williams, M. Qureshi, G. A. Salazar, E. L. Sonnhammer, S. C. Tosatto, L. Paladin, S. Raj, L. J. Richardson et al., \"Pfam: The protein families database in 2021,\" Nucleic acids research, vol. 49, no. D1, pp. D412\u2013D419, 2021.",
      "Download time": "09/28/2022, 11:49:29",
      "Origin": "Downloaded by DRAM",
      "description_db_updated": "09/29/2022, 13:23:47"
    },
    "pfam_hmm": {
      "name": "Pfam hmm dat",
      "description_db_updated": "Unknown, or Never",
      "citation": "J. Mistry, S. Chuguransky, L. Williams, M. Qureshi, G. A. Salazar, E. L. Sonnhammer, S. C. Tosatto, L. Paladin, S. Raj, L. J. Richardson et al., \"Pfam: The protein families database in 2021,\" Nucleic acids research, vol. 49, no. D1, pp. D412\u2013D419, 2021.",
      "Download time": "09/28/2022, 11:49:31",
      "Origin": "Downloaded by DRAM"
    },
    "dbcan": {
      "name": "dbCAN db",
      "citation": "Y. Yin, X. Mao, J. Yang, X. Chen, F. Mao, and Y. Xu, \"dbcan: a web resource for automated carbohydrate-active enzyme annotation,\" Nucleic acids research, vol. 40, no. W1, pp. W445\u2013W451, 2012.",
      "version": "11",
      "Download time": "01/07/2023, 16:31:36",
      "Origin": "Downloaded by DRAM"
    },
    "dbcan_fam_activities": {
      "name": "dbCAN family activities",
      "citation": "Y. Yin, X. Mao, J. Yang, X. Chen, F. Mao, and Y. Xu, \"dbcan: a web resource for automated carbohydrate-active enzyme annotation,\" Nucleic acids research, vol. 40, no. W1, pp. W445\u2013W451, 2012.",
      "version": "11",
      "upload_date": "08062022",
      "Download time": "09/28/2022, 11:49:33",
      "Origin": "Downloaded by DRAM"
    },
    "dbcan_subfam_ec": {
      "name": "dbCAN subfamily EC numbers",
      "citation": "Y. Yin, X. Mao, J. Yang, X. Chen, F. Mao, and Y. Xu, \"dbcan: a web resource for automated carbohydrate-active enzyme annotation,\" Nucleic acids research, vol. 40, no. W1, pp. W445\u2013W451, 2012.",
      "version": "11",
      "upload_date": "08062022",
      "Download time": "09/28/2022, 11:49:33",
      "Origin": "Downloaded by DRAM"
    },
    "vogdb": {
      "name": "VOGDB db",
      "citation": "J. Thannesberger, H.-J. Hellinger, I. Klymiuk, M.-T. Kastner, F. J. Rieder, M. Schneider, S. Fister, T. Lion, K. Kosulin, J. Laengle et al., \"Viruses comprise an extensive pool of mobile genetic elements in eukaryote cell cultures and human clinical samples,\" The FASEB Journal, vol. 31, no. 5, pp. 1987\u20132000, 2017.",
      "version": "latest",
      "Download time": "09/28/2022, 11:51:57",
      "Origin": "Downloaded by DRAM",
      "description_db_updated": "09/29/2022, 13:24:14"
    },
    "vog_annotations": {
      "name": "VOG annotations",
      "description_db_updated": "Unknown, or Never",
      "citation": "J. Thannesberger, H.-J. Hellinger, I. Klymiuk, M.-T. Kastner, F. J. Rieder, M. Schneider, S. Fister, T. Lion, K. Kosulin, J. Laengle et al., \"Viruses comprise an extensive pool of mobile genetic elements in eukaryote cell cultures and human clinical samples,\" The FASEB Journal, vol. 31, no. 5, pp. 1987\u20132000, 2017.",
      "version": "latest",
      "Download time": "09/28/2022, 11:51:58",
      "Origin": "Downloaded by DRAM"
    },
    "viral": {
      "name": "RefSeq Viral db",
      "description_db_updated": "09/29/2022, 13:16:15",
      "citation": "J. R. Brister, D. Ako-Adjei, Y. Bao, and O. Blinkova, \"Ncbi viral genomes resource,\" Nucleic acids research, vol. 43, no. D1, pp. D571\u2013D577, 2015. [3] M. Kanehisa, M. Furumichi, Y. Sato, M. Ishiguro-Watanabe, and M. Tan-abe, \"Kegg: integrating viruses and cellular organisms,\" Nucleic acids research, vol. 49, no. D1, pp. D545\u2013D551, 2021.",
      "viral_files": 2,
      "Download time": "09/28/2022, 11:52:20",
      "Origin": "Downloaded by DRAM"
    },
    "peptidase": {
      "name": "MEROPS peptidase db",
      "description_db_updated": "09/29/2022, 13:23:40",
      "citation": "N. D. Rawlings, A. J. Barrett, P. D. Thomas, X. Huang, A. Bateman, and R. D. Finn, \"The merops database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the panther database,\" Nucleic acids research, vol. 46, no. D1, pp. D624\u2013D632, 2018.",
      "Download time": "09/28/2022, 12:01:46",
      "Origin": "Downloaded by DRAM"
    },
    "genome_summary_form": {
      "name": "Genome summary form",
      "branch": "master",
      "Download time": "09/28/2022, 12:01:46",
      "Origin": "Downloaded by DRAM"
    },
    "module_step_form": {
      "name": "Module step form",
      "branch": "master",
      "Download time": "09/28/2022, 12:01:47",
      "Origin": "Downloaded by DRAM"
    },
    "function_heatmap_form": {
      "name": "Function heatmap form",
      "branch": "master",
      "Download time": "09/28/2022, 12:01:47",
      "Origin": "Downloaded by DRAM"
    },
    "amg_database": {
      "name": "AMG database",
      "branch": "master",
      "Download time": "09/28/2022, 12:01:47",
      "Origin": "Downloaded by DRAM"
    },
    "etc_module_database": {
      "name": "ETC module database",
      "branch": "master",
      "Download time": "09/28/2022, 12:01:47",
      "Origin": "Downloaded by DRAM"
    }
  },
  "log_path": null
}

rmFlynn commented 1 year ago

So looking over this, it seems that the only database that was updated is dbcan, and the rest are the defaults from my test environment. So maybe the import failed, or maybe the update happened after the import. In any case, I would take a copy of the config from your original environment, find the dbcan entry, and replace it with "dbcan": "/fs03/rp24/Database/DRAM/dbCAN-HMMdb-V11.txt". Also download these two files, CAZyDB.08062022.fam-activities.txt.gz and CAZyDB.08062022.fam.subfam.ec.txt.gz, unzip them in the same place, and replace dbcan_fam_activities and dbcan_subfam_ec with these lines:

 "dbcan_fam_activities": "/fs03/rp24/Database/DRAM/CAZyDB.08062022.fam-activities.txt",
 "dbcan_subfam_ec": "/fs03/rp24/Database/DRAM/CAZyDB.08062022.fam.subfam.ec.txt"

Of course, that will only work if you still have your old environment. If you lost it, you will need a full setup: get the empty config with wget https://raw.githubusercontent.com/shafferm/DRAM/master/mag_annotator/CONFIG, import it with DRAM-setup.py import_config --config_loc some/where/CONFIG, and then run setup again. I am sorry for the complications.
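For the first route (patching an existing config), the unzip and edit steps could be done with a short script. This is only a sketch under stated assumptions: the CONFIG file is taken to be JSON with the same layout as the export posted earlier in this thread, and gunzip and patch_dbcan_paths are hypothetical helper names, not DRAM functions.

```python
import gzip
import json
import shutil


def gunzip(src: str) -> str:
    """Decompress `src` (must end in .gz) next to itself; return the new path."""
    if not src.endswith(".gz"):
        raise ValueError(f"expected a .gz file, got {src}")
    dst = src[:-3]  # strip the ".gz" suffix
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def patch_dbcan_paths(config_path: str, fam_activities: str, subfam_ec: str) -> None:
    """Point the two dbCAN description entries of a JSON config at new files."""
    with open(config_path) as f:
        config = json.load(f)
    descriptions = config.setdefault("database_descriptions", {})
    descriptions["dbcan_fam_activities"] = fam_activities
    descriptions["dbcan_subfam_ec"] = subfam_ec
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
```

In use, each downloaded .gz file would be passed through gunzip(), and the two returned paths handed to patch_dbcan_paths() together with the location of a copy of your config. Work on a copy rather than the installed file, then load the result with DRAM-setup.py import_config --config_loc.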

ganiatgithub commented 1 year ago

Worked with v1.4.6, much appreciated!