[Feature Request] Must add option to provide path to ETE3 database OR provide in EukCC db

jolespin commented 2 years ago

Here's my command:

eukcc folder --threads 16 --out eukcc_output/ --links eukcc_output/links.csv --db $DB $DIR

Here's my log:

04-03-2022 00:22:59:  EukCC version 2.1.0
04-03-2022 00:22:59:  Found 4 bins
04-03-2022 00:23:47:  Searching for marker genes in base database
04-03-2022 00:23:47:  No placement marker genes found.
04-03-2022 00:23:47:  Searching for marker genes in base database
04-03-2022 00:23:49:  Found 20 marker genes, placing them in the tree using epa-ng
NCBI database not present yet (first time used?)
Downloading taxdump.tar.gz from NCBI FTP site (via HTTP)...
Done. Parsing...
Inserting taxids:       2400000 Loading node names...
2404457 names loaded.
264829 synonyms loaded.
Loading nodes...
2404457 nodes loaded.
Linking nodes...
Tree is loaded.
Updating database: /home/jespinoz/.etetoolkit/taxa.sqlite ...
 2404000 generating entries...
Uploading to /home/jespinoz/.etetoolkit/taxa.sqlite

Traceback (most recent call last):
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-binning_env2/bin/eukcc", line 8, in <module>
    sys.exit(main())
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-binning_env2/lib/python3.8/site-packages/eukcc/__main__.py", line 418, in main
    eukcc_folder(args)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-binning_env2/lib/python3.8/site-packages/eukcc/refine.py", line 65, in eukcc_folder
    refine(state)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-binning_env2/lib/python3.8/site-packages/eukcc/refine.py", line 168, in refine
    bins.append(bin(state, wd, path, protein=True))
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-binning_env2/lib/python3.8/site-packages/eukcc/bin.py", line 22, in __init__
    self.run_eukcc()
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-binning_env2/lib/python3.8/site-packages/eukcc/bin.py", line 42, in run_eukcc
    clade = E.determine_subdb()
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-binning_env2/lib/python3.8/site-packages/eukcc/eukcc.py", line 359, in determine_subdb
    lng = tax_LCA(tree, info, places)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-binning_env2/lib/python3.8/site-packages/eukcc/treehandler.py", line 138, in tax_LCA
    info = load_tax_info(taxinfo)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-binning_env2/lib/python3.8/site-packages/eukcc/base.py", line 55, in load_tax_info
    ncbi = NCBITaxa()
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-binning_env2/lib/python3.8/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 112, in __init__
    self.update_taxonomy_database(taxdump_file)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-binning_env2/lib/python3.8/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 131, in update_taxonomy_database
    update_db(self.dbfile)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-binning_env2/lib/python3.8/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 796, in update_db
    upload_data(dbfile)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/VEBA-binning_env2/lib/python3.8/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 856, in upload_data
    db.commit()
sqlite3.OperationalError: disk I/O error

This database should ship inside eukcc2_db_ver_1.1 and be read directly from there, instead of being downloaded to the home directory on the first run.

My institute only allocates 5GB to our home directories so mine is pretty much full already.

jolespin commented 2 years ago

Also, one more edit I've found:

  --links LINKS         Path to a link table generated with bamlinks.py. If suuplied paired reads will be used to refine bins (Recommended)

The script name should be binlinks.py, not bamlinks.py, and "suuplied" should be "supplied".

openpaul commented 2 years ago

Thank you, that's a very good comment. Our institute is also stingy with home folders, so I use a symlink for the ete folder. But you are right, this would make the installation easier.

I will have to look up how to provide a different folder for ete but I think I saw that option somewhere. Then it should be pretty easy to just distribute this database.
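For anyone who needs the symlink workaround in the meantime, here is a minimal sketch of it. It assumes ete3 keeps its database under ~/.etetoolkit (as the log above shows); the helper name link_ete_dir and the scratch location are only illustrative.

```python
from pathlib import Path


def link_ete_dir(home: Path, target: Path) -> Path:
    """Make home/.etetoolkit a symlink to `target`, a directory on a larger disk.

    `link_ete_dir` is a hypothetical helper, not part of EukCC or ete3.
    """
    link = home / ".etetoolkit"
    target.mkdir(parents=True, exist_ok=True)

    if link.is_symlink():
        if link.resolve() == target.resolve():
            return link  # already linked to the right place, nothing to do
        raise FileExistsError(f"{link} already points somewhere else")

    if link.exists():
        # A real directory is in the way; refuse rather than overwrite it.
        raise FileExistsError(f"{link} exists; move its contents to {target} first")

    link.symlink_to(target, target_is_directory=True)
    return link
```

Calling link_ete_dir(Path.home(), Path("/some/scratch/etetoolkit")) before the first EukCC run should make ete3 write taxa.sqlite to the scratch disk instead of the home quota. The scratch path here is a placeholder.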

jolespin commented 2 years ago

It's really easy. What I usually do is this:

DATABASE_TAXA="/usr/local/scratch/CORE/jespinoz/db/ncbi_taxonomy/v2021.08.03/taxa.sqlite"
...
parser.add_argument("-t","--database_taxa", type=str, required=False, default=DATABASE_TAXA, help = "taxa.sqlite [Default: {}]".format(DATABASE_TAXA))

...
ncbi = NCBITaxa(dbfile=opts.database_taxa)

It will also make results more consistent, because if you run it a year from now it might download a different taxonomy database.
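A slightly fuller sketch of the same idea, in case it helps. The precedence (explicit option, then a taxa.sqlite inside the EukCC database folder, then ete3's default) and the names resolve_taxa_db and build_parser are assumptions on my part, not existing EukCC code. The only ete3 call used is NCBITaxa(dbfile=...).

```python
import argparse
from pathlib import Path

# ete3's default location, as seen in the log above.
ETE_DEFAULT = Path.home() / ".etetoolkit" / "taxa.sqlite"


def resolve_taxa_db(cli_value=None, eukcc_db=None):
    """Pick the taxa.sqlite to use: CLI option > EukCC db folder > ete3 default."""
    if cli_value:
        return Path(cli_value)
    if eukcc_db:
        # Hypothetical layout: taxa.sqlite shipped at the top of the EukCC db.
        candidate = Path(eukcc_db) / "taxa.sqlite"
        if candidate.is_file():
            return candidate
    return ETE_DEFAULT


def build_parser():
    parser = argparse.ArgumentParser(prog="eukcc-sketch")
    parser.add_argument("--db", default=None, help="Path to the EukCC database folder")
    parser.add_argument(
        "-t", "--database_taxa", default=None,
        help="Path to a prebuilt ete3 taxa.sqlite (hypothetical option)",
    )
    return parser


def load_ncbi(dbfile):
    # Imported lazily so the path logic above can be used without ete3 installed.
    from ete3 import NCBITaxa
    return NCBITaxa(dbfile=str(dbfile))
```

Usage would be something like load_ncbi(resolve_taxa_db(args.database_taxa, args.db)), so that nothing is written to the home directory when a database is supplied either way.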

In the same vein, it might be useful to provide a --taxa_sqlite option or something for the ncbi_update submodule:

https://github.com/Finn-Lab/EukCC/blob/aefa73e0a897e1840011ab551ac7fbfdb4b8358e/eukcc/__main__.py#L20

https://github.com/etetoolkit/ete/blob/7d868bdae2b7dfd9c348e889c4cc320ea43f8e51/ete3/ncbi_taxonomy/ncbiquery.py#L124

but make the default the taxdump.tar.gz file (the prospective one in the updated EukCC database)
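Roughly what I have in mind for the update step, as a sketch only. NCBITaxa accepts both dbfile and taxdump_file, so the database can be built from a local tarball instead of a fresh download. The helper names taxa_kwargs and build_or_load are made up for illustration.

```python
from pathlib import Path


def taxa_kwargs(dbfile, taxdump_file=None):
    """Build keyword arguments for ete3's NCBITaxa.

    If `dbfile` already exists it is used as-is. Otherwise a local
    taxdump.tar.gz must be given so the database is built from it.
    """
    dbfile = Path(dbfile)
    kwargs = {"dbfile": str(dbfile)}
    if dbfile.is_file():
        return kwargs  # reuse the existing database, no rebuild

    if taxdump_file is None or not Path(taxdump_file).is_file():
        raise FileNotFoundError(
            f"{dbfile} does not exist and no local taxdump.tar.gz was provided"
        )
    kwargs["taxdump_file"] = str(taxdump_file)
    return kwargs


def build_or_load(dbfile, taxdump_file=None):
    # Imported lazily so taxa_kwargs can be checked without ete3 installed.
    from ete3 import NCBITaxa
    return NCBITaxa(**taxa_kwargs(dbfile, taxdump_file))
```

With a taxdump.tar.gz distributed alongside the EukCC database, every installation would build the same taxa.sqlite regardless of when it was first run.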

Looking forward to integrating this into an essential pipeline at JCVI but this part is a limiting factor.

One more question: does EukCC save the MetaEuk genes, proteins, and GFF file?
