Suggested edits for version 2 (Adding ete3 db, updating metaeuk, custom protein files)

jolespin commented 2 years ago

I've been taking a deep dive into eukaryotic metagenomics and EukCC seems like a great tool to have in the repertoire. From the documentation, it seems that EukCC version 2 is still in development so I thought I could make a log of suggested changes or edits:

Update MetaEuk version from mandatory metaeuk=4.a0f584d to metaeuk>=4.a0f584d. The newest version 5-34c21f2 created GFF files
Add option to provide custom protein sequence files. This could be similar to CheckM's usage. The reason for this is that I plan on adding this into a pipeline where I have already modeled the genes via EukCC using a custom DB. This could also be useful if someone wanted to use MAKER or another gene modeling software.
I've already mentioned this here but adding the ETE3 NCBI taxonomy databases in the EUKCC2DB directory. For example,

(eukcc_env) -bash-4.2$ ls -lh /usr/local/scratch/CORE/jespinoz/db/eukcc/eukcc2_db_ver_1.1.2
total 9.0K
lrwxrwxrwx 1 jespinoz tigr 67 Mar  4 12:05 db_base -> /usr/local/scratch/CORE/jespinoz/db/eukcc/eukcc2_db_ver_1.1/db_base
lrwxrwxrwx 1 jespinoz tigr 68 Mar  4 12:05 db_fungi -> /usr/local/scratch/CORE/jespinoz/db/eukcc/eukcc2_db_ver_1.1/db_fungi
lrwxrwxrwx 1 jespinoz tigr 68 Mar  4 12:05 db_plant -> /usr/local/scratch/CORE/jespinoz/db/eukcc/eukcc2_db_ver_1.1/db_plant
lrwxrwxrwx 1 jespinoz tigr 71 Mar  4 12:05 db_protozoa -> /usr/local/scratch/CORE/jespinoz/db/eukcc/eukcc2_db_ver_1.1/db_protozoa
lrwxrwxrwx 1 jespinoz tigr 73 Mar  4 12:06 taxa.sqlite -> /usr/local/scratch/CORE/jespinoz/db/ncbi_taxonomy/v2021.08.03/taxa.sqlite
lrwxrwxrwx 1 jespinoz tigr 76 Mar  4 12:06 taxdump.tar.gz -> /usr/local/scratch/CORE/jespinoz/db/ncbi_taxonomy/v2021.08.03/taxdump.tar.gz

Then the following code could be edited: From base.py

def load_tax_info(path, sep=","):
    """
    Load taxonomic information from two column csv
    With first column giving the taxid and second column
    giving the name of the tree leaf.

    :param path: path to csv file
    :return: dictionary of accession: ncbi_lineage
    """
    dbfile = os.path.join(os.environ["EUKCC2_DB"], "taxa.sqlite")

    ncbi = NCBITaxa(dbfile=dbfile)
    d = {}
    with open(path) as fin:
        reader = csv.reader(fin, delimiter=sep)
        for row in reader:
            d[row[1]] = [str(x) for x in ncbi.get_lineage(row[0])]
    return d

from __main__.py:

def update():
    from ete3 import NCBITaxa

    logging.info("Going to fetch NCBI info using ete3")
    dbfile = os.path.join(os.environ["EUKCC2_DB"], "taxa.sqlite")
    ncbi = NCBITaxa(dbfile=dbfile)

    taxdump_file = os.path.join(os.environ["EUKCC2_DB"], "taxdump.tar.gz")
    ncbi.update_taxonomy_database(taxdump_file= taxdump_file)

EDIT: The commands above won't work because I falsely assumed EUKCC2_DB was an environment variable that was propagated through the different modules but that's not the case. I was able to modify the following:

# base.py 

def load_tax_info(path, sep=",", dbfile=None):
    """
    Load taxonomic information from two column csv
    With first column giving the taxid and second column
    giving the name of the tree leaf.

    :param path: path to csv file
    :return: dictionary of accession: ncbi_lineage
    """
    dbfile = os.path.join(os.environ["EUKCC2_DB"], "taxa.sqlite")

    ncbi = NCBITaxa(dbfile=dbfile)
    d = {}
    with open(path) as fin:
        reader = csv.reader(fin, delimiter=sep)
        for row in reader:
            d[row[1]] = [str(x) for x in ncbi.get_lineage(row[0])]
    return d

# eukcc.py 
    def ncbi_place(self, state):
        """
        Given taxids we can find all reference species used to construct the
        backbone tree with overlapping taxonomy.
        Requires database to be loaded.

        :param taxids: List or set of taxids
        :return: dict with placements
        """
        # parse, so we allow comma and spaces
        taxa = []
        for t in self.state["taxids"]:
            if "," in t:
                taxa.extend(t.split(","))
            else:
                taxa.append(t)
        # convert taxa in a set of strings
        taxa = set([str(t) for t in taxa])
        dbfile = os.path.join(self.state["db"], "taxa.sqlite")
        info = load_tax_info(state["dbinfo"]["files"]["taxinfo"], sep=",", dbfile=dbfile)
        # find all nodes that intersect with the taxids
        nodes = set()
        for node, lng in info.items():
            if len(taxa & set(lng)) > 0:
                nodes.add(node)
        # make sure these are also in our SCMG set
        scmgs = load_SCMGs(state["dbinfo"]["files"]["scmgs"])
        nodes = nodes & set(scmgs.keys())

        placements = [{"n": x} for x in nodes]
        logging.info("Located {} species corresponding to the provided taxids".format(len(nodes)))
        return {"placements": placements, "genomes": nodes}

but was unable to figure out how to get the db path from treehandler.py (this function: tax_LCA)

Example error:

Traceback (most recent call last):
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/eukcc_env/bin/eukcc", line 10, in <module>
    sys.exit(main())
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/eukcc_env/lib/python3.9/site-packages/eukcc/__main__.py", line 421, in main
    eukcc_folder(args)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/eukcc_env/lib/python3.9/site-packages/eukcc/refine.py", line 65, in eukcc_folder
    refine(state)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/eukcc_env/lib/python3.9/site-packages/eukcc/refine.py", line 168, in refine
    bins.append(bin(state, wd, path, protein=True))
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/eukcc_env/lib/python3.9/site-packages/eukcc/bin.py", line 22, in __init__
    self.run_eukcc()
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/eukcc_env/lib/python3.9/site-packages/eukcc/bin.py", line 42, in run_eukcc
    clade = E.determine_subdb()
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/eukcc_env/lib/python3.9/site-packages/eukcc/eukcc.py", line 359, in determine_subdb
    lng = tax_LCA(tree, info, places)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/eukcc_env/lib/python3.9/site-packages/eukcc/treehandler.py", line 138, in tax_LCA
    info = load_tax_info(taxinfo)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/eukcc_env/lib/python3.9/site-packages/eukcc/base.py", line 55, in load_tax_info
    dbfile = os.path.join(os.environ["EUKCC2_DB"], "taxa.sqlite")
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/eukcc_env/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'EUKCC2_DB'

Is there a way to access the database path throughout all the scripts?

Apologies for bombarding your GitHub today. Finally got to a point in my pipeline that I've been working on for over a year and the EukCC bit is a critical stage.

jolespin commented 2 years ago

Just following up on this. Any thoughts? If you don't have the bandwidth to implement please let me know and I'll try to find an alternative. Thanks.

jolespin commented 2 years ago

The biggest limiting factor is load_tax_info is loaded in eukcc.py and treehandler.py so it doesn't have access to the initial arguments.

Right now, load_tax_info when using NCBITaxa essentially hardcodes the $HOME directory.

Another alternative to ete3 is just using taxonkit lineage.

I've tried a few different manual edits for this but other than hardcoding my actual path, it's unusable.

openpaul commented 2 years ago

Sorry I did not get back to you, I am currently quite busy as well but should be getting to this now.

Adding the ETE database is a very good idea in my opinion and on my todo list.

I will have a look at your suggestions and might come up with a solution to this issue.

Updating metaEuk, I am not sure about as the training was done on metaeuk 4 and updating it might lead to wrong results.

Providing custom protein files, you mean instead of providing the genome? That is already supported. Just pass the option -AA.

Sorry again for the delay.

jolespin commented 2 years ago

Awesome, thanks for getting back to me. I took a couple of cracks at it (hence the fork) but I wasn't successful without hardcoding it. I hope some of the notes above help out if you decide to implement this.

Didn't see the -AA option but that seems like exactly what I was looking for with that note!

More than willing to test out some code for you if you decide to implement.

openpaul commented 2 years ago

So, I have looked into the ETE3 database inclusion. If you want to test it, download this new database http://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/eukcc2_db_ver_1.2.tar.gz and install the version on the dev branch including the commit 36d0257

Thank you for the motivation. Let me know if this works.

Should the usage of the provided ete database be optional or mandatory? I am leaning towards mandatory as then I do not need to add more code, but am open for comments.

jolespin commented 2 years ago

If it's already in eukcc2_db_ver_1.2.tar.gz, I would use that as a default if it's easy to make the program easier to run but being able to specify optionally would make it more flexible for people to use the NCBI taxdb that their dataset is based off.

EBI-Metagenomics / EukCC

Suggested edits for version 2 (Adding ete3 db, updating metaeuk, custom protein files) #30