eggnogdb / eggnog-mapper

Fast genome-wide functional annotation through orthology assignment
http://eggnog-mapper.embl.de
GNU Affero General Public License v3.0

emapperdb-5.0.0: GO_freq, KEGG_freq, SMART_freq, and proteins in og table? #211

Closed selveyad closed 4 years ago

selveyad commented 4 years ago

Howdy,

I was wondering if there was a way to get the GO_freq, KEGG_freq, SMART_freq, and proteins columns within the og table that appear in the v4.5.1 database, but not in the v5.0.0 database? I am using EnTAP for transcriptome annotation, which utilizes EggNOG for GO annotations and it requires the values within these columns. I would like to utilize the most up-to-date information that EggNOG has to offer, instead of the older versions that EnTAP currently utilizes. If they were removed from v5.0.0 (probably for good reason), are there equivalents to these categories in v5.0.0? I can just modify the SQL commands to select the equivalent categories if they exist.

Many Thanks,

Alex

Cantalapiedra commented 4 years ago

Hi,

I am a bit confused. Isn't EnTAP already using the eggnog5 database? https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13106

"Following selection of the optimal target sequence, independent gene family assignment is initiated with a local EggNOG database via EggNOG‐mapper (Huerta‐Cepas et al., 2017). The current release, version 5.0, consisting of 4.4 M orthologous groups derived from 379 taxonomic levels, provides an alternative means of GO, pathway, and protein domain assignment (Huerta‐Cepas et al., 2019)."

selveyad commented 4 years ago

They make the comment that the current release is an "alternative means of GO, pathway, and protein domain assignment", but do not cite it as the version being used. They cite the 2017 paper as the version used.

The current release of EnTAP (0.9.2-beta) utilizes the v4.1 eggnog.db.gz, eggnog_proteins.dmnd.gz, and eggnog4.clustered_proteins.fa.gz. Excerpt from the EggnogDatabase.h code (located at https://github.com/harta55/EnTAP/blob/master/src/database/EggnogDatabase.h):

lines 64-69

    typedef enum {
        EGGNOG_VERSION_UNKONWN=0,
        EGGNOG_VERSION_EARLIER,
        EGGNOG_VERSION_4_5_1,
        EGGNOG_VERSION_MAX
    } EGGNOG_SQL_VERSION;

lines 96-98

private:

/* OLD Links
    const std::string FTP_EGGNOG_SQL  = "http://eggnogdb.embl.de/download/emapperdb-4.5.0/eggnog.db.gz";
    const std::string FTP_EGGNOG_SQL  = "http://eggnogdb.embl.de/download/latest/eggnog-mapper-data/eggnog.db.gz";
    const std::string FTP_EGGNOG_DMND = "http://eggnogdb.embl.de/download/latest/eggnog-mapper-data/eggnog_proteins.dmnd.gz";
    const std::string FTP_EGGNOG_FASTA= "http://eggnogdb.embl.de/download/latest/eggnog-mapper-data/eggnog4.clustered_proteins.fa.gz";
*/
    // EggNOG 4.1 Links
    const std::string FTP_EGGNOG_SQL  = "http://eggnog5.embl.de/download/eggnog_4.1/eggnog-mapper-data/eggnog.db.gz";
    const std::string FTP_EGGNOG_DMND = "http://eggnog5.embl.de/download/eggnog_4.1/eggnog-mapper-data/eggnog_proteins.dmnd.gz";
    const std::string FTP_EGGNOG_FASTA= "http://eggnog5.embl.de/download/eggnog_4.1/eggnog-mapper-data/eggnog4.clustered_proteins.fa.gz";

They integrated the emapperdb-4.5.1 structure into the code and v4.5.1 works just fine, as seen here (lines 104-145):

 // SQL Data Constants (from EggNOG SQL Database)

    /*      emapper.db-4.5.1
     * bigg
     *      name            VARCHAR(32)
     *      reaction        VARCHAR(32)
     * eggnog
     *      name            VARCHAR(32)
     *      group           TEXT
     * event
     *      i               INTEGER
     *      level           VARCHAR(16)
     *      og              VARCHAR(16)
     *      side1           TEXT
     *      side2           TEXT
     * gene_ontology
     *      name            VARCHAR(32)
     *      gos             TEXT
     * kegg
     *      name            VARCHAR(32)
     *      ko              VARCHAR(32)
     * og
     *      og              VARCHAR(16)
     *      level           VARCHAR(16)
     *      nm              INTEGER
     *      description     TEXT
     *      COG_categories  VARCHAR(8)
     *      GO_freq         TEXT
     *      KEGG_freq       TEXT
     *      SMART_freq      TEXT
     *      proteins        TEXT
     * orthologs
     *      name            VARCHAR(32)
     *      orthoindex      TEXT
     * seq
     *      name            VARCHAR(32)
     *      pname           VARCHAR(32)
     * version
     *      version         VARCHAR(16)
     *
     *
     */

The problem I am having is that emapperdb-5.0.0 has an expanded kegg table (not an issue) but a truncated og table (the issue). Removing the GO_freq, KEGG_freq, SMART_freq, and proteins columns from og may well have solved a nagging issue or increased speed. However, EnTAP relies on the emapperdb-4.5.1 structure and WILL NOT work with 5.0.0.

Here is where the EggnogDatabase.cpp script fails (located at https://github.com/harta55/EnTAP/blob/master/src/database/EggnogDatabase.cpp):

lines 452-480

void EggnogDatabase::get_sql_data(QuerySequence::EggnogResults *eggnogResults) {
    // Lookup description, KEGG, protein domain from SQL database
    if (!eggnogResults->og_key.empty()) {
        std::vector<std::vector<std::string>>results;
        std::string sql_kegg;
        std::string sql_desc;
        std::string sql_protein;

        char *query = sqlite3_mprintf(
                "SELECT description, KEGG_freq, SMART_freq FROM og WHERE og=%Q",
                eggnogResults->og_key.c_str());
        try {
            results = _pSQLDatabase->query(query);
            sql_desc = results[0][0];
            sql_kegg = results[0][1];
            sql_protein = results[0][2];
            if (!sql_desc.empty() && sql_desc.find("[]") != 0) eggnogResults->description = sql_desc;
            if (!sql_kegg.empty() && sql_kegg.find("[]") != 0) {
                eggnogResults->kegg = format_sql_data(sql_kegg);
            }
            if (!sql_protein.empty() && sql_protein.find("{}") != 0){
                eggnogResults->protein_domains = format_sql_data(sql_protein);
            }
        } catch (std::exception &e) {
            // Do not fatal error
            FS_dprint(e.what());
        }
    }
}

When the query tries to SELECT KEGG_freq and SMART_freq FROM og, there is nothing to grab, so the call fails with: Error querying database: near " ": syntax error. The message is not descriptive at all. My guess is that the SELECT picked up description from og fine but choked on KEGG_freq, and reported the space between description and KEGG_freq as the error location. Whether or not it is this exact space doesn't matter; the issue remains the same. The emapperdb-4.5.1 and 5.0.0 structures are different, and a structure closer to 4.5.1 is needed for EnTAP to use the 5.0.0 database.
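One way to confirm which columns a given eggnog.db actually exposes, rather than inferring it from the error message, is to ask SQLite for the table's schema. The sketch below is a minimal Python illustration using the standard sqlite3 module. The helper names (`og_columns`, `missing_og_columns`) are hypothetical, and the in-memory table only mimics the truncated 5.0.0 og layout described above; it is not the real database.

```python
import sqlite3


def og_columns(conn):
    """Return the set of column names in the og table."""
    # PRAGMA table_info yields one row per column; index 1 holds the name.
    return {row[1] for row in conn.execute("PRAGMA table_info(og)")}


def missing_og_columns(conn, required=("description", "KEGG_freq", "SMART_freq")):
    """List the columns EnTAP's query selects that this og table lacks."""
    present = og_columns(conn)
    return [col for col in required if col not in present]


# Toy stand-in for a 5.0.0-style og table (ASSUMPTION: the 4.5.1 layout
# minus the GO_freq, KEGG_freq, SMART_freq, and proteins columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE og (og TEXT, level TEXT, nm INTEGER, "
    "description TEXT, COG_categories TEXT)"
)
print(missing_og_columns(conn))
```

Pointing `sqlite3.connect` at a real eggnog.db instead of `:memory:` would report the columns that file is missing for EnTAP's query.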

Two follow-up questions: 1) How different are 4.5.1 and 5.0.0? 2) Are there comparable columns that can fill the void left by GO_freq, KEGG_freq, SMART_freq, and proteins?

Your assistance is greatly appreciated!

Best Regards,

Alexander Selvey

Cantalapiedra commented 4 years ago

Hi Alexander,

If they use version 4.5.1 and will not update to 5.0.0, there is little we can do from our side. If you really want the annotation from eggnog5, then maybe you could annotate your sequences with both EnTAP and eggnog-mapper, and merge the results afterwards.

I am not that familiar with version 4.5.1, but I guess that those missing fields were a way to easily retrieve the annotation data for a given orthologous group.

Maybe you could retrieve those GOs from eggnog5 DB by first obtaining all the seed orthologs which are part of a given orthologous group ("eggnog" table), then retrieving all the GOs for those seeds ("gene_ontology" table). If you want to retrieve the GOs for a different orthology scheme (co-orthologs one2one, many2one, etc) you would need to go first through the "event" and "orthologs" tables path.

However, I am not completely sure that this was the information held in the GO_freq fields.

Best, Carlos

selveyad commented 4 years ago

Howdy,

Thanks for the workaround! I thought I might give you an example of the information found in each of the *_freq columns to see if you have any other suggestions.

GO_freq: {"Molecular Function":[["GO:0097159","organic cyclic compound binding","IEA",4,"100","100"],["GO:0000166","nucleotide binding","IEA",4,"100","100"],["GO:0036094","small molecule binding","IEA",4,"100","100"],["GO:0005488","binding","IEA",4,"100","100"],["GO:1901363","heterocyclic compound binding","IEA",4,"100","100"],["GO:1901265","nucleoside phosphate binding","IEA",4,"100","100"],["GO:0003676","nucleic acid binding","IEA",3,"75","98.4"],["GO:0003723","RNA binding","IEA",1,"25","92.5"]]}

KEGG_freq: [["Oocyte meiosis (04114)",4,"100","100"],["Natural killer cell mediated cytotoxicity (04650)",4,"100","100"],["Alzheimer's disease (05010)",4,"100","100"],["Calcium signaling pathway (04020)",4,"100","100"],["Dopaminergic synapse (04728)",4,"100","100"],["VEGF signaling pathway (04370)",4,"100","100"],["T cell receptor signaling pathway (04660)",4,"100","100"],["Wnt signaling pathway (04310)",4,"100","100"],["Apoptosis (04210)",4,"100","100"],["Amyotrophic lateral sclerosis (ALS) (05014)",4,"100","100"],["HTLV-I infection (05166)",4,"100","100"],["Glutamatergic synapse (04724)",4,"100","100"],["Axon guidance (04360)",4,"100","100"],["MAPK signaling pathway (04010)",4,"100","100"],["Tuberculosis (05152)",4,"100","100"],["Long-term potentiation (04720)",4,"100","100"],["Osteoclast differentiation (04380)",4,"100","100"],["Amphetamine addiction (05031)",4,"100","100"],["B cell receptor signaling pathway (04662)",4,"100","100"]]

SMART_freq: {"PFAM":[["Metallophos",4,"100","100"]],"SMART":[["PP2Ac",4,"100","100"]]}

proteins: 9598.ENSPTRP00000004567,9601.ENSPPYP00000002711,9606.ENSP00000378306,9593.ENSGGOP00000027171
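For anyone post-processing these columns, the three *_freq values shown above look like JSON and the proteins value like a comma-separated list. The following is a small Python sketch that parses shortened copies of those samples. The function names are hypothetical, and reading the numeric field after the name (or after the evidence code, for GO) as a member count is an assumption based on the samples, not on documentation.

```python
import json

# Shortened samples copied from the values above (most entries dropped).
go_freq = ('{"Molecular Function":['
           '["GO:0003676","nucleic acid binding","IEA",3,"75","98.4"],'
           '["GO:0003723","RNA binding","IEA",1,"25","92.5"]]}')
kegg_freq = ('[["Oocyte meiosis (04114)",4,"100","100"],'
             '["Apoptosis (04210)",4,"100","100"]]')
smart_freq = ('{"PFAM":[["Metallophos",4,"100","100"]],'
              '"SMART":[["PP2Ac",4,"100","100"]]}')
proteins = ("9598.ENSPTRP00000004567,9601.ENSPPYP00000002711,"
            "9606.ENSP00000378306,9593.ENSGGOP00000027171")


def parse_go_freq(text):
    """Flatten GO_freq into (category, go_id, term, count) tuples."""
    rows = []
    for category, entries in json.loads(text).items():
        # ASSUMPTION: the fourth field is the number of members with the term.
        for go_id, term, _evidence, count, *_rest in entries:
            rows.append((category, go_id, term, count))
    return rows


def parse_kegg_freq(text):
    """Return (pathway, count) pairs from KEGG_freq."""
    return [(entry[0], entry[1]) for entry in json.loads(text)]


def parse_smart_freq(text):
    """Return {source: [domain names]} from SMART_freq."""
    return {source: [entry[0] for entry in entries]
            for source, entries in json.loads(text).items()}


members = proteins.split(",")
print(parse_go_freq(go_freq))
print(parse_kegg_freq(kegg_freq))
print(parse_smart_freq(smart_freq))
print(len(members))
```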

Does your line of reasoning produce a similar outcome? I do like the idea of utilizing both EnTAP with dbv4.5.1 and eggnog-mapper with dbv5.0.0. Do you think that the results would be easily combined? I guess I could combine by transcript name and see how that fares. Might need a little finagling.
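Combining by transcript name could look roughly like the Python sketch below. It assumes both tools' results have already been parsed into dictionaries mapping a transcript ID to a list of GO IDs. The input dictionaries, the transcript names, and the `merge_annotations` helper are all hypothetical, and parsing the real EnTAP and eggnog-mapper output files is not shown.

```python
def merge_annotations(entap, emapper):
    """Union the GO terms per transcript, keeping transcripts seen in either source."""
    merged = {}
    for transcript in sorted(set(entap) | set(emapper)):
        terms = set(entap.get(transcript, ())) | set(emapper.get(transcript, ()))
        merged[transcript] = sorted(terms)
    return merged


# Hypothetical, already-parsed results keyed by transcript name.
entap_go = {
    "transcript_1": ["GO:0003676", "GO:0005488"],
    "transcript_2": ["GO:0000166"],
}
emapper_go = {
    "transcript_1": ["GO:0003723", "GO:0005488"],
    "transcript_3": ["GO:0036094"],
}

merged = merge_annotations(entap_go, emapper_go)
for transcript, terms in merged.items():
    print(transcript, ",".join(terms))
```

A plain union keeps every term from either database. Whether that is appropriate depends on how EnTAP post-processes its EggNOG hits, which this sketch does not try to reproduce.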

Thanks for getting back to me!

Best Regards,

Alex

Cantalapiedra commented 4 years ago

Thank you for the examples Alex. I already saw them in the sqlite3 DB though :)

I guess it would give a very similar output, if not identical, but a) I am not sure b) there won't be descriptions. You can try, for example:

    select name from eggnog where groups LIKE "%0RWQ4%";
    9601.ENSPPYP00000014337
    9606.ENSP00000331211
    9593.ENSGGOP00000008646
    9598.ENSPTRP00000021421

    select * from gene_ontology where name="9601.ENSPPYP00000014337";

(Repeat for all the orthologs and merge the results.)
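The "repeat and merge" step can be scripted. The Python sketch below runs the same two lookups through the standard sqlite3 module against a toy in-memory database. The table and column names follow the queries above, but the helper name `go_terms_by_member`, the toy rows, and the idea that `gos` holds a comma-separated string of GO IDs are assumptions that should be checked against a real eggnog.db.

```python
import sqlite3


def go_terms_by_member(conn, og_id):
    """Map each member protein of og_id to its set of GO IDs."""
    # Substring match, as in the query above. It can over-match groups whose
    # ID merely contains og_id, so stricter parsing may be preferable.
    members = [row[0] for row in conn.execute(
        'SELECT name FROM eggnog WHERE "groups" LIKE ?', (f"%{og_id}%",))]
    result = {}
    for name in members:
        row = conn.execute(
            "SELECT gos FROM gene_ontology WHERE name = ?", (name,)).fetchone()
        # ASSUMPTION: gos is a comma-separated string of GO IDs.
        result[name] = set(row[0].split(",")) if row and row[0] else set()
    return result


# Toy database; real column contents may be laid out differently.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE eggnog (name TEXT, "groups" TEXT)')
conn.execute("CREATE TABLE gene_ontology (name TEXT, gos TEXT)")
conn.executemany("INSERT INTO eggnog VALUES (?, ?)", [
    ("9606.ENSP00000331211", "0RWQ4"),
    ("9601.ENSPPYP00000014337", "0RWQ4"),
    ("1.OTHER", "0ABCD"),
])
conn.executemany("INSERT INTO gene_ontology VALUES (?, ?)", [
    ("9606.ENSP00000331211", "GO:0003676,GO:0005488"),
    ("9601.ENSPPYP00000014337", "GO:0005488"),
])

per_member = go_terms_by_member(conn, "0RWQ4")
print(per_member)
```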

The list of terms you obtain should be similar (identical?) to that in the "og" table for 0RWQ4 group: {"Molecular Function":[["GO:0097159","organic cyclic compound binding","IEA",4,"100","100"],["GO:0000166","nucleotide binding","IEA",4,"100","100"],["GO:0036094","small molecule binding","IEA",4,"100","100"],["GO:0005488","binding","IEA",4,"100","100"],["GO:1901363","heterocyclic compound binding","IEA",4,"100","100"],["GO:1901265","nucleoside phosphate binding","IEA",4,"100","100"],["GO:0003676","nucleic acid binding","IEA",3,"75","98.4"],["GO:0003723","RNA binding","IEA",1,"25","92.5"]]}
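In the og entry above, the count and the first percentage appear to be the number and share of group members carrying each term (for example 3 of 4 members gives "75"). Under that assumption, a tally over per-protein GO sets can rebuild those two fields, as in the Python sketch below. The assignment of terms to proteins is invented for illustration, `go_frequencies` is a hypothetical helper, and the last field of each entry (e.g. "98.4") is not reproduced because its meaning is not stated in this thread.

```python
from collections import Counter


def go_frequencies(per_member):
    """Return [(go_id, count, percent)] sorted by descending count, then GO ID."""
    total = len(per_member)
    # Count each term at most once per member protein.
    counts = Counter(go for terms in per_member.values() for go in set(terms))
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(go, n, round(100 * n / total, 1)) for go, n in ordered]


# Invented term assignments for the four members listed above.
per_member = {
    "9601.ENSPPYP00000014337": {"GO:0005488", "GO:0003676", "GO:0003723"},
    "9606.ENSP00000331211": {"GO:0005488", "GO:0003676"},
    "9593.ENSGGOP00000008646": {"GO:0005488", "GO:0003676"},
    "9598.ENSPTRP00000021421": {"GO:0005488"},
}

freqs = go_frequencies(per_member)
for go_id, count, percent in freqs:
    print(go_id, count, percent)
```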

About merging the EnTAP and eggnog5 results: unfortunately, I am not sure what EnTAP actually does (post-processing, I mean) with the eggnog-mapper results. But if it is something you could mimic, it should be OK, I guess.

Best, Carlos

selveyad commented 4 years ago

Carlos,

You're the man! I'll give it a swing and see if I can jury-rig it to make EnTAP happy.

Thanks for the help!

Best Regards,

Alex