eggnogdb / eggnog-mapper

Fast genome-wide functional annotation through orthology assignment
http://eggnog-mapper.embl.de
GNU Affero General Public License v3.0

extract best OG (v2.1.12) #483

Open algrgr opened 12 months ago

algrgr commented 12 months ago

Hello all,

I can't seem to find an easy way to extract the best OG from the output table. There are two relevant columns in the output table: "eggNOG_OGs" and "max_annot_lvl".

For example:

max_annot_lvl is 4751|Fungi
eggNOG_OGs is COG0474@1|root,KOG0204@2759|Eukaryota,38BS0@33154|Opisthokonta,3NX5D@4751|Fungi,3QJD1@4890|Ascomycota,20C4X@147545|Eurotiomycetes,3S5AJ@5042|Eurotiales

Given this info, what is an easy way to extract "3NX5D"? The script needs to look up the value of one column (max_annot_lvl) in the other (eggNOG_OGs) and extract the substring between the preceding "," and the "@" (3NX5D).

Any ideas would be appreciated.

Btw, it would be nice to add a "bestOG" field to the output, like there was in previous versions.

cheers, alex

Cantalapiedra commented 11 months ago

Hi @algrgr ,

For instance, you may split the eggNOG_OGs field by "," and then split each item by "@". Put the pieces in a dictionary (if using Python), with the right half (4751|Fungi) as key and the left half (3NX5D) as value. Then look up the "max_annot_lvl" value in the dictionary.
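The steps above could look roughly like this in Python. This is only a minimal sketch: `best_og` is a hypothetical helper name (not part of eggnog-mapper), and the two arguments are assumed to be the raw "eggNOG_OGs" and "max_annot_lvl" strings from one row of the annotations table.

```python
from typing import Optional


def best_og(eggnog_ogs: str, max_annot_lvl: str) -> Optional[str]:
    """Return the OG whose taxonomic level equals max_annot_lvl, or None.

    Hypothetical helper for illustration; not an eggnog-mapper function.
    """
    level_to_og = {}
    # First split: one "OG@taxid|name" item per comma-separated entry.
    for entry in eggnog_ogs.split(","):
        # Second split: left half is the OG id, right half is the level.
        og, _, level = entry.partition("@")
        level_to_og[level] = og
    # Look up the max_annot_lvl value among the levels.
    return level_to_og.get(max_annot_lvl)


# Example values taken from the question above.
ogs = (
    "COG0474@1|root,KOG0204@2759|Eukaryota,38BS0@33154|Opisthokonta,"
    "3NX5D@4751|Fungi,3QJD1@4890|Ascomycota,20C4X@147545|Eurotiomycetes,"
    "3S5AJ@5042|Eurotiales"
)
print(best_og(ogs, "4751|Fungi"))  # prints 3NX5D
```

If the same level appeared more than once in "eggNOG_OGs", this sketch would keep only the last OG listed for it.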

I hope this is of help.

Best, Carlos

algrgr commented 11 months ago

Hello @Cantalapiedra ,

Thanks for the reply! Since I don't have good skills for such complicated parsing, my colleague came up with an R script that does this extraction (frankly, I'd prefer a one-liner awk solution, but ok...). Still, perhaps you could output it in a separate field in future versions of eggNOG? This would save some headache for not-so-skillful people like me : ) And thanks much for the software, btw!

cheers, alex

Cantalapiedra commented 11 months ago

It should be easy to do with awk with 2 splits and one for loop. You may do it to practice ;) or do something similar to:

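# Skip "#" comment lines, then for each row print the OG (left of "@") in column 5
# whose level (right of "@") equals max_annot_lvl in column 6.
grep -v "^#" TEST.emapper.annotations |
awk -F $'\t' '{split($5, v, ","); for (a in v) {split(v[a], w, "@"); if (w[2]==$6) print w[1]}}'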

To make it easier to parse we may need to provide an additional file, since we are trying to avoid changing the output format as much as we can in recent versions. But yes, there are different things we may add to make life easier for users and downstream analyses. Thank you for the suggestion!

algrgr commented 11 months ago

@Cantalapiedra thanks indeed for that piece of script. That is more efficient!