cmkobel / CompareM2

🦠📇 Microbial genomes-to-report pipeline

https://CompareM2.readthedocs.io

GNU General Public License v3.0

Use newer uniref 100 database for kegg calling #78

cmkobel closed 4 months ago

cmkobel commented 8 months ago

According to the checkm2 paper, the uniref100 database currently in use may date from as far back as 2018?

cmkobel commented 6 months ago

Idea:

One plan could be to use GO biological process instead: https://genomespot.blogspot.com/2024/02/dont-use-kegg.html

So download uniref and mapping_selected (https://www.uniprot.org/help/downloads). Map with diamond or mmseqs2. Download GO BP, perform the hierarchical mapping, and compute enrichment.
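A minimal sketch of the last two steps (hierarchical mapping and enrichment), assuming the diamond/mmseqs2 hits have already been turned into a gene-to-GO-term table. All names here (`propagate`, `hypergeom_sf`, `enrich`, the `GO:child` style identifiers) are hypothetical, and the test shown is a plain hypergeometric over-representation test without multiple-testing correction, not a rank-based GSEA:

```python
from math import comb


def propagate(term_parents, direct_terms):
    """Return the direct terms plus all of their ancestors in the GO graph."""
    seen = set()
    stack = list(direct_terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue
        seen.add(term)
        stack.extend(term_parents.get(term, ()))
    return seen


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N genes out of M, of which n carry the term."""
    upper = min(n, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / comb(M, N)


def enrich(gene_terms, term_parents, study_genes):
    """Return {term: p-value} for every term seen at least once in the study set."""
    # Hierarchical mapping: each gene inherits the ancestors of its direct terms.
    background = {g: propagate(term_parents, ts) for g, ts in gene_terms.items()}
    study = set(study_genes) & set(background)
    M, N = len(background), len(study)
    results = {}
    for term in set().union(*background.values()):
        n = sum(term in terms for terms in background.values())
        k = sum(term in background[g] for g in study)
        if k:
            results[term] = hypergeom_sf(k, M, n, N)
    return results


# Toy input with made-up identifiers, only to show the call shape.
TERM_PARENTS = {"GO:child": ["GO:parent"], "GO:parent": ["GO:root"]}
GENE_TERMS = {"g1": ["GO:child"], "g2": ["GO:child"], "g3": ["GO:parent"], "g4": []}
PVALUES = enrich(GENE_TERMS, TERM_PARENTS, ["g1", "g2"])
```

In a real run the parent table would come from the GO ontology file and the p-values would still need correction for multiple testing.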

cmkobel commented 6 months ago

Ref https://github.com/chklovski/CheckM2/issues/99

Looks like KEGG is not a viable option for the future, and it is not possible to continue reusing the checkm2 database. Will have to seriously consider implementing GO BP GSEA.

cmkobel commented 6 months ago

Since I just closed #90 to merge it into this issue, I should mention that in any case, for licensing reasons (I think), the user (pipeline instance) must manually download the file:

Downloaded from https://www.kegg.jp/kegg-bin/download_htext?htext=ko00001.keg&format=json&filedir=
Used in kegg_pathway.R.
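For illustration, a small sketch of how a KO-to-pathway table could be pulled out of that JSON file. It assumes the file is a nested tree of `name`/`children` nodes with KO entries as leaves, which is how the sample below is shaped. The function name `ko_to_pathways` and the trimmed `SAMPLE` are hypothetical, and this is not a description of what kegg_pathway.R actually does:

```python
import json
import re

# Leaf names are assumed to start with a KO identifier such as "K00844".
KO_RE = re.compile(r"^(K\d{5})\b")


def ko_to_pathways(tree):
    """Map each KO identifier to the names of the nodes directly above it."""
    mapping = {}

    def walk(node, parent_name):
        name = node.get("name", "")
        children = node.get("children")
        match = KO_RE.match(name)
        if match and not children:
            # A KO can sit under several pathways, so collect them in a set.
            mapping.setdefault(match.group(1), set()).add(parent_name)
            return
        for child in children or []:
            walk(child, name)

    walk(tree, None)
    return mapping


# Trimmed, hand-written sample in the assumed shape of the downloaded file.
SAMPLE = json.loads("""
{"name": "ko00001", "children": [
  {"name": "09100 Metabolism", "children": [
    {"name": "09101 Carbohydrate metabolism", "children": [
      {"name": "00010 Glycolysis / Gluconeogenesis [PATH:ko00010]", "children": [
        {"name": "K00844  HK; hexokinase [EC:2.7.1.1]"}]},
      {"name": "00051 Fructose and mannose metabolism [PATH:ko00051]", "children": [
        {"name": "K00844  HK; hexokinase [EC:2.7.1.1]"}]}
    ]}
  ]}
]}
""")

KO_MAP = ko_to_pathways(SAMPLE)
```

With the real file, `json.load` on the manually downloaded path would replace the inline sample.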

cmkobel commented 4 months ago

I still don't have a good plan for this. Is going back to kofam_scan the best option? How does eggnog do it, and can its output be used for downstream hypertests?

cmkobel commented 4 months ago

The solution was to use eggnog to map to KO. It is now used in rule kegg_pathway as well.
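As a rough sketch of what reading KO assignments from eggnog output can look like: this assumes a tab-separated annotations table whose header line starts with `#query` and which has a `KEGG_ko` column holding values such as `ko:K00844,ko:K12407` or `-`. The function name `read_kos` and the trimmed sample are hypothetical, and the actual rule in the pipeline may handle this differently:

```python
import io


def read_kos(handle):
    """Yield (query, [KO, ...]) pairs from an annotations table."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#query"):
            # The header row names the columns; drop the leading "#".
            header = line.lstrip("#").split("\t")
            continue
        if not line or line.startswith("#"):
            continue  # skip metadata and blank lines
        if header is None:
            raise ValueError("no '#query' header line found before data rows")
        row = dict(zip(header, line.split("\t")))
        field = row.get("KEGG_ko", "-")
        if field in ("-", ""):
            kos = []
        else:
            # Strip the "ko:" prefix so the IDs match KEGG's K numbers.
            kos = [k[3:] if k.startswith("ko:") else k for k in field.split(",")]
        yield row["query"], kos


# Hand-written sample with only three columns; real files have many more.
SAMPLE = (
    "## hypothetical run metadata\n"
    "#query\tseed_ortholog\tKEGG_ko\n"
    "gene_1\tx\tko:K00844,ko:K12407\n"
    "gene_2\ty\t-\n"
)

ROWS = list(read_kos(io.StringIO(SAMPLE)))
```

The resulting per-gene KO lists are the kind of input a downstream pathway enrichment test would need.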