OSS-Lab / MetQy

Repository for R package MetQy (read related publication here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6247936/)

KEGG module completeness estimation is wrong #8

Open FWittmers opened 1 year ago

FWittmers commented 1 year ago

I have come across a problem in the way this package evaluates the completeness of KEGG pathway modules, given a set of KOs or something comparable. Specifically, this refers to the function in query_missingGenes_from_module.R, but it might also be present in some of the other KEGG-module-related functions.

The issue arises from how the function splits module definitions into blocks, based on spaces:

### SEARCH BLOCKS ----
  block_defs  <- strsplit(DEFINITION[index], split = " ")[[1]]
  nBlocks     <- length(block_defs)

This problem does not arise with every module; it only occurs in more complicated modules that have "nested" blocks (for lack of a better word). For example, module M00002 (https://www.genome.jp/kegg-bin/show_module?M00002) triggers it. The function in this package comes up with 6 blocks (nBlocks), while there are only 5 blocks according to my understanding of KEGG module definitions. This also matches the 5 blocks assumed in KEGG Mapper. The cause appears to be that the above chunk ignores the presence of "(" and ")", i.e. it also splits on spaces that occur WITHIN a block instead of only on spaces that separate blocks.
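To make the over-count concrete, here is a minimal reproduction. The definition string is an assumption: it is written from memory to have the same nested shape as the M00002 definition and may not match the current KEGG entry exactly, so please check it against the link above. The variable names are illustrative and not taken from the package.

```r
# Assumed definition string with the nested shape of M00002 (verify against KEGG).
definition <- "K01803 ((K00134,K00150) K00927,K11389) (K01834,K15633,K15634,K15635) K01689 (K00873,K12406)"

# Same splitting rule as the chunk quoted above: every space starts a new block.
naive_blocks <- strsplit(definition, split = " ")[[1]]

length(naive_blocks)   # 6, although only 5 top-level blocks are intended
naive_blocks[2:3]      # the nested block is cut in two at its inner space
```

The second and third elements come out as "((K00134,K00150)" and "K00927,K11389)", which together are a single block in the definition.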

This problem should lead to a systematic underestimation of the completeness of pathways whenever it inflates the number of blocks in a module. Adjusting the splitting so that it only splits on actual block boundaries (spaces outside of any "(" or ")"; something that can probably be done with a regex, I guess) should solve this issue.
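As a rough sketch of what such a fix could look like, the helper below tracks parenthesis depth and only treats a space as a block separator when the depth is zero. It avoids regex entirely, since arbitrarily nested parentheses are awkward to match that way. The function name split_top_level_blocks and the example definition string are hypothetical; neither comes from MetQy, and the definition is written from memory to resemble M00002. It also assumes the parentheses in the definition are balanced.

```r
# Hypothetical helper, not part of MetQy: split a KEGG module definition
# on spaces that lie outside any parentheses.
split_top_level_blocks <- function(definition) {
  chars <- strsplit(definition, split = "")[[1]]

  # Nesting depth after each character: +1 for "(", -1 for ")".
  depth <- cumsum((chars == "(") - (chars == ")"))

  # A space separates two blocks only when it sits at depth 0.
  is_sep <- chars == " " & depth == 0

  # Characters between consecutive separators share one block id.
  block_id <- cumsum(is_sep)
  pieces   <- split(chars[!is_sep], block_id[!is_sep])

  blocks <- vapply(pieces, paste, character(1), collapse = "")
  unname(blocks)
}

# Assumed definition string with the nested shape of M00002 (verify against KEGG).
definition <- "K01803 ((K00134,K00150) K00927,K11389) (K01834,K15633,K15634,K15635) K01689 (K00873,K12406)"

block_defs <- split_top_level_blocks(definition)
nBlocks    <- length(block_defs)   # 5
```

With this split, the nested block "((K00134,K00150) K00927,K11389)" stays in one piece. The code that evaluates each block would still need to handle the space inside it, so this only addresses the block count, not the downstream logic.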

asmvernon commented 1 year ago

Thanks for flagging. Unfortunately this package is no longer being actively maintained. Let me know if you'd be interested in contributing the fix to the package though!