Closed asmvernon closed 3 years ago
The modules are defined as blocks of logical expressions. I think of blocks as either an enzyme (usually a single KO where you can have alternative ones - usually closely related) or a molecular complex where two or more subunits are required.
The example you provided, K0001+K0002, only has one block and it indicates that both KOs K0001 AND K0002 are required for the block to be complete.
The mcf
is defined by the number of blocks that are complete divided by the total number of blocks in the module definition.
mcf = n_blocks_present / n_blocks
As stated before, the example module only has one block, so n_blocks = 1
You can have three cases:
n_blocks_present = 1
n_blocks_present = 0
n_blocks_present = 0
In the first case, the mcf would be 1, while in the other two cases the mcf would be zero.
In terms of the code: query_genomes_to_modules calls misc_geneVector_module. In line 276, it starts a for loop over the modules. For each module, Line 282 determines the number of blocks.
The module evaluation is carried out by the function misc_evaluate_block in line 118
present <- eval(expr = parse(text = gsub(".-","",BLOCK)),EC_KO_data)
This line is an evaluation of the logical expression that makes up a block.
The "expr
" field would be "K0001+K0002" (or "K0001&K0002" to make it into a computer-readable logical expression), and "EC_KO_data
" would be the list of KOs that you have.
This returns a boolean: you either have that block or you don't.
misc_geneVector_module
stores the boolean value for each block in the module definition (line 284) and then in line 288, it adds the total blocks and divides by the total number of blocks, creating the fraction between 0 and 1.
Original message
I have a couple of questions, and I was wondering if you could provide some clues. I couldn't find detailed information about the module completeness fraction calculation. Do you have any specification available? I surfed your code but got lost. For instance, what happens when a module which is defined as (say) K0001+K0002, only one of the genes is found. mcf = 0.5 or mcf = 0? I have doubts about how to interpret the index in different scenarios. I have a case where I can find either one or both genes in all my genomes, but still I get a mcf = 0 for some of them. I was expecting 0.5 for those which only have 1 of them. How is this fraction calculated?