Sharing user question about mcf calculation

OSS-Lab / MetQy

Repository for R package MetQy (read related publication here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6247936/)

Other

18 stars 9 forks source link

Sharing user question about mcf calculation #5

Closed asmvernon closed 3 years ago

asmvernon commented 3 years ago

Original message

I have a couple of questions, and I was wondering if you could provide some clues. I couldn't find detailed information about the module completeness fraction calculation. Do you have any specification available? I surfed your code but got lost. For instance, what happens when a module which is defined as (say) K0001+K0002, only one of the genes is found. mcf = 0.5 or mcf = 0? I have doubts about how to interpret the index in different scenarios. I have a case where I can find either one or both genes in all my genomes, but still I get a mcf = 0 for some of them. I was expecting 0.5 for those which only have 1 of them. How is this fraction calculated?

asmvernon commented 3 years ago

The modules are defined as blocks of logical expressions. I think of blocks as either an enzyme (usually a single KO where you can have alternative ones - usually closely related) or a molecular complex where two or more subunits are required.

The example you provided, K0001+K0002, only has one block and it indicates that both KOs K0001 AND K0002 are required for the block to be complete.

The mcf is defined by the number of blocks that are complete divided by the total number of blocks in the module definition.

mcf = n_blocks_present / n_blocks

As stated before, the example module only has one block, so n_blocks = 1 You can have three cases:

You have both genes >> n_blocks_present = 1
You have either of the genes (ie only one gene) >> n_blocks_present = 0
You have neither of the genes >> n_blocks_present = 0

In the first case, the mcf would be 1, while in the other two cases the mcf would be zero.

asmvernon commented 3 years ago

In terms of the code: query_genomes_to_modules calls misc_geneVector_module. In line 276, it starts a for loop over the modules. For each module, Line 282 determines the number of blocks.

The module evaluation is carried out by the function misc_evaluate_block in line 118 present <- eval(expr = parse(text = gsub(".-","",BLOCK)),EC_KO_data)

This line is an evaluation of the logical expression that makes up a block. The "expr" field would be "K0001+K0002" (or "K0001&K0002" to make it into a computer-readable logical expression), and "EC_KO_data" would be the list of KOs that you have.

This returns a boolean: you either have that block or you don't.

misc_geneVector_module stores the boolean value for each block in the module definition (line 284) and then in line 288, it adds the total blocks and divides by the total number of blocks, creating the fraction between 0 and 1.