cdanielmachado / carveme

CarveMe: genome-scale metabolic model reconstruction
Other
144 stars 49 forks source link

CarveMe doesn’t satisfy the original protein complex definitions during the building of GPRs #182

Open lazzarigioele opened 1 year ago

lazzarigioele commented 1 year ago

Using carveme v1.5.2.

My opinion reading your code is that there is no real check whether a particular protein complex, as it is defined in a particular model for a particular reaction, is really satisfied or not. This in my opinion can lead to wrong GPRs.

I report here below your code contained in carveme/reconstruction/scoring.py , converting the table 'gene_scores' in 'protein_scores', and finally converting 'protein_scores' in 'reaction_scores'. Then I try to give a practical example.

From gene to protein scores:

def merge_subunits(genes):
    genes = genes.dropna()
    if len(genes) == 0:
        return None
    else:
        protein = ' and '.join(sorted(genes))
        if len(genes) > 1:
            return '(' + protein + ')'
        else:
            return protein

def merge_subunit_scores(scores):
    return scores.fillna(0).mean()

protein_scores = gene_scores.groupby(['protein', 'reaction', 'model'], as_index=False) \
        .agg({'query_gene': merge_subunits, 'score': merge_subunit_scores})
protein_scores.rename(columns={'query_gene': 'GPR'}, inplace=True)

From protein to reaction scores:

def merge_proteins(proteins):
    proteins = set(proteins.dropna())
    if not proteins:
        return None
    gpr_str = ' or '.join(sorted(proteins))
    if len(proteins) > 1:
        return '(' + gpr_str + ')'
    else:
        return gpr_str

def merge_protein_scores(scores):
    return scores.max(skipna=True)

reaction_scores = protein_scores.groupby(['reaction'], as_index=False) \
        .agg({'GPR': merge_proteins, 'score': merge_protein_scores}).dropna()
avg_score = reaction_scores.query('score > 0')['score'].median()
reaction_scores['normalized_score'] = (reaction_scores['score'] / avg_score).apply(lambda x: round(x, 1))

Suppose now this is my 'gene_scores' table:

query_gene BiGG_gene score gene protein reaction model
NaN STM_v1_0.STM0970 NaN G_STM0970 P_STM0970+STM0973 R_PFL STM_v1_0
gene_7500 STM_v1_0.STM0843 784.0 G_STM0843 P_STM0843+STM0844 R_PFL STM_v1_0
gene_18818 STM_v1_0.STM0973 864.0 G_STM0973 P_STM0970+STM0973 R_PFL STM_v1_0
NaN STM_v1_0.STM0970 NaN G_STM0970 P_STM0970+STM3241 R_PFL STM_v1_0
gene_18818 STM_v1_0.STM3241 846.0 G_STM3241 P_STM0970+STM3241 R_PFL STM_v1_0
NaN STM_v1_0.STM4114 NaN G_STM4114 P_STM4114+STM4115 R_PFL STM_v1_0
NaN STM_v1_0.STM4115 NaN G_STM4115 P_STM4114+STM4115 R_PFL STM_v1_0
gene_13306 STM_v1_0.STM0844 156.0 G_STM0844 P_STM0843+STM0844 R_PFL STM_v1_0

Applying the above code would result in this 'protein_score' table:

protein reaction model GPR score
P_STM0843+STM0844 R_PFL STM_v1_0 (gene_13306 and gene_7500) 470.0
P_STM0970+STM0973 R_PFL STM_v1_0 gene_18818 432.0
P_STM0970+STM3241 R_PFL STM_v1_0 gene_18818 423.0
P_STM4114+STM4115 R_PFL STM_v1_0 None 0.0

And finally in this 'reaction_score' table:

reaction GPR score normalized_score
R_PFL ((gene_13306 and gene_7500) or gene_18818) 470.0 1.0

As you can read , the final GPR is ((gene_13306 and gene_7500) or gene_18818). The member “(gene_13306 and gene_7500)” is correct, as the both the components of the complex P_STM0843+STM0844 were well matched by Diamond.

Instead, the member “gene_18818” should not appear in the final GPR in my opinion, because neither the complex P_STM0970+STM0973 nor the complex P_STM0970+STM3241 were satisfied, since Diamond didn’t really catch the gene STM_v1_0.STM0970 (score = Nan).

This simplified example pretend to work with a 'gene_scores' table having just 1 model with 1 reaction, anyway I think it is enough to highlight the issue. With a real-life 'gene_scores' table, there is the chance that other models containing the same reaction can make the final GPR actually correct, balancing the errors. But it is just a chance...

Maybe it could be useful to have a dedicated option, like carve --strictgpr, if the user wants the original protein complex definitions to be strictly satisfied.