bug: full ecHumanGEM has multiple usage reactions for the same proteins

wshao1 commented 1 year ago

Description of the bug:

I have an ecHumanGEM created using the full_ecModel protocol, which I want to reduce to a context-specific GEM using getSubsetEcModel. However, this returns an error due to the ecHumanGEM having multiple protein usage reactions with the same ID (i.e. 'usage_prot_Q0WX57'). This results in the following traceback:

Unable to perform assignment because the left and right sides have a different number of elements.

Error in getIndexes (line 79)
            indexes(i)=index;

Error in removeReactions (line 37)
    indexesToDelete=getIndexes(model,rxnsToRemove,'rxns');

Error in removeGenes (line 84)
            reducedModel = removeReactions(reducedModel,rxnsToRemove,removeUnusedMets,true);

Error in getSubsetEcModel (line 26)
smallEcModel = removeGenes(bigEcModel,genesToRemove,true,true,false);

Reproducing these results:

% STEP 30: Contextualize ecModels
% To exemplify the construction of a context-specific ecModel, a
% conventional GEM of HT-29 cell line is loaded.
HCT116 = load('models/hct116_ftINIT_model_40.mat')

% Make a context-specific ecModel based on the generic Human-GEM.
ecModel = loadEcModel('ecHumanGEM.yml');
ecHCT116 = getSubsetEcModel(ecModel,HCT116.hct116_model);
saveEcModel(ecModel,'ecHCT116.yml');

System information

Operating system: Windows 11
MATLAB version: 2021a
GECKO version: 3.1.1
RAVEN version: 2.8.4
Solver: Gurobi 10.0

I hereby confirm that:

[X] My GECKO installation met all requirements.
[X] This bug occurs in the main branch of the repository.
[X] A similar issue does not already exist.

edkerk commented 1 year ago

The problem is that ecHumanGEM should not have multiple protein usage reactions for the same protein (e.g. usage_prot_Q0WX57). This should normally have been prevented by #349, but ecHumanGEM uses the uniprotConversion.tsv that might still result in duplicate proteins.

edkerk commented 1 year ago

@wshao1 How do you generate ecHumanGEM? With the following code a draft model is generated, but the duplicate usage protein reaction cannot be found:

adapterLocation = fullfile(findGECKOroot,'tutorials','light_ecModel','HumanGEMAdapter.m');
adapter = ModelAdapterManager.setDefault(adapterLocation);
model = loadConventionalGEM();
[ecModel, noUniprot] = makeEcModel(model,false);
find(strcmp(ecModel.rxns,'usage_prot_Q0WX57'))

ans =

  0×1 empty double column vector

wshao1 commented 1 year ago

The only differences between my adapter and the one provided in the light_ecModel tutorial:

I chose to use Human-GEM v1.16 instead of v1.15 as the conventional GEM.
I did not run the following code to first simplify Human-GEM (lines 7-17 in HumanGEMAdapter.m):

% The model distributed with the light_ecModel tutorial is Human-GEM
% is version 1.15.0, available from
% https://github.com/SysBioChalmers/Human-GEM/releases/tag/v1.3.0
% In addition, the following lines were run to reduce its size
% before storing it in this GECKO tutorial:
% ihuman = simplifyModel(ihuman,false,false,true,true);
% [ihuman.grRules,skipped] = simplifyGrRules(ihuman.grRules,true);
% ihuman = deleteUnusedGenes(ihuman);
% ihuman = rmfield(ihuman,{'unconstrained','rxnReferences','rxnFrom','metFrom','rxnConfidenceScores'});
% ihuman.name = ihuman.id;
% writeYAMLmodel(ihuman,'human-GEM.yml')

It is unclear to me why this simplification is required, and I could not find the simplifyGrRules function in the RAVEN toolbox repository to figure out what was actually happening, so I skipped this step. I just tried running it right now, and it seems like the function cannot be found: Unrecognized function or variable 'simplifyGrRules'.

edkerk commented 1 year ago

I can replicate this with Human-GEM v1.16. The problem is with reaction MAR09490 whose gene association contains multiple isoenzymes that are all annotated to the same UniProt ID.

As a work-around, the duplicated usage_prot_XXXXXX reactions can safely be deleted. The following code keeps only one instance of the duplicated reactions:

% Make a list of unique reactions
[~,uniqueRxns] = unique(ecModel.rxns);
duplRxns = ~ismember(1:numel(ecModel.rxns), uniqueRxns);

% Remove from all reaction-related model fields
ecModel.rxns(duplRxns)          = [];
ecModel.rxnNames(duplRxns)      = [];
ecModel.S(:,duplRxns)           = [];
ecModel.lb(duplRxns)            = [];
ecModel.ub(duplRxns)            = [];
ecModel.rev(duplRxns)           = [];
ecModel.c(duplRxns)             = [];
ecModel.grRules(duplRxns)       = [];
ecModel.rxnGeneMat(duplRxns,:)  = [];
ecModel.rxnFrom(duplRxns)       = [];

The above code should work, but I haven't tested the resulting model with getSubsetEcModel as how you use it.

In the meanwhile, we'll work on changes to makeEcModel to prevent this from happening. It is not super trivial how to fix this, as it also means that there are duplicate entries in ecModel.ec.enzymes (while their corresponding ecModel.ec.genes and ecModel.ec.rxnEnzMat entries are unique). It is therefore probably best to allow for duplicate enzymes (but not genes) in ecModel.ec, but this means it needs to be checked that this does not cause any issues when e.g. running selectKcatValue or constrainEnzConcs.

Regarding simplifyGrRules, this function is from Human-GEM and it's useful to ensure that the grRules are defined in an expanded format so makeEcModel can correctly split reactions by isoenzymes (done by RAVEN's expandModel). RAVEN's standardizeGrRules is also able to do this. We might want to reconsider how makeEcModel handles this as well. However, this is unrelated to the problem encountered in this Issue.

wshao1 commented 1 year ago

Thanks @edkerk, removing the duplicated reactions allows getSubsetEcModel to run.

The resulting model shows zero growth rate however.

Growth rate in HCT116-GEM: 86.6224865446 /hour.
Growth rate in ecHuman-GEM: 0.0871287888 /hour.
Growth rate in ecHCT116-GEM: 0.0000000000 /hour.

I have also tried extracting an ecHT29 using the HT29-GEM from the light_ecModel tutorial, but that also yields zero growth rate.

Growth rate in HT29-GEM: 149.9672040992 /hour.
Growth rate in ecHuman-GEM: 0.0871287888 /hour.
Growth rate in ecHT29-GEM: 0.0000000000 /hour.

Integrating proteomics data and then flexibilizing enzyme concentrations results in the following error, suggesting getSubsetEcModel may have returned a non-functional model.

No (more) limiting enzymes have been found. Attempting to increase protein pool exchange...
Protein pool exchange was also not limiting. Inability to reach growth rate is not related to
enzyme constraints. Maximum growth rate is 0.

I will investigate this further.

edkerk commented 10 months ago

The context-specific ecModel failing to simulate is likely the result of the large ecModel and the context-specific non-ec GEM that are used as input by getSubsetEcModel are not derived from the same starting GEM. The latter is essential, as only then can a reliable subset of reactions be selected.

This is now clarified in the function documentation, and getSubsetEcModel tries to prevent such cases by at least checking that all the reactions from the context-specific non-ec GEM can be found in the large ecModel. This is not fool-proof, as other curations could have been done while keeping the same reaction identifiers, but a full check on whether both models have derived from the same starting GEM is outside of the scope.

SysBioChalmers / GECKO