Open · nconrad opened this issue 9 years ago
I added GPR to the reaction object.
I'm not sure how practical this is, but could you use SOLR for retrieving the details on reactions, compounds, and biomass? Then all that would need to be returned is the id for compounds, the id and gene list for reactions, and the id and coefficient for biomass.
I think the issue with that is that there is information specific to the model, such as compartments, directionality of the reactions, custom biomass(?), and, in particular, custom reactions (and their ids, names, etc.). I'm all for improving the backend by handling this data in some other manner.
For example, one option may be to only have ids, compartments(?), directionality, and whatever else is needed as part of a user's model. This would require custom things (reactions) to be pushed to solr when a custom reaction is created. Not sure if that is feasible or not, or the right solution.
@cshenry may have other plans to put models entirely in SOLR, I'm not sure.
Either option is fine with me... I just think we should try to get logic for constructing GPRs, equations, etc, out of client side code and reduce the amount of data sent to the client. Related, https://github.com/PATRIC3/Workspace/issues/31
Good point on custom reactions, although there currently is no way to import a model from another source so I'm not sure how you could get a custom reaction into a model.
For reaction directionality, that is only changed by gap filling. Which brings up the bigger issue of how gap filled reactions are handled. When you retrieve a model object directly from the workspace, you are getting the base model and the gap filled reactions are in the separate gapfill object. When the model service retrieves a model, gapfill objects that are marked as integrated are merged into the model. How are you going to identify gap filled reactions? Will they be in the list on the reactions tab or only in the gapfill tab?
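For context, a minimal sketch of what that merge amounts to when the model service retrieves a model; the field names (gapfillings, integrated, reactions, modelreactions) are illustrative assumptions, not the actual typed-object spec:

```perl
# Hedged sketch: fold reactions from gapfill objects that are flagged
# as integrated into the base model's reaction list. Field names are
# illustrative assumptions.
for my $gf (@{ $model->{gapfillings} || [] }) {
    next unless $gf->{integrated};
    push @{ $model->{modelreactions} }, @{ $gf->{reactions} || [] };
}
```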
Here are my proposed changes to the spec file for the get_model() method.
```
typedef structure {
    reaction_id id;
    string name;
    string definition;
    string gpr;
    list<gene_id> genes;
} model_reaction;

typedef structure {
    compound_id id;
    string name;
    string formula;
    float charge;
} model_compound;

typedef structure {
    gene_id id;
    list<reaction_id> reactions;
} model_gene;

typedef structure {
    compartment_id id;
    string name;
    float pH;
    float potential;
} model_compartment;

typedef structure {
    biomass_id id;
    list<tuple<compound_id compound, float coefficient, compartment_id compartment>> compounds;
} model_biomass;

typedef structure {
    ref ref;
    list<model_reaction> reactions;
    list<model_compound> compounds;
    list<model_gene> genes;
    list<model_compartment> compartments;
    list<model_biomass> biomasses;
} model_data;

/*
    FUNCTION: get_model
    DESCRIPTION: This function gets model data

    REQUIRED INPUTS:
        ref model - reference to model to get
*/
typedef structure {
    ref model;
} get_model_params;

authentication required;
funcdef get_model(get_model_params input) returns (model_data output);
```
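For reference, a minimal sketch of calling this from Perl; the client module name, endpoint URL, and model reference are all illustrative assumptions, not the service's actual values:

```perl
use strict;
use warnings;
use ProbModelSEEDClient;    # hypothetical generated client for this spec

# Endpoint and model reference below are placeholders.
my $client = ProbModelSEEDClient->new("https://example.org/services/ProbModelSEED");
my $model  = $client->get_model({ model => "/myuser/home/models/MyModel" });

printf "%d reactions, %d compounds, %d genes\n",
    scalar @{ $model->{reactions} },
    scalar @{ $model->{compounds} },
    scalar @{ $model->{genes} };
```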
@nconrad, a couple of questions:
@nconrad Do you have any performance targets for the get_model() method? On my test system, get_model() is taking 15 seconds. Most of the time is in retrieving the model from the workspace and building the reactions array.
15 seconds!? I think <1 sec, including transfer to the client for a 2MB model. Milliseconds for computation. Chris suggested not using the built-in Perl object API when doing this because that was the cause of slowness in the last version of get_models... I'm not sure why, though.
...oh... How many models is this?
15 seconds is for 1 model and yes, it appears the Perl object API (or its implementation) is part of the problem. I'm seeing lots of requests to pull the genome and the biochemistry from the PATRICStore cache when running get_model(). The code to create the reaction equations is buried in the objects though.
"lots of requests to pull the genome and the biochemistry" sounds familiar now. Glad you are here to help, Mike. :)
Let me send you a bunch of code via email.
1.) Models will be put into SOLR eventually, although I think we may only do this for models built from PATRIC.
- the indexing takes a long time (10-20 minutes)
- retrieving data from SOLR is slow

2.) I think avoiding the object API will save time, but pulling down all the gap filling and, eventually, all the edits is also going to be slow. I'm not sure <1 sec will be feasible. But I think we can do better than 15 sec.
Yeah, Mike, I think we should avoid the API entirely when speed is needed. Manipulate the data directly.

Other things I'd like to do include:

1.) Keep the biochemistry in a cached file on ProbModelSEED, retrieved only when missing (so I can trigger an update at any time by logging into the server and deleting the file)
2.) Keep the fully loaded biochemistry in memory (KBaseFBAModeling does this)
3.) We shouldn't need to pull the biochemistry for most actions, so I'd like to strip out most calls to the biochemistry
4.) Do something similar to avoid calls for genomes, which will also be slow
5.) Caching genome objects from PATRIC SOLR in the workspace will help, and we should probably do this. But I don't want to lose the connection to PATRIC SOLR. We need to know that the genome originally came from PATRIC SOLR. I guess we put that in the genome object that gets cached.
6.) In the mid-term, I'd like to change the object API to work differently. No longer use Moose. Instead of building an object, simply store a ref to a hash with the object data. This will be MUCH faster. Strip out the "lazy" loading logic at that point. Do not do validation on "load". Only do validation on "serialization".
Mike, how comfortable are you with working on 1-5? If I could get some cycles to finally do 6, we would go a long way toward massively improving performance.
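A minimal sketch of the file cache in (1), assuming JSON serialization; the cache path and the fetch_biochemistry_from_workspace() helper are hypothetical stand-ins:

```perl
use strict;
use warnings;
use JSON::XS qw(encode_json decode_json);
use File::Slurp qw(read_file write_file);

my $CACHE_FILE = "/var/cache/probmodelseed/biochemistry.json";   # hypothetical path

sub get_biochemistry {
    # Refetch only when the cache file is missing, so deleting the file
    # on the server forces a refresh on the next request.
    unless (-e $CACHE_FILE) {
        my $data = fetch_biochemistry_from_workspace();   # stand-in for the real workspace call
        write_file($CACHE_FILE, encode_json($data));
    }
    return decode_json(scalar read_file($CACHE_FILE));
}
```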
@mmundy42
> - In a model_reaction, is it necessary to have the definition in terms of compound IDs? I'm not seeing where that is used when displaying reactions.
I was thinking this would be an easy way of making the equation compounds clickable and mapping to images via IDs. By having the id representation, I could put links or such on the human-readable version. We could do this another way, such as lists, but I figured this was a friendly way of doing it and I think the ID representation would be useful to have in the API anyway... perhaps for other clients.
Note: I think in the last version of get_models that did this, the reactants and products were not consistent and were on different sides... but I don't quite remember. It may have been a bug that was never fixed.
> - In a model_biomass, would it help to have the compounds in two lists, one for reactants and one for products?
I'm soliciting input from users for this one. I can get back to you tomorrow.
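In the meantime, for what it's worth, a minimal sketch of the two-list split, assuming the convention that a negative coefficient in the model_biomass tuple list marks a reactant and a positive one a product (that sign convention is an assumption here):

```perl
# Split the biomass tuple list into reactants and products by
# coefficient sign (assumed convention: negative = consumed).
# $biomass is a model_biomass structure as defined in the spec above.
my (@reactants, @products);
for my $entry (@{ $biomass->{compounds} }) {
    my ($compound_id, $coefficient, $compartment_id) = @$entry;
    push @{ $coefficient < 0 ? \@reactants : \@products }, $entry;
}
```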
Some more details on where time is being spent getting a Model object. All timings from one call to the new get_model() method.
There are lots of lookups against the biochemistry object when processing the reactions from a gapfilling object to get a linked object, but each lookup is fast (less than a millisecond). Also note that the workspace service and SOLR requests are going across the network from Mayo to ANL.
> Yeah, Mike, I think we should avoid the API entirely when speed is needed. Manipulate the data directly.
This is complicated by the way gapfilling and edits are handled. By keeping them in separate objects, they need to be "integrated" every time the model object is retrieved from the workspace. If get_model() needs to return the integrated model, I don't really want to duplicate all of the code to integrate the gapfilling and edits. And the gapfilling object doesn't have the details on the reactions so there would still need to be a lookup to get them.
> 1.) Keep the biochemistry in a cached file on ProbModelSEED, retrieved only when missing (so I can trigger an update at any time by logging into the server and deleting the file)
Yes, keeping a local copy of the biochemistry in the file system on the server running the service would help. Although I'm not sure how that would work with apps.
> 2.) Keep the fully loaded biochemistry in memory (KBaseFBAModeling does this)
Isn't the cache in PATRICStore already effectively doing this? If the pipelines are being run from apps, there is minimal benefit to loading the biochemistry in the server.
> 3.) We shouldn't need to pull the biochemistry for most actions, so I'd like to strip out most calls to the biochemistry
I'm not sure I understand this one. It seems like there are a fair number of references to reactions, compounds, and compartments in the biochemistry or model template objects.
> 5.) Caching genome objects from PATRIC SOLR in the workspace will help, and we should probably do this. But I don't want to lose the connection to PATRIC SOLR. We need to know that the genome originally came from PATRIC SOLR. I guess we put that in the genome object that gets cached.
PATRICStore already caches genome objects from SOLR. And I added saving the genome object to the user's workspace as a part of adding the prob anno support.
> 6.) In the mid-term, I'd like to change the object API to work differently. No longer use Moose. Instead of building an object, simply store a ref to a hash with the object data. This will be MUCH faster.
I agree that moving away from Moose would be a good idea. Although if you want better performance, maybe we should consider implementing modeling in Java.
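To make (6) concrete, a hedged sketch of the hash-ref approach: decode the workspace JSON once and read fields directly, with no Moose construction, lazy builders, or per-field validation. The modelreactions field name is an assumption about the typed-object layout:

```perl
use strict;
use warnings;
use JSON::XS qw(decode_json);

# $raw_json is the model document as fetched from the workspace.
my $model = decode_json($raw_json);

# Plain hash access; nothing is built or validated until we serialize.
for my $rxn (@{ $model->{modelreactions} || [] }) {
    my ($id, $name) = @{$rxn}{qw(id name)};
    # ... build the response entry directly from the hash ...
}
```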
Added `gapfill_refs`, `gapfill_dir`, and `evidence` fields.
First pass is in pull request #33. I still need to add the three fields that we decided to add on 6/29.
Some questions on the new fields:
@mmundy42, so much for notes... I don't remember the reasoning for some of this either. But what you have said in number 3 makes sense to me. @cshenry?
Here's an update on the current performance, using one of the larger microbial models, "Escherichia coli str. K-12 substr. MG1655star".
Calling workspaces directly:
Calling get_model:
People have also requested gene names in the gene table, if that's possible.
i.e.,
```
typedef structure {
    gene_id id;
    string name;
    list<reaction_id> reactions;
} model_gene;
```
In the original feature sub-object, the spec has `list<string> aliases`.
One may return the first alias, or the list of aliases, but there's no guarantee that any alias is effectively the original gene name or an external gene identifier.
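Given that caveat, a small sketch of one possible fallback policy; the ordering heuristic is an assumption, since nothing in the spec guarantees the first alias is the canonical name:

```perl
# Pick a display name for a gene: first alias if one exists, otherwise
# fall back to the feature id. Nothing guarantees aliases->[0] is the
# canonical gene name.
sub gene_display_name {
    my ($feature) = @_;
    my $aliases = $feature->{aliases} // [];
    return @$aliases ? $aliases->[0] : $feature->{id};
}
```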
From email:
Looking at the model data, it would be something like this:
`{reactions: Array[1051], compounds: Array[1103], genes: Array[1956], compartments: Array[2], biomass: Array[100]}`
Where each reaction object (in the array) is of the form:
Note: The format for equation and definition is debatable.
A compound has the form:
A gene has the form:
A compartment has the form:
And, biomass:
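The per-object examples from the original email did not survive in this thread. As a hedged reconstruction from the model_reaction typedef above (the field values and ids are made up for illustration), a reaction entry would look roughly like:

```perl
# Illustrative reaction entry, reconstructed from the model_reaction
# typedef; values are invented for the example.
my $reaction = {
    id         => "rxn00001_c0",
    name       => "diphosphate phosphohydrolase",
    definition => "(1) cpd00001[c0] + (1) cpd00012[c0] => (2) cpd00009[c0]",
    gpr        => "(fig|83333.1.peg.3826)",
    genes      => [ "fig|83333.1.peg.3826" ],
};
```

The compound, gene, compartment, and biomass entries would follow their respective typedefs in the same way.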