ModelSEED / ProbModelSEED


get_model (Model Table) API Method #22

Open nconrad opened 9 years ago

nconrad commented 9 years ago

From email:

Looking at the model data, it would be something like this:

{reactions: Array[1051], compounds: Array[1103], genes: Array[1956], compartments: Array[2], biomass: Array[100]}

Where each reaction object (in the array) is of the form:

{
    "name": "L-Threonine acetaldehyde-lyase",
    "id": "rxn00541_c0",
    "eq": "L-Threonine[c0] <=> Acetaldehyde[c0] + Glycine[c0]”,
    "def": <same thing as equation but with ids for the compounds>,
    "gpr": "fig|487976.3.peg.3553 and (fig|487976.3.peg.4228 or fig|487976.3.peg.2871)",
    "evidence": <some string>,
    "gapfill_dir": true,
    "gapfill_refs": [{id: "gap.3", direction: ">"}, 
                           {id: "gap.10", direction: "<"}, ... ],
    "genes": [
        "fig|487976.3.peg.3553",
        "fig|487976.3.peg.4228",
        "fig|487976.3.peg.2871"
    ]
}

Note: The format for equation and definition is debatable.

A compound has the form:

{
    "id": "cpd00113_c0”,
    "name": "Isopentenyldiphosphate",
    "formula": "C5H10O7P2",
    "charge": -2,
}

A gene has the form:

{
    "id": "fig|29459.15.peg.1915”,
    "rxns": [“rxn12008_c0”, … ,“rxn03892_c0" ]
}

A compartment has the form:

{
    "id": "c0",
    "name": "Cytosol",
    "pH": 7,
    "potential": 0
}

And, biomass:

{
    "id": "bio1",
    "cpdID": "cpd00084_c0",
    "name": "L-Cysteine",
    "coefficient": -0.0569540049395353,
    "compartment": "c0"
}
nconrad commented 9 years ago

I added GPR to the reaction object.

mmundy42 commented 9 years ago

I'm not sure how practical this is, but could you use SOLR for retrieving the details on reactions, compounds, and biomass? Then all that would need to be returned is the id for compounds, the id and gene list for reactions, and the id and coefficient for biomass.

nconrad commented 9 years ago

I think the issue with that is that there is information specific to the model, such as compartments, directionality of the reactions, custom biomass(?), and, in particular, custom reactions (and their ids, names, etc.). I'm all for improving the backend by handling this data in some other manner.

For example, one option may be to only have ids, compartments(?), directionality, and whatever else is needed as part of a user's model. This would require custom things (reactions) to be pushed to SOLR when a custom reaction is created. I'm not sure whether that is feasible, or the right solution.

@cshenry may have other plans to put models entirely in SOLR, I'm not sure.

Either option is fine with me... I just think we should try to get the logic for constructing GPRs, equations, etc. out of client-side code and reduce the amount of data sent to the client. Related: https://github.com/PATRIC3/Workspace/issues/31
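For what it's worth, here is a rough sketch (Python, purely illustrative) of what that server-side construction could look like, assuming the stoichiometry and a compound-name lookup are already in hand; the helper name and the compound IDs in the example are made up:

# Illustrative only: build the name-based "eq" and the id-based "def" strings
# for a reaction from its stoichiometry, given a {compound_id: name} lookup.
# Direction handling is omitted; everything is rendered as reversible ("<=>").
def build_equation(stoichiometry, compound_names):
    # stoichiometry: list of (coefficient, compound_id, compartment_id) tuples
    lhs, rhs = [], []
    for coeff, cpd_id, cmp_id in stoichiometry:
        name_token = "%s[%s]" % (compound_names.get(cpd_id, cpd_id), cmp_id)
        id_token = "%s[%s]" % (cpd_id, cmp_id)
        (lhs if coeff < 0 else rhs).append((name_token, id_token))
    eq = " + ".join(n for n, _ in lhs) + " <=> " + " + ".join(n for n, _ in rhs)
    definition = " + ".join(i for _, i in lhs) + " <=> " + " + ".join(i for _, i in rhs)
    return eq, definition

# Example, using made-up IDs for the threonine aldolase reaction above:
names = {"cpd00161": "L-Threonine", "cpd00071": "Acetaldehyde", "cpd00033": "Glycine"}
stoich = [(-1, "cpd00161", "c0"), (1, "cpd00071", "c0"), (1, "cpd00033", "c0")]
print(build_equation(stoich, names)[0])
# L-Threonine[c0] <=> Acetaldehyde[c0] + Glycine[c0]

Doing this once on the server would mean the client only ever sees the finished strings.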

mmundy42 commented 9 years ago

Good point on custom reactions, although there is currently no way to import a model from another source, so I'm not sure how you could get a custom reaction into a model.

For reaction directionality, that is only changed by gap filling, which brings up the bigger issue of how gap-filled reactions are handled. When you retrieve a model object directly from the workspace, you are getting the base model, and the gap-filled reactions are in the separate gapfill object. When the model service retrieves a model, gapfill objects that are marked as integrated are merged into the model. How are you going to identify gap-filled reactions? Will they be in the list on the reactions tab or only in the gapfill tab?

mmundy42 commented 9 years ago

Here are my proposed changes to the spec file for the get_model() method.

typedef structure {
    reaction_id id;
    string name;
    string definition;
    string gpr;
    list<gene_id> genes;
} model_reaction;

typedef structure {
    compound_id id;
    string name;
    string formula;
    float charge;
} model_compound;

typedef structure {
    gene_id id;
    list<reaction_id> reactions;
} model_gene;

typedef structure {
    compartment_id id;
    string name;
    float pH;
    float potential;
} model_compartment;

typedef structure {
    biomass_id id;
    list<tuple<compound_id compound, float coefficient, compartment_id compartment>> compounds;
} model_biomass;

typedef structure {
    ref ref;
    list<model_reaction> reactions;
    list<model_compound> compounds;
    list<model_gene> genes;
    list<model_compartment> compartments;
    list<model_biomass> biomasses;
} model_data;

/*
    FUNCTION: get_model
    DESCRIPTION: This function gets model data

    REQUIRED INPUTS:
    ref model - reference to model to get

*/      
typedef structure {
    ref model;
} get_model_params;
authentication required;
funcdef get_model(get_model_params input) returns (model_data output);
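
For a quick sanity check on the shape of the call, here is how a client might invoke it. This is only a sketch assuming a KBase-style JSON-RPC endpoint; the URL, the "ProbModelSEED." method prefix, and the token handling below are placeholders rather than actual deployment details:

import requests

def get_model(url, token, model_ref):
    # Hedged sketch of a JSON-RPC call; the "version"/"id" fields follow the
    # usual JSON-RPC 1.1 conventions and the service prefix is assumed.
    payload = {
        "method": "ProbModelSEED.get_model",
        "params": [{"model": model_ref}],
        "version": "1.1",
        "id": "1",
    }
    # authentication required: pass the auth token in the Authorization header
    resp = requests.post(url, json=payload, headers={"Authorization": token})
    resp.raise_for_status()
    # model_data: reactions, compounds, genes, compartments, biomasses
    return resp.json()["result"][0]

# model = get_model("https://example.org/services/ProbModelSEED", token, "/myuser/modelseed/MyModel")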

@nconrad, a couple of questions:

  1. In a model_reaction, is it necessary to have the definition in terms of compound IDs? I'm not seeing where that is used when displaying reactions.
  2. In a model_biomass, would it help to have the compounds in two lists, one for reactants and one for products?
mmundy42 commented 9 years ago

@nconrad Do you have any performance targets for the get_model() method? On my test system, get_model() is taking 15 seconds. Most of the time is in retrieving the model from the workspace and building the reactions array.

nconrad commented 9 years ago

15 seconds!? I think <1 sec, including transfer to the client for a 2 MB model, and milliseconds for computation. Chris suggested not using the built-in Perl object API when doing this, because that was the cause of the slowness in the last version of get_models... I'm not sure why, though.

nconrad commented 9 years ago

...oh... How many models is this?

mmundy42 commented 9 years ago

15 seconds is for 1 model, and yes, it appears the Perl object API (or its implementation) is part of the problem. I'm seeing lots of requests to pull the genome and the biochemistry from the PATRICStore cache when running get_model(). The code to create the reaction equations is buried in the objects, though.

nconrad commented 9 years ago

"lots of requests to pull the genome and the biochemistry" sounds familiar now. Glad you are here to help, Mike. :)

nconrad commented 9 years ago

Let me send you a bunch of code via email.

cshenry commented 9 years ago

1.) Models will be put into SOLR eventually, although I think we may only do this for models built from PATRIC.
  - the indexing takes a long time (10-20 minutes)
  - retrieving data from SOLR is slow

2.) I think avoiding the object API will save time, but pulling down all the gap filling and, eventually, all the edits is also going to be slow. I'm not sure <1 sec will be feasible. But I think we can do better than 15 sec.


cshenry commented 9 years ago

Yeah, Mike... I think we should avoid the API entirely when speed is needed. Manipulate the data directly.

Other things I'd like to do include:

1.) Keep the biochemistry in a cached file on ProbModelSEED, retrieved only when missing (so I can trigger an update at any time by logging into the server and deleting the file)

2.) Keep the fully loaded biochemistry in memory (KBaseFBAModeling does this)

3.) We shouldn't need to pull the biochemistry for most actions. So I'd like to strip out most calls to biochemistry

4.) Do something similar to avoid calls for genomes, which will also be slow

5.) Caching genome objects from PATRIC SOLR in workspace will help, and we should probably do this. But I don't want to lose the connection to PATRIC SOLR. We need to know that the genome originally came from PATRIC SOLR. I guess we put that in the genome object that gets cached.

6.) In the mid-term, I'd like to change the object API to work differently. No longer use moose. Instead of building an object, simply use a ref to a hash with the object data. This will be MUCH faster. Strip out the "lazy" loading logic at that point. Do not do validation on "load". Only do validation on "serialization".

Mike, how comfortable are you with working on 1-5? If I could get some cycles to finally do 6, we would go a long way toward massively improving performance.
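
For reference, item 1 boils down to a local file cache with a fetch-on-miss fallback; a minimal sketch (Python for illustration, with a hypothetical cache path and fetch function):

import json, os

BIOCHEM_CACHE = "/var/cache/probmodelseed/biochemistry.json"  # hypothetical path

def load_biochemistry(fetch_from_workspace):
    # Return the biochemistry data, reading the local cache file when present.
    # Deleting the cache file on the server forces a re-fetch on the next call,
    # which is the manual update trigger described in item 1.
    if os.path.exists(BIOCHEM_CACHE):
        with open(BIOCHEM_CACHE) as fh:
            return json.load(fh)
    data = fetch_from_workspace()  # slow path: pull from the workspace service
    os.makedirs(os.path.dirname(BIOCHEM_CACHE), exist_ok=True)
    with open(BIOCHEM_CACHE, "w") as fh:  # cache for subsequent requests
        json.dump(data, fh)
    return data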


nconrad commented 9 years ago

@mmundy42

  1. In a model_reaction, is it necessary to have the definition in terms of compound IDs? I'm not seeing where that is used when displaying reactions.

I was thinking this would be an easy way of making the equation compounds clickable and mapping them to images via IDs. By having the ID representation, I could put links or such on the human-readable version. We could do this another way, such as with lists, but I figured this was a friendly way of doing it, and I think the ID representation would be useful to have in the API anyway... perhaps for other clients.

Note: I think in the last version of get_models where I used to do this, the reactants and products were not consistent and ended up on different sides... but I don't quite remember. It may have been a bug that was never fixed.
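
To make the clickable-compound idea concrete: the name-based eq and the id-based def should line up token for token, so a client can zip them together to build links. A rough sketch (Python for illustration), assuming both strings share the same " + " / "<=>" layout:

import re

def equation_tokens(eq, definition):
    # Pair (display_name, compound_id) tokens from the two equation forms, e.g.
    #   "L-Threonine[c0] <=> Acetaldehyde[c0] + Glycine[c0]"
    #   "cpd00161[c0] <=> cpd00071[c0] + cpd00033[c0]"
    # The split is naive: it relies on " + " and the arrow being surrounded by
    # spaces, which keeps compound names containing "+" intact.
    split = lambda s: [t.strip() for t in re.split(r"\s\+\s|\s*<=>\s*|\s*=>\s*|\s*<=\s*", s) if t.strip()]
    return list(zip(split(eq), split(definition)))

# Each pair can then be rendered as a link, e.g.
# <a href="/compounds/cpd00161">L-Threonine[c0]</a>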

  2. In a model_biomass, would it help to have the compounds in two lists, one for reactants and one for products?

I'm soliciting input from users for this one. I can get back to you tomorrow.
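
Either way, the split itself is mechanical, since the sign of the coefficient already distinguishes the two sides (L-Cysteine above has a negative coefficient, i.e. it is consumed). A small illustrative sketch, assuming that sign convention:

def split_biomass(compounds):
    # compounds: list of (compound_id, coefficient, compartment_id) tuples.
    # Returns (reactants, products); negative coefficients are consumed,
    # positive ones are produced.
    reactants = [c for c in compounds if c[1] < 0]
    products = [c for c in compounds if c[1] >= 0]
    return reactants, products

# The second tuple below is made up for the example:
# split_biomass([("cpd00084_c0", -0.0569540049395353, "c0"), ("cpd11416_c0", 1.0, "c0")])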

mmundy42 commented 9 years ago

Some more details on where time is being spent getting a Model object. All timings from one call to the new get_model() method.

  1. 0.867 seconds to retrieve the model object data from the workspace service
  2. 2.803 seconds to convert the model object data from JSON and construct a Bio::KBase::ObjectAPI::KBaseFBA::FBAModel object
  3. 0.935 seconds to retrieve 1 gapfilling object from workspace (recursive option is on)
  4. 2.960 seconds to retrieve the biochemistry object data from the workspace service
  5. 1.852 seconds to convert the model object data from JSON
  6. 0.786 seconds to retrieve the model template object data from the workspace service
  7. 0.377 seconds to convert the model template object data from JSON
  8. 6.760 seconds to retrieve genome 224308.49 from SOLR

There are lots of lookups against the biochemistry object when processing the reactions from a gapfilling object to get a linked object, but each lookup is fast (less than a millisecond). Also note that the workspace service and SOLR requests are going across the network from Mayo to ANL.

mmundy42 commented 9 years ago

Yeah, Mike... I think we should avoid the API entirely when speed is needed. Manipulate the data directly.

This is complicated by the way gapfilling and edits are handled. By keeping them in separate objects, they need to be "integrated" every time the model object is retrieved from the workspace. If get_model() needs to return the integrated model, I don't really want to duplicate all of the code to integrate the gapfilling and edits. And the gapfilling object doesn't have the details on the reactions, so there would still need to be a lookup to get them.

1.) Keep the biochemistry in a cached file on ProbModelSEED, retrieved only when missing (so I can trigger an update at any time by logging into the server and deleting the file)

Yes, keeping a local copy of the biochemistry in the file system on the server running the service would help. Although I'm not sure how that would work with apps.

2.) Keep the fully loaded biochemistry in memory (KBaseFBAModeling does this)

Isn't the cache in PATRICStore already effectively doing this? If the pipelines are being run from apps, there is minimal benefit to loading the biochemistry in the server.

3.) We shouldn't need to pull the biochemistry for most actions. So I'd like to strip out most calls to biochemistry

I'm not sure I understand this one. It seems like there are a fair number of references to reactions, compounds, and compartments in the biochemistry or model template objects.

5.) Caching genome objects from PATRIC SOLR in workspace will help, and we should probably do this. But I don't want to lose the connection to PATRIC SOLR. We need to know that the genome originally came from PATRIC SOLR. I guess we put that in the genome object that gets cached.

PATRICStore already caches genome objects from SOLR. And I added saving the genome object to the user's workspace as a part of adding the prob anno support.

6.) In the mid-term, I'd like to change the object API to work differently. No longer use moose. Instead of building an object, simply use a ref to a hash with the object data. This will be MUCH faster.

I agree that moving away from moose would be a good idea. Although if you want better performance, maybe we should consider implementing the modeling in Java.

nconrad commented 9 years ago

Added the gapfill_refs, gapfill_dir, and evidence fields.

mmundy42 commented 9 years ago

First pass is in pull request #33. I still need to add the three fields that we decided to add on 6/29.

mmundy42 commented 9 years ago

Some questions on the new fields:

  1. It would be easiest to return the list of references to gapfill and fba objects in the ModelStats structure returned by list_models. Otherwise, an FBAModel object has a gapfillings array, but that is an array of ModelGapfill objects, which would require even more objects to be instantiated when transforming model objects from the workspace. See also issue #19. Any preferences?
  2. I can't remember if the "dir" in gapfill_dir was for "direction" or "directory".
  3. And I forgot to write down what exactly we wanted for the evidence field. I think it would make sense to add fields to the ModelReaction object that indicate whether a reaction was added by gapfill, whether its direction was changed by gapfill, and whatever we wanted for evidence; a hypothetical shape is sketched below.
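
If it helps pin down number 3, a hypothetical reaction entry with those flags might look like the following (Python-dict notation; the field names are placeholders, not an agreed spec):

# Hypothetical shape only; field names are placeholders pending the decision above.
reaction_with_gapfill_flags = {
    "id": "rxn00541_c0",
    "gapfilled": True,                 # reaction was added by gap filling
    "gapfill_direction_change": ">",   # direction forced by gap filling, if any
    "evidence": "",                    # free-form evidence string, content TBD
}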
nconrad commented 9 years ago

@mmundy42, so much for notes... I don't remember the reasoning for some of this either. But what you have said in number 3 makes sense to me. @cshenry?

nconrad commented 9 years ago

Here's an update on the current performance, using one of the larger microbial models, "Escherichia coli str. K-12 substr. MG1655star".

Calling workspaces directly:

(screenshot: timing for direct workspace calls, 2015-07-26)

Calling get_model:

(screenshot: timing for get_model, 2015-07-26)
nconrad commented 9 years ago

People have also requested gene names in the gene table, if that's possible.

i.e.,

typedef structure {
    gene_id id;
    string name;
    list<reaction_id> reactions;
} model_gene;
samseaver commented 9 years ago

In the original feature sub-object, the spec has:

list<string> aliases

One may return the first alias, or the list of aliases, but there is no guarantee that any alias would actually be the original gene name or an external gene identifier.
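
So any display-name logic would need a graceful fallback; a minimal sketch (the field names here are assumptions, not taken from the spec):

def gene_display_name(feature):
    # Pick something printable for the gene table: first alias if present,
    # otherwise the feature id. There is no guarantee the alias is a real
    # gene name or external identifier.
    aliases = feature.get("aliases") or []
    return aliases[0] if aliases else feature["id"]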