OnlinePhylo / sts

Sequential Tree Sampler for online phylogenetics
GNU General Public License v3.0
16 stars 3 forks source link

Evaluate semantics of state modifications #66

Open koadman opened 11 years ago

koadman commented 11 years ago

When the model state is modified by an MCMC move or SMC extension, the affected components can either be stored in a newly allocated object or remain in-place but have values change. The former approach follows "copy-on-write" semantics, the latter does not.

In copy-on-write, summaries and metadata on an object can be keyed on the object's memory address and will remain valid for the lifetime of the object. This is how Online_calculator's log likelihood cache currently works (* see below).

One major disadvantage of the copy-on-write approach becomes apparent when considering continuous rate parameters in phylogenetic models. Large parts of a phylogenetic model have conditional dependency on single continuous parameters such as mutation rate. When mutation rate changes the LL at a forest node becomes invalid. Because the LL are keyed on object memory address, preserving the functionality of this cache approach would require us to copy not just on write, but on a write to any model parameter for which a conditional dependency exists. For large trees it might become expensive to do such copying (though possibly still cheap relative to calculating log likelihoods), not to mention tedious. For example, why should one have to write and invoke code for a whole forest traversal & copy just to change one value. I will call this approach copy-on-dependency-write.

Before we start stacking in model parameters, it seems prudent to evaluate whether we want to continue with the copy-on-dependency-write approach.

There are some alternatives. If we want to allow unfettered write access to model parameters, we might actively notify objects that have registered an interest in that model parameter. This is the java-esque EventListener approach. The likelihood cache might register as a listener for nodes, which might also listen to the rate. When the rate changes the nodes would hear about it and have to propagate a message to the likelihood cache which could then delete any cached likelihood.

Another possibility is some kind of generic versioning system (e.g. a monotonically increasing integer) for objects & their dependencies. In this approach, each model parameter would contain a "revision" object which consists of an integer version number and also pointers to version objects for any model parameters on which a conditional dependency exists, along with the most recent version of those dependencies observed by the present object. Every time an object changes, the version number is incremented. Cached values such as the LL could still be keyed on object address, but the value stored would be both log likelihood and the associated object version. When deciding whether the cached likelihood is valid, the calculator can request the current version number of an object and compare it to the version of the cached likelihood. Upon a request for its object version, the model parameter will in turn request the version of any dependency variables and compare that version to what it has recorded as the most recently observed version of that dependency variable. If the dependency is newer, the version is incremented. I will call this approach lazy parameter versioning.

As compared to the EventListener approach, lazy parameter versioning has the advantage that many parts of the model can be updated without forcing a traversal of the conditional dependency graph for each and every change.

There are surely other ways we could handle this. Does anybody have other suggestions? In the absence of other opinions I would be inclined to go for lazy parameter versioning. In either the EventListener or lazy parameter versioning approaches, we could implement this as a base class for all model components that contains a versioning system and a means to register the dependencies among objects.

(*) In the current code we follow copy-on-write for branch length changes to nodes, mostly, except for Eb_bl_proposer which can use Online_calculator::invalidate() to clear a cached likelihood entry.

koadman commented 11 years ago

I should add that copy-on-write has some advantages too. First, it makes shared-memory parallelism really easy because threads won't trip over each other by making concurrent model changes. Second, it could also be implemented using a base class to maintain a conditional dependency structure on model parameters, so the code footprint might not be large. Third, copy time is likely to be small relative to peeling (but what about other models that don't involve peeling?). Computers have gotten pretty good at allocating&freeing memory chunks since I set out on my coding journey many moons ago.

cmccoy commented 11 years ago

FWIW, Bio++ has some bits to deal with the event listener approach:

http://biopp.univ-montp2.fr/Documents/ClassDocumentation/bpp-core/html/classbpp_1_1Parameter.html#abfdb45447531f2ded0047e5526762eee

It still suffers the graph traversal problem you mention...

On Fri, Dec 14, 2012 at 6:01 AM, Aaron Darling notifications@github.comwrote:

I should add that copy-on-write has some advantages too. First, it makes shared-memory parallelism really easy because threads won't trip over each other by making concurrent model changes. Second, it could also be implemented using a base class to maintain a conditional dependency structure on model parameters, so the code footprint might not be large. Third, copy time is likely to be small relative to peeling (but what about other models that don't involve peeling?). Computers have gotten pretty good at allocating&freeing memory chunks since I set out on my coding journey many moons ago.

— Reply to this email directly or view it on GitHubhttps://github.com/metatangle/sts/issues/66#issuecomment-11377085.

koadman commented 11 years ago

i didn't see that, thanks! it actually looks workable and might save us from reinventing the wheel. the graph traversal seems unavoidable to me in any of the three situations i outlined above, it's more a question of how burdensome the traversal is.

i started to sketch out what an implementation of copy-on-dependency write would look like in code. It's a funky twist on the usual C++ clone(). Here's an example I came up with for a node class:

void node::dependency_clone(std::unordered_map< Parameter*, Parameter* >& invalid_to_new_map) {
    // do a straight up copy
    newnode = new node(this);
    // if a child of this node is also invalid get the address of the new copy
    // this little incantation could be made generic or put in the base class
    if(invalid_to_new_map.count(c1) > 0){ 
      if(invalid_to_new_map[c1] == NULL) c1->dependency_clone(invalid_to_new_map);
      newnode.c1 = invalid_to_new_map[c1];
    }
    if(invalid_to_new_map.count(c2) > 0){ 
      if(invalid_to_new_map[c2] == NULL) c2->dependency_clone(invalid_to_new_map);
      newnode.c2 = invalid_to_new_map[c2];
    }
    // this could also go in the base class
    invalid_to_new_map[this] = newnode;
}

we would probably want to use shared_ptr and/or weak_ptr instead of new but I didn't work that out for this sketch... This function would be outlined in a base class of all model parameters, something like:

class Parameter
{
public:
  Parameter* dependency_clone(Parameter* p) = 0;
  void register_dependent(Parameter* p);
private:
  // a list of parameters that have conditional dependencies on this parameter in the model
  vector< std::weak_ptr<Parameter> > dependents;
};

As with clone() we would need dependency_clone() to be defined in each derived class.

EDIT: there were a few problems with the above code...

koadman commented 11 years ago

Here we go:

void Node::dependency_clone(std::unordered_map< Parameter*, Parameter* >& invalid_to_new_map) {
    // do a straight up copy, this copies the current children pointers
    newnode = new node(this);
    // copy new children only if necessary
    depcopy(invalid_to_new_map, child1, newnode.child1);
    depcopy(invalid_to_new_map, child2, newnode.child2);

    // add the new parameter to the map
    invalid_to_new_map[this] = new_p;
}

// clone a dependency tree
void Parameter::dependency_clone() {
    std::unordered_map< Parameter*, Parameter* > invalid_to_new_map;
    dependency_pre_clone(invalid_to_new_map)
    for( auto d : dependents ){
        if(!invalid_to_new_map[d]){
           d->dependency_clone(invalid_to_new_map);
        }
    }
}

// does a depth-first traversal to discover all affected model components.
void Parameter::dependency_pre_clone(std::unordered_map< Parameter*, Parameter* >& invalid_to_new_map) {
    for( auto d : dependents ){
        invalid_to_new_map[d] = NULL;
        d->dependency_pre_clone(invalid_to_new_map);
    }
}

// create a clone of a parameter if it is a dependent and only if it hasn't already been cloned
void Parameter::depcopy(std::unordered_map< Parameter*, Parameter* >& invalid_to_new_map, Parameter* current, Parameter*& new_p)
{
    if(invalid_to_new_map.count(current) > 0){ 
      if(!invalid_to_new_map[current]){
        current->dependency_clone(invalid_to_new_map);
      }
      new_p = invalid_to_new_map[c1];
    }    
}

Derived parameter classes, e.g. Node will only need to implement the dependency_clone(invalid_to_new_map) function. We might actually break this up into two parts, one which does the clone and something like "depcopy_parts" which makes the two calls to depcopy(). This would make further subclassing of parameters possible.

cmccoy commented 11 years ago

I'm not sure I totally follow - would invalid_to_new_map be from the previous value of a parameter to the new value? Which is then filled in with all of the dependent parameters? Sorry to be slow.

koadman commented 11 years ago

yes, invalid_to_new_map would hold a mapping of previous parameter pointers (invalidated by a change) to new pointers.

We could also create a hybrid of the versioning and EventListener approaches wherein each parameter has an integer version that gets bumped whenever it's notified of a change in a paramter it depends on.