QMCPACK / miniqmc

QMCPACK miniapp: a simplified real space QMC code for algorithm development, performance portability testing, and computer science experiments

How to handle asynchronous computation #245

Open ye-luo opened 5 years ago

ye-luo commented 5 years ago

My rough idea is to incorporate asynchronous computation while hiding the details at the lowest possible level. Since we have limited confidence in applying a tasking programming model to the whole code, the following scheme may hopefully achieve sufficient asynchronous behaviour and performance.

When we compute the trial wavefunction, the call sequence is

TrialWF->ratioGrad()
{
  TrialWF->WFC[0]->ratioGrad(iel)   // determinant
  {
    SPO->evaluate(iel);
    getInvRow(psi_inv);
    dot(spo_v, psi_inv);
  }
  TrialWF->WFC[1]->ratioGrad(iel);  // Jastrow
}

Instead, we separate the work inside ratioGrad into two parts: the async launching part and the waiting part.

TrialWF->ratioGrad()
{
  TrialWF->WFC[0]->ratioGradLaunchAsync(iel)   // determinant
  {
    SPO->evaluateLaunchAsync(iel);
    getInvRowLaunchAsync(psi_inv);
  }
  TrialWF->WFC[1]->ratioGradLaunchAsync(iel);  // Jastrow

  // finish launching async calls of all the WFCs

  TrialWF->WFC[0]->ratioGrad(iel)   // determinant
  {
    SPO->evaluate(iel);      // wait for completion inside
    getInvRow(psi_inv);      // wait for completion inside
    dot(spo_v, psiM[iel]);
  }
  TrialWF->WFC[1]->ratioGrad(iel);  // Jastrow
}

This is similar to what we have in the CUDA code, but I'm expanding it to allow working through the levels if necessary. CUDA or OpenMP offload can be hidden beneath. In the CUDA case, the delayed update engine and the SPO can use different streams to maximize asynchronous concurrent execution, whereas the QMCPACK CUDA code relies on a single stream to enforce synchronization. The SPO can also be OpenMP offload, with the asynchronous control self-contained. If necessary, TrialWF->ratioGrad itself can also be split into ratioGradLaunchAsync and ratioGrad, which can be called by the driver.
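
As a minimal sketch of the launch/wait split, assuming a CUDA backend; the class, member, and helper names below are illustrative assumptions, not miniqmc's actual interface:

#include <cuda_runtime.h>

// Structural sketch only; the SPO set, getInvRow, and dot pieces are left as comments.
class DiracDeterminantAsync
{
  cudaStream_t stream_;     // stream owned by this component, hidden from callers

public:
  DiracDeterminantAsync()  { cudaStreamCreate(&stream_); }
  ~DiracDeterminantAsync() { cudaStreamDestroy(stream_); }

  // launch part: enqueue device work for electron iel and return immediately
  void ratioGradLaunchAsync(int iel)
  {
    // spo_->evaluateLaunchAsync(iel, stream_);   // SPO kernels on this stream
    // launchGetInvRowAsync(iel, stream_);        // async fetch of the psi_inv row
  }

  // wait part: block only when the result is actually consumed
  double ratioGrad(int iel)
  {
    cudaStreamSynchronize(stream_);               // wait for the launches above
    // return dotRatioGrad(spo_v_, psi_inv_row_); // small host-side reduction
    return 0.0;
  }
};

Because each component owns its own stream, the SPO evaluation and the delayed update engine can overlap on the device, and a WFC that needs no asynchrony can implement the launch call as a no-op.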

Any piece not needing async remains unchanged.

Pros: we explicitly control the dependencies. Cons: we explicitly control the waits ourselves instead of leaving them to the runtime.

PDoakORNL commented 5 years ago

The QMCPACK CUDA code is dead. My CUDA code does not depend on a single stream for synchronization. It does depend on being able to construct transfers (most important) and evaluations (not very important) from more than one trialWF's worth of data.
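
As a rough illustration of that kind of transfer construction, a sketch (every name here is an assumption) that packs several walkers' values into one pinned staging buffer and issues a single asynchronous copy rather than one small copy per walker:

#include <algorithm>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Sketch only: gather per-walker vectors contiguously, then one cudaMemcpyAsync.
void transferCrowdValues(const std::vector<std::vector<double>>& walker_v,
                         double* pinned_host,     // allocated with cudaMallocHost
                         double* device_buffer,
                         cudaStream_t stream)
{
  std::size_t offset = 0;
  for (const auto& v : walker_v)                  // pack many walkers back to back
  {
    std::copy(v.begin(), v.end(), pinned_host + offset);
    offset += v.size();
  }
  cudaMemcpyAsync(device_buffer, pinned_host, offset * sizeof(double),
                  cudaMemcpyHostToDevice, stream);
}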

The synchronization pattern is going to depend on the architecture, on whether you have bothered to provide a device specialization for a particular operation, and on which other operations have been specialized. SPOs, Determinants, and Jastrows should not have to know each other's implementations, nor should the logic for synchronizing between them be spread through the object hierarchy.

Looking to the future, I prefer this sort of thing. Reading the input(s) should produce a parameters object for each QMC run. This should include the wfc-to-spo mapping and the mapping of wfc's and spo's to computation devices. The memory requirements for the wfc determinant and input buffers, as well as the SPO storage, would also need to be in this structure.
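
One possible shape for such a parameters object, purely as a sketch (every field and type name below is an assumption):

#include <cstddef>
#include <unordered_map>

enum class DeviceTag { CPU, CUDA, OMP_OFFLOAD };

// Hypothetical per-run parameters produced by reading the input(s).
struct QMCRunParameters
{
  // which SPO set each wavefunction component uses
  std::unordered_map<int, int> wfc_to_spo;
  // which computation device each wfc / spo is evaluated on
  std::unordered_map<int, DeviceTag> wfc_to_device;
  std::unordered_map<int, DeviceTag> spo_to_device;
  // memory requirements gathered while parsing
  std::size_t determinant_bytes  = 0;
  std::size_t input_buffer_bytes = 0;
  std::size_t spo_storage_bytes  = 0;
};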

//in the driver, anything with the semantics of std::async could be used (pseudocode)
driver<Async>::do_ratioGrad
{
  //crowd is in scope

  det = crowd_wf[0].wfc[i];
  spo = crowd_wfc_spo_map[det.id];
  //crowd_calc_location(spo) returns a device tag for that spo
  std::future<int> fut_spo_eval =
      likeSTDAsync(crowd_calc_location(spo), launch_type,
                   multi_func<SPOType, DEVICE>.evaluate(spo, positions, iels, crowd_els, ions,
                                                        crowd_v, crowd_g, crowd_h));
  fut_spo_eval.get();
  std::future<std::vector<ValueType>> fut_ratio_grad =
      likeSTDAsync(crowd_calc_location(dets), launch_type,
                   multi_func<DetType, DEVICE>.ratioGrad(dets, crowd_v, crowd_g, iat));
}

driver<Sync>::do_ratioGrad
{
  //crowd is in scope
  multi_func<SPOType, DEVICE>.evaluate(spo, positions, iels, crowd_els, ions, crowd_v, crowd_g, crowd_h);
  multi_func<DetType, DEVICE>.ratioGrad(crowd_v, crowd_g, iat);
}

template<class SPOType, Device DT = CPU>
class multi_func {
  //default implementation: loop over the crowd, calling the single-walker evaluate
  void evaluate(SPOType spo, auto positions, auto iels, auto crowd_els, auto ions,
                auto& crowd_v, auto& crowd_g, auto& crowd_h) {
    //or parallel block construct of your choice
    for (size_t i = 0; i < crowd_v.size(); ++i)
      spo.evaluate(positions[i], crowd_v[i], crowd_g[i], crowd_h[i]);
  }
};
PDoakORNL commented 5 years ago

You actually don't need the driver specialization if you make likeSTDAsync default to a blocking synchronous evaluation.
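
A sketch of what that default could look like (hypothetical signature; the launch_type argument from the pseudocode above is dropped and a callable is taken instead of an evaluated expression): the generic overload runs the work immediately on the calling thread and returns an already-satisfied future, so driver code written against the asynchronous interface degrades to a blocking, synchronous evaluation.

#include <future>
#include <type_traits>
#include <utility>

// Hypothetical default launcher; a device-specific overload could instead
// enqueue the work on a stream or call std::async(std::launch::async, ...).
template <typename DeviceTag, typename Callable, typename... Args>
auto likeSTDAsync(DeviceTag /*where*/, Callable&& f, Args&&... args)
{
  using Result = std::invoke_result_t<Callable, Args...>;
  std::promise<Result> p;
  if constexpr (std::is_void_v<Result>)
  {
    std::forward<Callable>(f)(std::forward<Args>(args)...);
    p.set_value();
  }
  else
  {
    p.set_value(std::forward<Callable>(f)(std::forward<Args>(args)...));
  }
  return p.get_future();  // already satisfied; .get() returns without blocking further
}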

lshulen commented 5 years ago

If we pursue either of these, we need to be very careful about the trade-offs associated with moving to this model of programming. New programmers coming to the code are likely to have to learn to reason about these constructs for the first time. Also, we will have to be incredibly sure that our unit tests and integration tests are robust enough to catch the sort of race conditions that may not occur for every ordering of the evaluations.

Are we totally convinced that the speed-up gained from this programming model is worth the other costs?

prckent commented 5 years ago

@lshulen Good point. Totally agree. MiniQMC may be a good place to play, but for QMCPACK step 1 should involve the absolute minimum complexity and therefore minimum asynchronicity, possibly none. Only when that is working and we see a clear and significant benefit to a more complex and capable implementation should we move forward.