QMCPACK / miniqmc

QMCPACK miniapp: a simplified real space QMC code for algorithm development, performance portability testing, and computer science experiments

Need batched evaluation of NLPP #223

Closed ye-luo closed 5 years ago

ye-luo commented 5 years ago

Currently, evaluating NLPP is implemented very naively in miniQMC. The batched evaluation of NLPP in QMCPACK is a cleaner and more general implementation. It needs to be moved over.

PDoakORNL commented 5 years ago

This is already done for CUDA; I certainly don't want to see that work redone as well.

ye-luo commented 5 years ago

In QMCPACK, batching is not just about using particle positions; it needs to work through the call stacks. Some batched NLPP operations also need distance tables. The algorithm can differ between LCAO and splines. Once I open a PR, I will show you why it needs to be done that way.

ye-luo commented 5 years ago

I wrote down How-does-QMCPACK-evaluate-pseudopotentials. miniQMC is basically mimicking the code path without VirtualParticleSet. I was thinking of porting the VirtualParticleSet to miniQMC. @PDoakORNL, what you did in Crowd<Devices::CUDA>::calcNLPP is very similar to the QMCPACK CUDA code. There are a few drawbacks in the CUDA code.

  1. It cannot support the hybrid orbital representation, which needs both the quadrature point (QP) positions and their distances and displacements to the ions. Using VirtualParticleSet solves this problem in a more general and consistent way. A few other places also benefit from being able to implement better algorithms by batching QPs (see the sketch at the end of this comment).
  2. A significant rearrangement of NonLocalECPotential::evaluate and NonLocalECPComponent::evaluateOne is needed to handle batching. The QMCPACK CUDA code and CPU code have diverged significantly because of batching.
  3. The QMCPACK CUDA code hard-coded a buffer size of 4800, and you have 256. Usually we have 12 QPs and chunk the Nelec orbitals in blocks of 128; for example, 1024 SPOs / 128 = 8, and 8 * 12 = 96. This may be large enough thanks to concurrent batches, but I don't have any conclusion and more study is needed. If batching 12 QPs and using blocks of orbitals is not heavy enough for the GPU, we have to do something between VirtualParticleSet and what you have.
  4. To me, batching over walkers is not necessary and probably bad, because of divergence at the if statement on the pseudopotential radius.

PS: I noticed that having the SPO evaluation directly in the driver misguided you into separating batched SPO evaluation from the wavefunction. To represent QMCPACK, I will move it into the determinant evaluation once I drop the delayed update into miniQMC.
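
To make point 1 concrete, here is a minimal sketch (illustrative names only, not the actual QMCPACK VirtualParticleSet API) of a container that carries both the QP positions and their displacements/distances to the reference ion, so a spline SPOSet can read only the positions while an LCAO/hybrid SPOSet can also read the precomputed ion distances in one batched call:

#include <array>
#include <vector>

using Pos = std::array<double, 3>;

// Hypothetical QP container: everything a batched SPO evaluation might need.
struct QuadraturePointSet
{
  int refIon;                     // ion owning this NLPP channel
  int refElectron;                // electron being moved onto the sphere
  std::vector<Pos> positions;     // QP positions in the cell
  std::vector<Pos> dispToIon;     // displacement of each QP to refIon
  std::vector<double> distToIon;  // |dispToIon|, needed by LCAO/hybrid orbitals
};

// Hypothetical batched interface: splines use only qp.positions, while
// LCAO/hybrid representations also use the precomputed ion distances,
// so neither needs to rebuild a distance table per quadrature point.
struct SPOSetStub
{
  virtual ~SPOSetStub() = default;
  virtual void evaluateBatched(const QuadraturePointSet& qp,
                               std::vector<std::vector<double>>& spo_values) = 0;
};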

PDoakORNL commented 5 years ago

Efficient batching is device-implementation dependent. A particular Crowd::evalNLPP(...) should, based on the call, use a sensible algorithm for that device. Since devices evolve quickly, this code should be specialized and not try to be general to "all" devices. NonLocalECPComponent::evaluate should be the "reference" implementation for one NonLocalECPComponent. Crowd::evalNLPP should contain the "default algorithm".
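
A minimal sketch of that layering, assuming the Crowd class template already used in miniQMC (the member bodies here are purely illustrative): the primary template carries the portable default algorithm, and a device specialization is free to batch however suits that device, while NonLocalECPComponent::evaluate stays the per-component reference.

enum class Devices { CPU, CUDA };

template<Devices D>
struct Crowd
{
  // Default algorithm: loop over walkers/components and call the reference
  // NonLocalECPComponent evaluation one quadrature set at a time.
  void evalNLPP(/* walker and wavefunction handles */) { /* reference path */ }
};

template<>
struct Crowd<Devices::CUDA>
{
  // CUDA-specific algorithm: pack quadrature data into pinned buffers,
  // overlap transfers with kernels, and tune chunk sizes for the target GPU.
  void evalNLPP(/* walker and wavefunction handles */) { /* device path */ }
};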

My CUDA implementation is a WIP but "good enough" anything hardcoded is for volta and would need to be varied based on device detection i.e. device dependent code. Without concurrent batches the CUDA is not efficient. You must overlap transfer and kernels or the CUDA eval is pointless. The batching is not over walkers, the batch just provides enough data that its worth doing even for small systems (if you have enough walkers per thread). Other than scope this is a good reason for the Crowd to manage the batch evals and memory resources used for it.
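
For reference, a minimal sketch of the overlap pattern being described, not the actual CrowdCUDA code: the kernel name and buffer layout are made up, and the host buffers are assumed to be pinned so the asynchronous copies really do overlap with compute.

#include <cuda_runtime.h>
#include <vector>

// Placeholder kernel: the real one would accumulate NLPP contributions
// from the packed quadrature-point data of this chunk.
__global__ void evalNLPPChunk(const double* in, double* out, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = in[i];
}

// Give each chunk its own stream so the copy of chunk i+1 overlaps with the
// kernel of chunk i. h_in/h_out must be pinned (cudaMallocHost) for the
// copies to be truly asynchronous.
void evalNLPPOverlapped(const double* h_in, double* h_out,
                        double* d_in, double* d_out,
                        int chunk_elems, int num_chunks)
{
  std::vector<cudaStream_t> streams(num_chunks);
  for (auto& s : streams)
    cudaStreamCreate(&s);
  for (int i = 0; i < num_chunks; ++i)
  {
    const size_t off = static_cast<size_t>(i) * chunk_elems;
    cudaMemcpyAsync(d_in + off, h_in + off, chunk_elems * sizeof(double),
                    cudaMemcpyHostToDevice, streams[i]);
    evalNLPPChunk<<<(chunk_elems + 255) / 256, 256, 0, streams[i]>>>(d_in + off, d_out + off, chunk_elems);
    cudaMemcpyAsync(h_out + off, d_out + off, chunk_elems * sizeof(double),
                    cudaMemcpyDeviceToHost, streams[i]);
  }
  for (auto& s : streams)
  {
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
  }
}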

I don't consider an implementation where multiple algorithms can be selected between to be useful. Propose a solution to that, and stop porting things from QMCPACK that muddy the waters and make that harder to do.

VirtualParticleSet looks like a complication that is completely unnecessary for answering the questions we're asking here.

PDoakORNL commented 5 years ago

I'd also add that right now, due to some trouble I initially had with pinned memory, there is extra allocation and copying on the CPU side. It's extremely fast compared to the allocation and copying with the GPU, so it doesn't matter performance-wise. I haven't gone back and cleaned this up, and don't intend to until I have an allocator that can hand v, grad, etc. to singleSPOs from the Crowd (or just leave it std::allocator).
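
For what it's worth, a minimal sketch of such an allocator (illustrative, not code from this repo): a std::allocator-compatible pinned-host allocator, so the v/grad buffers handed out by the Crowd can be used directly as sources and targets of asynchronous device copies without extra staging copies.

#include <cuda_runtime.h>
#include <cstddef>
#include <new>
#include <vector>

// Minimal std::allocator-compatible pinned-host allocator (illustrative).
template<typename T>
struct PinnedAllocator
{
  using value_type = T;
  PinnedAllocator() = default;
  template<typename U>
  PinnedAllocator(const PinnedAllocator<U>&) {}

  T* allocate(std::size_t n)
  {
    void* p = nullptr;
    if (cudaMallocHost(&p, n * sizeof(T)) != cudaSuccess)
      throw std::bad_alloc();
    return static_cast<T*>(p);
  }
  void deallocate(T* p, std::size_t) { cudaFreeHost(p); }
};

template<typename T, typename U>
bool operator==(const PinnedAllocator<T>&, const PinnedAllocator<U>&) { return true; }
template<typename T, typename U>
bool operator!=(const PinnedAllocator<T>&, const PinnedAllocator<U>&) { return false; }

// Example: a pinned buffer that could back the v/grad results handed to singleSPOs.
using PinnedVector = std::vector<double, PinnedAllocator<double>>;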

ye-luo commented 5 years ago

We need miniQMC to capture the formulas that QMCPACK evaluates and then improve on them.

I saw you were packing walkers, electron-ion pairs, and QPs into a buffer; this is what I mean by batching over walkers. You do later dispatch the whole batch in chunks, but that is just an implementation detail. https://github.com/QMCPACK/miniqmc/blob/1a3493eea0b169c443b32e3ec64f02689db7cf6a/src/Drivers/CrowdCUDA.cpp#L45

To implement algorithms like the size-consistent T-move, the electron loop must be sequential because each T-move can be accepted or rejected, so the electron loop cannot be collapsed with the ion loop, which is what you did when packing.
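
Schematically (illustrative stand-in names, not QMCPACK code; the helper definitions are omitted), the constraint looks like this: only the work inside one electron's evaluation can be packed, while the accept/reject step keeps the electron loop sequential.

#include <vector>

// Illustrative stand-ins for one electron's batched NLPP evaluation and the
// T-move selection; definitions omitted.
struct TmoveProposal { bool accepted; int ion; int qp; };
std::vector<double> evaluateNLPPBatched(int jel);
TmoveProposal selectTmove(const std::vector<double>& ratios);
void applyMove(int jel, const TmoveProposal& move);

void advanceTmoves(int nelec)
{
  for (int jel = 0; jel < nelec; ++jel)
  {
    // Batched part: all ions within the NLPP cutoff of electron jel and all
    // quadrature points on their spheres can be evaluated together.
    std::vector<double> ratios = evaluateNLPPBatched(jel);

    // Sequential part: the T-move must be accepted or rejected before the
    // next electron is processed, because an accepted move changes the
    // configuration the next electron sees.
    TmoveProposal move = selectTmove(ratios);
    if (move.accepted)
      applyMove(jel, move);
  }
}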

A lot of your design choices are good for miniQMC but do not fit the needs of the math in QMCPACK. What I'm proposing is one implementation which can be used by several algorithms in the code to maximize code reuse. If the physics or math requires me to specialize code, I have to do it. If it is just about implementation and getting the best efficiency, I'd rather have one slightly slower implementation serving many cases.

PDoakORNL commented 5 years ago

As far as "I noticed that having the SPO evaluation directly in the driver misguided you into separating batched SPO evaluation from the wavefunction. To represent QMCPACK, I will move it into the determinant evaluation once I drop the delayed update into miniQMC." goes: how about instead you draw a diagram showing the different blocks of data and the 'functions' applied to them? As far as I can see, determinants need to know nothing about particle_set or phi; they only need V, grad, etc. The way it is in QMCPACK tangles up the concerns of the classes, ignores sensible encapsulation, and makes writing flexible code much more difficult.

ye-luo commented 5 years ago

See formula (4.2): assuming n=1, there is only one up-spin and one down-spin determinant. In (4.3), a determinant is constructed by the SPOSet from the result v for all the particles. In QMCPACK,

DiracDeterminant::ratioGrad(ParticleSet& P, int jel, GradType& grad_jel)
{
  // Depending on the SPOSet, it may use only P.R[jel], or the electron-ion distances as well (e.g. LCAO).
  SPOSet::evaluate(P, jel, Val, Grad, Hess);
  // invRow is a row of the transposed inverse Slater matrix psiM.
  grad_jel = dot(invRow, Grad);
  return dot(invRow, Val);
}

So most functions in DiracDeterminant depend on the results of the SPOSet, which depends on the ParticleSet. The delayed update is a bit more involved, but the dependency is the same.
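
For reference, and assuming the cited formulas are the usual spin-factorized Slater-Jastrow form built from SPOs (the exact numbering belongs to the external document), the structure is schematically:

\Psi_T(\mathbf{R}) = e^{J(\mathbf{R})} \sum_{k=1}^{n} c_k\,
  D_k^{\uparrow}(\mathbf{r}_1,\dots,\mathbf{r}_{N_\uparrow})\,
  D_k^{\downarrow}(\mathbf{r}_{N_\uparrow+1},\dots,\mathbf{r}_N),
\qquad
D = \det\!\left[\phi_j(\mathbf{r}_i)\right]

so for n=1 there is a single up-spin and a single down-spin determinant, and every determinant entry is an SPO value phi_j(r_i) supplied by the SPOSet.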

PDoakORNL commented 5 years ago

This is for flexibility in having the crowd control how the computation is done. It assumes synchronization above this stack frame so that V and G are valid when used.

DiracDeterminant::ratioGrad(VType VFromSPOSetEvaluation, GradType GradFromSPOSetEvaluation, GradType& grad_jel)
{
  grad_jel = dot(invRow, GradFromSPOSetEvaluation);
  return dot(invRow, VFromSPOSetEvaluation);
}

PDoakORNL commented 5 years ago

The arguments from the SPO evaluation could also be "futures" that would block if the SPO evaluation was not yet complete.
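
A minimal sketch of that idea with std::future (stand-in types, not the QMCPACK interfaces; the gradient is flattened to one component per orbital for brevity):

#include <future>
#include <numeric>
#include <vector>

// Illustrative stand-ins: the SPO values/gradients for one proposed electron
// move, produced asynchronously by the crowd (possibly on a device).
struct SPOResult
{
  std::vector<double> psiV;   // SPO values at the proposed position
  std::vector<double> gradV;  // one gradient component per orbital (simplified)
};

// ratioGrad consumes a future and blocks only if the asynchronous SPO
// evaluation has not finished by the time the result is needed.
struct DeterminantSketch
{
  std::vector<double> invRow;  // row of the transposed inverse Slater matrix

  double ratioGrad(std::future<SPOResult> spo, double& grad_jel) const
  {
    const SPOResult r = spo.get();  // synchronization point
    grad_jel = std::inner_product(invRow.begin(), invRow.end(), r.gradV.begin(), 0.0);
    return std::inner_product(invRow.begin(), invRow.end(), r.psiV.begin(), 0.0);
  }
};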

ye-luo commented 5 years ago

ratioGrad is a common interface for single determinants, multi-determinants, and Jastrow factors because, mathematically, ratios can be multiplied together and grads can be added together. The way you are doing it requires a different ratioGrad argument list for each component, because each one needs different data to compute ratioGrad.
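
A minimal sketch of why the shared signature composes (illustrative types, not the QMCPACK TrialWaveFunction code): the trial wavefunction simply multiplies the component ratios and sums the component gradients through one common interface.

#include <array>
#include <memory>
#include <vector>

using Grad = std::array<double, 3>;

// Illustrative common interface: every component (determinant, multi-determinant,
// Jastrow) computes its own ratio and gradient from the same arguments.
struct WaveFunctionComponentSketch
{
  virtual ~WaveFunctionComponentSketch() = default;
  virtual double ratioGrad(/* ParticleSet& P, */ int iel, Grad& grad) = 0;
};

// Because ratios multiply and gradients add, any set of components can be
// composed behind the single shared signature.
double ratioGradAll(std::vector<std::unique_ptr<WaveFunctionComponentSketch>>& components,
                    int iel, Grad& grad_total)
{
  double ratio = 1.0;
  grad_total = {0.0, 0.0, 0.0};
  for (auto& c : components)
  {
    Grad g{0.0, 0.0, 0.0};
    ratio *= c->ratioGrad(iel, g);
    for (int d = 0; d < 3; ++d)
      grad_total[d] += g[d];
  }
  return ratio;
}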