This is already done for CUDA; I certainly don't want to see that work redone as well.
In QMCPACK, batching is not just about particle positions; it needs to work through the call stacks. Some batched NLPP operations also need distance tables. The algorithm used can differ between LCAO and splines. Once I open the PR, I will show why it needs to be done that way.
I wrote down How-does-QMCPACK-evaluate-pseudopotentials.
miniQMC basically mimics the code path without VirtualParticleSet. I was thinking of porting VirtualParticleSet to miniQMC.
@PDoakORNL, what you did in Crowd<Devices::CUDA>::calcNLPP is very similar to the QMCPACK CUDA code. There are a few drawbacks in the CUDA code. NonLocalECPotential::evaluate and NonLocalECPComponent::evaluateOne are needed to handle batching. The QMCPACK CUDA code and CPU code have diverged significantly due to batching.
PS: I noticed that having the SPO evaluation directly in the driver misguided you into separating the batched SPO evaluation from the wavefunction. To represent QMCPACK, I will move it into the determinant evaluation once I drop the delayed update into miniQMC.
Efficient batching is device-implementation dependent and specific to a particular Crowd specialization.
My CUDA implementation is a WIP but "good enough". Anything hardcoded is tuned for Volta and would need to be varied based on device detection, i.e. device-dependent code. Without concurrent batches the CUDA path is not efficient: you must overlap transfers and kernels or the CUDA evaluation is pointless. The batching is not over walkers; the batch just provides enough data that it's worth doing even for small systems (if you have enough walkers per thread). Other than scope, this is a good reason for the Crowd to manage the batch evaluations and the memory resources used for them.
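Roughly, the overlap I mean looks like the sketch below. This is illustrative only: launch_nlpp_kernel, BatchBuffer, and the two-stream round robin are placeholder names and choices for illustration, not the actual CrowdCUDA code, and real code would check every CUDA status.

#include <cuda_runtime.h>
#include <vector>

// Stand-in for the real NLPP kernel launch; real code would launch
// kernel<<<grid, block, 0, stream>>>(...) on the supplied stream.
inline void launch_nlpp_kernel(const double*, double*, size_t, cudaStream_t) {}

struct BatchBuffer
{
  double* host_pinned; // must be page-locked or cudaMemcpyAsync is not truly asynchronous
  double* device;
  size_t count;
};

void evaluate_batches(std::vector<BatchBuffer>& batches, double* d_out)
{
  cudaStream_t streams[2];
  for (auto& s : streams)
    cudaStreamCreate(&s);

  for (size_t b = 0; b < batches.size(); ++b)
  {
    cudaStream_t s = streams[b % 2];
    // The copy for batch b overlaps the kernel of the previous batch on the other stream.
    cudaMemcpyAsync(batches[b].device, batches[b].host_pinned,
                    batches[b].count * sizeof(double), cudaMemcpyHostToDevice, s);
    // Queued on the same stream, so the kernel waits only for its own copy.
    launch_nlpp_kernel(batches[b].device, d_out, batches[b].count, s);
  }
  cudaDeviceSynchronize();
  for (auto& s : streams)
    cudaStreamDestroy(s);
}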
I don't consider an implementation where multiple algorithms can be selected between to be useful. Propose a solution to that, and stop porting things from QMCPACK that muddy the waters and make that harder to do.
A virtual particle set looks like a complication that is completely unnecessary for answering the questions we're asking here.
I'd also add that right now, due to some trouble I initially had with pinned memory, there is extra allocation and copying on the CPU side. It's extremely fast compared to the allocation and copying on the GPU, so it doesn't matter performance-wise. I haven't gone back and cleaned this up, and don't intend to until I have an allocator that can hand v, grad, etc. to single SPOs from the Crowd (or I just leave it std::allocator).
We need miniQMC to capture the formulas that QMCPACK evaluates and then improve on them.
I saw you were packing walkers, electron-ion pairs, and QP points into one buffer. This is what I mean by batching over walkers. You did later dispatch the whole batch in chunks, but that is just an implementation detail. https://github.com/QMCPACK/miniqmc/blob/1a3493eea0b169c443b32e3ec64f02689db7cf6a/src/Drivers/CrowdCUDA.cpp#L45
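What I call batching over walkers amounts to a packing loop like this sketch (all names are illustrative placeholders, not the actual CrowdCUDA code):

// Flatten walker x (electron-ion pair) x quadrature point into one buffer before dispatch.
size_t idx = 0;
for (int iw = 0; iw < num_walkers; ++iw)
  for (const auto& pair : electron_ion_pairs(iw))         // pairs of walker iw
    for (int iq = 0; iq < num_quadrature_points; ++iq)    // QP points per pair
      positions[idx++] = quadrature_position(iw, pair, iq);
// The packed buffer can then be dispatched to the device in chunks.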
To implement algorithms like the size-consistent T-move, the electron loop must be sequential because a T-move can be accepted or rejected, so the electron loop cannot be collapsed with the ion loop, which is what you did when packing.
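A sketch of the constraint (proposeTMove, acceptTMove, and evaluateOnePair are hypothetical placeholders, not QMCPACK signatures):

// The ion loop for one electron can be packed/batched, but the electron loop cannot be
// fused with it: accepting a T-move for electron jel changes what electron jel+1 sees.
for (int jel = 0; jel < num_electrons; ++jel)
{
  RealType value = 0.0;
  for (int iat : ions_within_cutoff(jel)) // batchable per electron
    value += evaluateOnePair(jel, iat);
  auto tmove = proposeTMove(jel, value);  // accept or reject
  if (tmove.accepted)
    acceptTMove(jel, tmove.new_position); // subsequent electrons see the move
}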
A lot of your design choices are good for miniQMC but do not fit the needs of the math in QMCPACK. What I'm proposing is one implementation which can be used by several algorithms in the code to maximize code reuse. If the physics or math requires me to specialize code, I have to do it. If it is just a matter of implementation and getting the best efficiency, I'd rather have one slightly slower implementation serving many cases.
As for "I noticed that having the SPO evaluation directly in the driver misguided you into separating the batched SPO evaluation from the wavefunction. To represent QMCPACK, I will move it into the determinant evaluation once I drop the delayed update into miniQMC.": how about instead you draw a diagram showing the different blocks of data and the 'functions' applied to them? As far as I can see, determinants need to know nothing about particle_set or phi; they only need V, grad, etc. The way it is in QMCPACK tangles up the concerns of the classes, ignores sensible encapsulation, and makes writing flexible code much more difficult.
See formula (4.2): assuming n=1, there is only one up-spin and one down-spin determinant. In (4.3), a determinant is constructed from the SPOSet result v for all the particles. In QMCPACK,
ValueType DiracDeterminant::ratioGrad(ParticleSet& P, int jel, GradType& grad_jel)
{
  // Depending on the SPOSet, it may use P.R[jel] or also the electron-ion distances, as LCAO does.
  SPOSet::evaluate(P, jel, psiV, dpsiV, d2psiV);
  // invRow is a row of psiM, the transposed inverse of the Slater matrix.
  grad_jel = dot(invRow, dpsiV);
  return dot(invRow, psiV);
}
So most functions in DiracDeterminant depend on the results of the SPOSet, which in turn depends on the ParticleSet. The delayed update is a bit more involved, but the dependency is the same.
For flexibility, have the crowd control how the computation is done. This assumes synchronization above this stack frame, so V and G are valid when used:
ValueType DiracDeterminant::ratioGrad(VType VFromSPOSetEvaluation, GradType GradFromSPOSetEvaluation, GradType& grad_jel)
{
  grad_jel = dot(invRow, GradFromSPOSetEvaluation);
  return dot(invRow, VFromSPOSetEvaluation);
}
The arguments from the SPO evaluation could also be a "future" that would block if SPO evaluation was not yet complete.
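For example, a minimal sketch assuming std::shared_future as the mechanism (the names V_future and Grad_future are placeholders, and a real version would have to decide where the corresponding promises are fulfilled):

#include <future>

ValueType DiracDeterminant::ratioGrad(std::shared_future<VType> V_future,
                                      std::shared_future<GradType> Grad_future,
                                      GradType& grad_jel)
{
  // get() blocks only if the batched SPO evaluation has not finished yet.
  grad_jel = dot(invRow, Grad_future.get());
  return dot(invRow, V_future.get());
}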
ratioGrad is a common interface for single determinants, multideterminants, and Jastrow factors because, mathematically, ratios can be multiplied together and grads can be added together. The way you are doing it requires different ratioGrad arguments for each component, because they need different data to compute ratioGrad.
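That is the payoff of the common interface: at the wavefunction level all components can be combined in one loop, since ratios multiply and grads add. A rough sketch (the `components` container and this loop are illustrative, not the actual TrialWaveFunction code):

ValueType ratio = 1.0;
GradType grad_jel{};
for (WaveFunctionComponent* comp : components) // determinants, multideterminants, Jastrow factors
{
  GradType g{};
  ratio *= comp->ratioGrad(P, jel, g); // every component takes the same arguments
  grad_jel += g;
}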
Currently, evaluating the NLPP is implemented very naively in miniQMC. The batched evaluation of the NLPP in QMCPACK is a cleaner and more general implementation. It needs to be moved over.