Closed ye-luo closed 5 years ago
@lshulen is this interesting for you?
I haven't had a chance to read this yet, but I am very interested. I'm not sure what to think about Peter's comment. There are things in here that I consider very interesting here, but I don't like making develop any more dependent on OpenMP. I'm already having to strip that out in many places in the Kokkos branch. What about having a new OpenMP branch that is distinct from the offload one?
I have a few comments after reading a bit more. There seem to be at least three big things here.
So, I still think 1 belongs in a separate fork / branch. 2 and 3 I think should go into develop with perhaps a bit more discussion.
Regarding point 2. Would things be more realistic to take this a bit further and actually have the computation of SPO be done inside slaterdeterminant? I realize this might be a bit more opaque for non-QMC people reading the mini app, but it would potentially force some consideration of data locality / movement into the design that is otherwise lacking.
Regarding point 3. Of course fixing the bug is important. I have been wondering a bit about the design of particleset with regards to activeParticle and activePos though. In considering things like batching for the NLPP, I have been feeling that simply having a single activeParticle / activePos can be a bit too confining depending on how you want to do the batching. Perhaps one could consider either having these be vectors / arrays of a given length or maybe even just creating "proposed move" objects that would carry this information and could then be handed back to the particleset if the move was accepted?
@lshulen
I haven't had a chance to read this yet, but I am very interested. I'm not sure what to think about Peter's comment. There are things in here that I consider very interesting here, but I don't like making develop any more dependent on OpenMP. I'm already having to strip that out in many places in the Kokkos branch. What about having a new OpenMP branch that is distinct from the offload one?
@ye-luo
@lshulen regarding splitting the PR. I can actually PR 2 (move spo) and 3(fix bug). However since Peter started form an old branch and the new abstraction he built drastically changes the code. Any change to the develop (probably already) causes a lot of conflict. So He was asking for a hold.
Regarding putting the SPO into determinant set, this will not happen now but it should be done when delayed update is introduced to the develop. On the other hand, the WaveFunction is more complicated than needed because we don't put the same class hierarchy as QMCPACK. This also needs a fix.
Regarding batched NLPP algorithm, QMCPACK has a much better design using VirtualPaticleSet. It is a much more consistent way of handle batching by constructing a set of virtual particles with all the distances updated consistently not just the positions. This is at the cost of memory but the number of quadrature points never increases with system size. So it is fine. At the moment, I introduced a option -P just skipping the NLPP. I don't think it is worth the effort in this part on miniQMC now.
@lshulen you expressed yourself very clear. It is me just typing slow.
I disagree. I would prefer that develop contains the simplest possible expression of the algorithm that we need to implement in our forks. Of course for performance testing we should compare to a solid OpenMP / threaded implementation on the host, but in my opinion, not all of our forks should have to build on this threaded implementation as the start. I'm using OpenMP to do the threading on the CPU myself, but the presence of the existing OpenMP pragmas causes a problem for me as Kokkos wants to manage this itself.
I didn't see any disagreement. I'm providing a baseline to express the parallelism with minimal OpenMP. Any fork feel free to change any piece of code. std:threads? We don't know what will happen in the feature. For a particular experiment you are working on, replacing the OpenMP is the correct first step.
Luke your third issue is a good reason to have an actual functional abstractions around threading. This is actually something that could come into develop without much disruption. Since I'm at the point where I really have a hard fork at this point i.e. it makes more sense to rename develop and one_code than actually merge. If you and @ye-luo are going to actually share develop I won't block that, but I think it would be more polite if you just began to share a trunk outside of develop.
As far as @lshulen #3, while its a little tricky I am trying to keep the build such that Devices that need to be mutually exclusive can avoid seeing each others headers and therefore the need for any ifdef's within those headers, without losing the ability to over time share increasing generic code. Certainly for open MP this is possible. I can swap anything using the basic omp::parallel for an abstraction that instead depends on std::thread. Haven't accomplished such for a kokkos parallel section but as I understand it perhaps a top level of parallelization based on cpu threads running basically autonomous batches isn't the best.
I'd like to merge this PR which has be explained most during today's call. I'd like to replace miniqmc with miniqmc_sync_move which carries the flexible driver and rename the old miniqmc to miniqmc_old. Any concern?
I'm OK with merging. Peter?
See my comments above. I don't feel any different.
On futher examination I think this can go in so the miniqmc_sync_move matters, but I don't like the API changes. At the same time some of the functions are now probably easier to port to the one_code design than before. So that is interesting.
This PR enables flexible batching of B batches of nw_B walkers on miniqmc_sync_move B batches are dispatched by first level threads. nw_B walkers in a batched go through flex_XXX interfaces. Inside flex_XXX interface, code path is selected based on nw_B. When nw_B==1, it goes to the single walker route which may contain nested OpenMP threads splitting a walker. When nw_B>1, it goes to the batched multi walker routine which has a fallback solution of OpenMP threads over walkers. In this way, the flexible batching is achieved with one unified driver.
In addition, spo is moved from Mover to Wavefunction. Bugfix: spo.evaluate_v(P.R[iel]) to spo.evaluate_v(P.activePos)