Flexible batched moves - Githubissues

ye-luo commented 5 years ago

This PR enables flexible batching of B batches of nw_B walkers on miniqmc_sync_move B batches are dispatched by first level threads. nw_B walkers in a batched go through flex_XXX interfaces. Inside flex_XXX interface, code path is selected based on nw_B. When nw_B==1, it goes to the single walker route which may contain nested OpenMP threads splitting a walker. When nw_B>1, it goes to the batched multi walker routine which has a fallback solution of OpenMP threads over walkers. In this way, the flexible batching is achieved with one unified driver.

// 16 batches of 1 walker behaves the same as miniqmc
OMP_NUM_THREADS=16 miniqmc_sync_move -g "2 2 1"
// 4 batches of 4 walkers
OMP_NUM_THREADS=4 miniqmc_sync_move -g "2 2 1" -B 4
// 4 batches of 4 walkers, each batch using 4 nested OpenMP threads on the second level
OMP_NESTED=TRUE OMP_NUM_THREADS=4,4 miniqmc_sync_move -g "2 2 1" -B 4

In addition, spo is moved from Mover to Wavefunction. Bugfix: spo.evaluate_v(P.R[iel]) to spo.evaluate_v(P.activePos)

ye-luo commented 5 years ago

@lshulen is this interesting for you?

lshulen commented 5 years ago

I haven't had a chance to read this yet, but I am very interested. I'm not sure what to think about Peter's comment. There are things in here that I consider very interesting here, but I don't like making develop any more dependent on OpenMP. I'm already having to strip that out in many places in the Kokkos branch. What about having a new OpenMP branch that is distinct from the offload one?

lshulen commented 5 years ago

I have a few comments after reading a bit more. There seem to be at least three big things here.

Batched moves with multiple levels of openMP threading
moving spo into wavefunction
fixing bug regarding activeParticle

So, I still think 1 belongs in a separate fork / branch. 2 and 3 I think should go into develop with perhaps a bit more discussion.

Regarding point 2. Would things be more realistic to take this a bit further and actually have the computation of SPO be done inside slaterdeterminant? I realize this might be a bit more opaque for non-QMC people reading the mini app, but it would potentially force some consideration of data locality / movement into the design that is otherwise lacking.

Regarding point 3. Of course fixing the bug is important. I have been wondering a bit about the design of particleset with regards to activeParticle and activePos though. In considering things like batching for the NLPP, I have been feeling that simply having a single activeParticle / activePos can be a bit too confining depending on how you want to do the batching. Perhaps one could consider either having these be vectors / arrays of a given length or maybe even just creating "proposed move" objects that would carry this information and could then be handed back to the particleset if the move was accepted?

ye-luo commented 5 years ago

@lshulen

I haven't had a chance to read this yet, but I am very interested. I'm not sure what to think about Peter's comment. There are things in here that I consider very interesting here, but I don't like making develop any more dependent on OpenMP. I'm already having to strip that out in many places in the Kokkos branch. What about having a new OpenMP branch that is distinct from the offload one?

Peter has a design of doing more abstractions in the code which is different from what I'm doing here. We will discuss that in a WG meeting probably next week. (This week is all hands)
At the algorithm level how to map the execution to hardware, Peter and I both found that have batches of walkers is very beneficial.
I think the baseline code needs to be a threaded code not a serial code. On the host side, the use of OpenMP is very light. I think we don't use anything more than #pragma omp parallel for. It apprears only in the drivers and the fallback solutions. Peter found it easy to be replaced with other threading models. We need one threading model to handle the batches. How did you manage host threading?

lshulen commented 5 years ago

@ye-luo

sure
I'm not expressing myself clearly here. Right now we are putting random numbers into the values that are calculated in the determinant part of the code. I'd like to change so that we put actual values coming from the spos into the determinant. This means that calls to determinant evaluations will actually have to look at the value of the position and will actually get data from the spo evaluation. To me this is more like what QMCPACK actually does and forces a more honest treatment of data movement in our implementations.
I disagree. I would prefer that develop contains the simplest possible expression of the algorithm that we need to implement in our forks. Of course for performance testing we should compare to a solid OpenMP / threaded implementation on the host, but in my opinion, not all of our forks should have to build on this threaded implementation as the start. I'm using OpenMP to do the threading on the CPU myself, but the presence of the existing OpenMP pragmas causes a problem for me as Kokkos wants to manage this itself.

ye-luo commented 5 years ago

@lshulen regarding splitting the PR. I can actually PR 2 (move spo) and 3(fix bug). However since Peter started form an old branch and the new abstraction he built drastically changes the code. Any change to the develop (probably already) causes a lot of conflict. So He was asking for a hold.

Regarding putting the SPO into determinant set, this will not happen now but it should be done when delayed update is introduced to the develop. On the other hand, the WaveFunction is more complicated than needed because we don't put the same class hierarchy as QMCPACK. This also needs a fix.

Regarding batched NLPP algorithm, QMCPACK has a much better design using VirtualPaticleSet. It is a much more consistent way of handle batching by constructing a set of virtual particles with all the distances updated consistently not just the positions. This is at the cost of memory but the number of quadrature points never increases with system size. So it is fine. At the moment, I introduced a option -P just skipping the NLPP. I don't think it is worth the effort in this part on miniQMC now.

ye-luo commented 5 years ago

@lshulen you expressed yourself very clear. It is me just typing slow.

I disagree. I would prefer that develop contains the simplest possible expression of the algorithm that we need to implement in our forks. Of course for performance testing we should compare to a solid OpenMP / threaded implementation on the host, but in my opinion, not all of our forks should have to build on this threaded implementation as the start. I'm using OpenMP to do the threading on the CPU myself, but the presence of the existing OpenMP pragmas causes a problem for me as Kokkos wants to manage this itself.

I didn't see any disagreement. I'm providing a baseline to express the parallelism with minimal OpenMP. Any fork feel free to change any piece of code. std:threads? We don't know what will happen in the feature. For a particular experiment you are working on, replacing the OpenMP is the correct first step.

PDoakORNL commented 5 years ago

Luke your third issue is a good reason to have an actual functional abstractions around threading. This is actually something that could come into develop without much disruption. Since I'm at the point where I really have a hard fork at this point i.e. it makes more sense to rename develop and one_code than actually merge. If you and @ye-luo are going to actually share develop I won't block that, but I think it would be more polite if you just began to share a trunk outside of develop.

As far as @lshulen #3, while its a little tricky I am trying to keep the build such that Devices that need to be mutually exclusive can avoid seeing each others headers and therefore the need for any ifdef's within those headers, without losing the ability to over time share increasing generic code. Certainly for open MP this is possible. I can swap anything using the basic omp::parallel for an abstraction that instead depends on std::thread. Haven't accomplished such for a kokkos parallel section but as I understand it perhaps a top level of parallelization based on cpu threads running basically autonomous batches isn't the best.

ye-luo commented 5 years ago

I'd like to merge this PR which has be explained most during today's call. I'd like to replace miniqmc with miniqmc_sync_move which carries the flexible driver and rename the old miniqmc to miniqmc_old. Any concern?

lshulen commented 5 years ago

I'm OK with merging. Peter?

PDoakORNL commented 5 years ago

See my comments above. I don't feel any different.

PDoakORNL commented 5 years ago

On futher examination I think this can go in so the miniqmc_sync_move matters, but I don't like the API changes. At the same time some of the functions are now probably easier to port to the one_code design than before. So that is interesting.

QMCPACK / miniqmc

Flexible batched moves #217