QMCPACK / miniqmc

QMCPACK miniapp: a simplified real space QMC code for algorithm development, performance portability testing, and computer science experiments

Future device imp #198

Closed · PDoakORNL closed this 5 years ago

PDoakORNL commented 5 years ago

This is a demonstration of the CRTP device-isolation architecture, applied to the Determinants. It is guarded by the FUTURE_WAVEFUNCTIONS compiler define, with the new files in QMCWavefunctions/future. check_determinant passes and miniqmc can run.
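
For readers unfamiliar with the pattern, here is a minimal sketch of CRTP device isolation. The names (DeterminantDevice, DeterminantDeviceImpCPU, checkMatrix) are illustrative assumptions for the sketch, not necessarily the identifiers used in this PR:

// Minimal CRTP sketch (illustrative names, not the PR's actual API).
// The generic interface is a class template; each device supplies a
// derived implementation, and calls dispatch statically -- no vtable.
template<class DEVICEIMP>
class DeterminantDevice
{
public:
  void checkMatrix() { static_cast<DEVICEIMP&>(*this).checkMatrix_imp(); }
};

class DeterminantDeviceImpCPU
  : public DeterminantDevice<DeterminantDeviceImpCPU>
{
public:
  void checkMatrix_imp() { /* plain CPU implementation */ }
};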

Currently, both the CPU Determinant implementation from develop and the Kokkos Determinant implementation can be run without recompiling; see check_determinant.

If I can, I will add the OpenMP offload branch's determinant next. Further isolation of other device code will follow.

I'm basing this on the kokkos branch because I have yet to see an open, multiplatform OpenMP offload compiler.

Yes, code is copied from the existing implementations, but the intention is that the files in future should replace those from the past. I'm taking this approach to minimize the inconvenience while staying as attached to the main kokkos development as possible, i.e. there is no good reason to refuse this PR.

I welcome constructive review.

qmc-robot commented 5 years ago

Can one of the maintainers verify this patch?

ye-luo commented 5 years ago

I like the idea, and I was thinking of something very similar, but the implementation can be simplified: use DiracDeterminant<CPU>, DiracDeterminant<CUDA>, and a DiracDeterminantBase class, where CPU and CUDA only define types.

// CPU and CUDA only define types; we make sure the call interfaces used
// by DiracDeterminant are the same between DiracMatrix and DiracMatrixCUDA
// (and between the two Matrix types).
struct CPU
{
  using Matrix = ::Matrix;        // plain host matrix
  using Engine = DiracMatrix;
};

struct CUDA
{
  using Matrix = MatrixCUDA;
  using Engine = DiracMatrixCUDA;
};

template<typename Impl>
class DiracDeterminant : public DiracDeterminantBase
{
  typename Impl::Matrix psiM;
  typename Impl::Engine UpdateEngine;

  double ratio()
  {
    return UpdateEngine.ratio();
  }
};

If we want to force interface consistency, we can give DiracMatrix and DiracMatrixCUDA a common base class but avoid using the base class directly. Is this simpler than adding all the base-class pointers and using runtime polymorphism? Everything is resolved at compile time now.
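
A minimal sketch of that idea, assuming ratio() is the shared interface (names are illustrative):

// The base class exists only to pin down the interface at compile time;
// nothing ever dispatches through a DiracMatrixBase* at runtime.
struct DiracMatrixBase
{
  virtual double ratio() = 0;
  virtual ~DiracMatrixBase() = default;
};

// Forgetting to implement ratio() here makes the class abstract, so the
// first attempted instantiation fails to compile with a clear message.
struct DiracMatrixCUDA : DiracMatrixBase
{
  double ratio() override { /* device-side computation */ return 0.0; }
};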

prckent commented 5 years ago

Why not get rid of the base class at this point? We could have code like Alfredo's matrix-multiply example. Only the interfaces need to match.

ye-luo commented 5 years ago

@prckent In my design, the base class of DiracMatrix and DiracMatrixCUDA is only used to align the interfaces. You can remove the base class, since the runtime polymorphism is not used. The difference between using and not using a base class shows up when you add a new implementation type and need to know what must be implemented. With a base class, the compiler tells you that a base-class virtual function is not overridden in the derived class. Without one, you get tons of errors about an undefined function at every place where it is called, and then good luck finding out what is actually needed. I guess C++ concepts do something very similar, although I'm not very familiar with them. The base class DiracDeterminantBase is important because it is used by the SlaterDeterminants, which hold two pointers of DiracDeterminantBase and dispatch to CPU or CUDA at runtime.
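
For comparison, a hedged sketch of how a C++20 concept could express the same "what must be implemented" contract without any base class (hypothetical names; concepts were not yet standard when this thread was written):

#include <concepts>

// A type models UpdateEngine if it provides ratio() returning something
// convertible to double; a violation is reported at the constraint check
// rather than deep inside the call site.
template<typename T>
concept UpdateEngine = requires(T t) {
  { t.ratio() } -> std::convertible_to<double>;
};

template<UpdateEngine Engine>
class DiracDeterminant { /* ... */ };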

PDoakORNL commented 5 years ago

Here, the reason for DiracDeterminant to continue to exist is so it can hold the non-device-specific implementation. Additionally, since the algorithm can differ based on, for instance, batching or not, this is where that templating would take place. It is also what you compose the device object into.

It may not be clear at this point what the point of the DeterminantDeviceImp class template is, but it allows us to hide device-specific symbols from code that would otherwise see them, because the device implementations need to appear in the include path.
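
As a hedged illustration of that symbol hiding (the enum and file names here are assumptions for the sketch): generic code sees only a declaration, while the device-specific definition lives in a header included by a single translation unit.

// DeterminantDeviceImp.h -- what generic code includes.
enum class Devices { CPU, KOKKOS };

template<Devices D>
class DeterminantDeviceImp;   // declaration only; no device headers here

// DeterminantDeviceImpKOKKOS.h -- included only where Kokkos is visible:
// #include <Kokkos_Core.hpp>
// template<>
// class DeterminantDeviceImp<Devices::KOKKOS> { /* Kokkos views, etc. */ };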

But this should probably be discussed in an issue. This PR should be merged so the kokkos branch doesn't become a fragmented mess and so changes from anyone else working on it remain easy to merge. I would rather not just merge it myself.

ye-luo commented 5 years ago

Agreed that DiracDeterminant<> will be needed to handle the algorithm. In my design, I expect the SPO evaluation to be handled in DiracDeterminant, while the determinant matrix operations inside it are specialized. That way we gain some flexibility: part of the code can be offloaded while the rest stays on the host. @PDoakORNL I'm not the maintainer of the Kokkos branch; I will leave the merge decision to the Kokkos team. In any case, you can always push a new branch to the repo.

lshulen commented 5 years ago

I had a few questions on the other pull request that I will leave there (basically, I don't understand how or why removing the multiple operator() definitions works). That said, I have a few questions here as well.

Basically, I'm confused about the intent. Is the idea that you could have multiple implementations so that we could have Kokkos, OpenMP offload, RAJA, ... backends, i.e. that you are isolating the programming model rather than the target device? From a quick read through the code, it seems like you are trying to have separate CPU/GPU backends rather than programming-model isolation.

PDoakORNL commented 5 years ago

This is just a demonstration, since it is harder to understand the point of things in the main code. But I think we want a good architecture for our physics problem, not for a particular programming model. From my perspective, each programming model might as well be a different backend: they require proprietary data structures, are only supported well by certain compilers and vendors, and may be abandoned in the future. Performance-portability models may give us, say, 80% of the installs we want, but what if they are in the way on crucial HPC systems?

So far we have pursued different device implementations by sprinkling ifdefs all over the code that change the interfaces of central objects. Code that could be generic cannot be.

If I could get Kokkos to build with CUDA, I would implement that as a separate DeterminantDeviceImp, which would offer an interesting opportunity to look at how to share code between them.

ye-luo commented 5 years ago

If I don't use a template but use inheritance directly, I can still achieve the same effect. class DiracDeterminant : public WaveFunctionComponent borrows all the interface; WaveFunctionComponent is a base class with pure virtual functions only and is used by the upper-level SlaterDet, which holds pointers. For the specific implementations, I can have

class DiracDeterminantCPU    : public DiracDeterminant {...};
class DiracDeterminantCUDA   : public DiracDeterminant {...};
class DiracDeterminantKOKKOS : public DiracDeterminant {...};

Then all the device- and implementation-specific data structures can still be placed inside each derived class. @PDoakORNL Do I lose any flexibility?

ye-luo commented 5 years ago

I think we can use templates in the following way and make the code more readable.

template<typename IMPL>
struct Implementation {};

template<>
struct Implementation<CPU>
{
  typedef DiracDeterminantCPU DetClass;
  typedef TwoBodyJastrowCPU   J2Class;
};

template<>
struct Implementation<CUDA>
{
  typedef DiracDeterminantCUDA DetClass;
  typedef TwoBodyJastrowCUDA   J2Class;
};

template<>
struct Implementation<KOKKOS>
{
  typedef DiracDeterminantKOKKOS DetClass;
  typedef TwoBodyJastrowKOKKOS   J2Class;
};

// then, at the point of use, implementations can even be mixed:
DiracDeterminant* det = new Implementation<CPU>::DetClass();
TwoBodyJastrow*   J2  = new Implementation<CUDA>::J2Class();

PDoakORNL commented 5 years ago

This is clever. I'm not sure it really is that straightforward, though, leaving aside that I think the implementation should be composed into the DiracDeterminant object.

Consider a situation like this, where generic code could easily be written:

template<ENUMDeviceType DT>
class Implementation
{
  int foo(int x);
  int bar(int y);
};

// foo is generic across devices...
template<ENUMDeviceType DT>
int Implementation<DT>::foo(int x)
{
  ...
  return another<DT>.doSomething(x);
}

// ...while bar is specialized per device.
template<>
int Implementation<CPU>::bar(int y) { ...specialized... }
template<>
int Implementation<CUDA>::bar(int y) { ...specialized... }

It seems more difficult with the typedef resolution of the device type. I haven't used a plain C++ function pointer in a while, so something like:

class ImpBase
{
protected:
  int foo(int x, int (*funcPtr)(int))
  {
    ...
    return funcPtr(x);
  }
};

class ImplementationCPU : public ImpBase
{
  int foo(int x) { return ImpBase::foo(x, anotherCPU::doSomething); }
  int bar(int y);
};

class ImplementationCUDA : public ImpBase
{
  int foo(int x) { return ImpBase::foo(x, anotherCUDA::doSomething); }
  int bar(int y);
};

It seems awkward to do something like this using inheritance, and the Implementation typedefs leave the actual classes untemplated. Although I could be wrong and it's easy to write. I might be wrong, but somehow I also feel that the funcPtr, which only requires the correct signature, is a little sketchy.

markdewing commented 5 years ago

The DeterminantDevice class looks like a compile-time version of the pImpl idiom. (Just an observation)

What's the issue with using regular inheritance for DeterminantDevice and specific implementations as derived classes?

ye-luo commented 5 years ago

@markdewing By "using regular inheritance", do you mean the example I gave above? https://github.com/QMCPACK/miniqmc/pull/198#issuecomment-437204422

markdewing commented 5 years ago

@ye-luo Yes, I realize now that your example is the same as the question I had.

It looks like defining DiracDeterminant as a more stable API might work: put all the offload and device specifics in the implementation class. However, this doesn't address the question of batching. The original issue was that the CUDA code and the CPU code have two different interfaces. Do we need to add batching to miniqmc so this can be explored?

prckent commented 5 years ago

We do need a clean exploration of batching and simple performance data plus a model. It is worth discussing with the Kokkos and OpenMP folks how to do this usefully in the reference version. What would maximize reuse?

ye-luo commented 5 years ago

@markdewing I put up a prototype of how to do batching in miniQMC. All the multi_XXX interfaces have a fallback and allow specialization. It is currently neutral with respect to any implementation.
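
A hedged sketch of that fallback pattern (the names WaveFunctionComponent, ratio, and multi_ratio are illustrative assumptions): the batched call defaults to a loop over single-walker calls, and a device implementation may override it with a genuinely batched kernel.

#include <vector>

struct WaveFunctionComponent
{
  virtual double ratio(int iel) = 0;

  // Fallback: evaluate a batch of walkers one by one. A CUDA or Kokkos
  // derived class can override this with a single batched kernel launch.
  virtual void multi_ratio(const std::vector<WaveFunctionComponent*>& wfcs,
                           int iel,
                           std::vector<double>& ratios)
  {
    for (std::size_t iw = 0; iw < wfcs.size(); ++iw)
      ratios[iw] = wfcs[iw]->ratio(iel);
  }

  virtual ~WaveFunctionComponent() = default;
};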

PDoakORNL commented 5 years ago

I am working on the batching implementation in miniqmc next; please stay on topic for now. I began to implement that in QMCPACK, but I will not be syncing that again. It's clear that until I completely convince at least some of the Argonne side, no incremental changes can be put in. That is going to require you all to wait until I am done with my version of miniqmc.

As for the devices, there is an implementation-hiding intention here: details of the device implementations should not propagate through the code. But there is also the intention that multiple implementations be available and selectable without introducing manual branching code. I've got that working in my next drop. Unfortunately, I got into einspline_spo as well, which mixes high-level and implementation code in all the implementations, and that is going to take a while to separate into device code and higher-level code.

Ifdefs in implementations (i.e. CUDA or not, KOKKOS or not, OFFLOAD or not) should be minimized, as they produce code that can only do one thing or the other. These cases should be handled through template specializations and should not be mutually exclusive; see the sketch below.
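
A minimal sketch of what "not mutually exclusive" could look like (DeviceType, Kernel, and dispatch are hypothetical names): all specializations are compiled into one binary and chosen at runtime, instead of an #ifdef selecting one path at build time.

enum class DeviceType { CPU, CUDA };

template<DeviceType DT>
struct Kernel
{
  static void run();
};

template<> void Kernel<DeviceType::CPU>::run()  { /* host path */ }
template<> void Kernel<DeviceType::CUDA>::run() { /* device path */ }

// Both paths exist in the same executable; runtime input picks one.
void dispatch(DeviceType dt)
{
  if (dt == DeviceType::CPU)
    Kernel<DeviceType::CPU>::run();
  else
    Kernel<DeviceType::CUDA>::run();
}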

While I think the organizational and runtime-coexistence issues are the most important, I'd also point out that CRTP polymorphism is far more performant than vtable-based polymorphism: https://eli.thegreenplace.net/2013/12/05/the-cost-of-dynamic-virtual-calls-vs-static-crtp-dispatch-in-c. Runtime branching and type checking should be minimized, and use of virtual inheritance should be minimized.

I think the proper interfaces will become clear while refactoring; there is no need to do anything but refactor and see what we actually need. Changing them is not a big deal, since we control the code on both sides; we are not a vendor shipping a library. There is another significant chunk of this coming today, and then I will demonstrate how to deal with the batching. You can look in my repo if you are interested, but I've no interest in comments while I'm working.

PDoakORNL commented 5 years ago

OK, I just went back and tried to push the state where the non-manual switching worked but before the einspline_spo work. Pushing boost::hana into the external libraries buries everything, so I will prepare a new PR soon without hana included.

ye-luo commented 5 years ago

The overhead of a vtable is not something we really care about. In the current QMCPACK setup, we don't see any cost from using runtime polymorphism. We do see slow compilation in the SplineAdoptor portion due to templates, which is probably more critical. If you find DiracDeterminant/einspline_spo too difficult to work with, you can try the two-body Jastrow, which also has Kokkos-specific code.