QMCPACK / miniqmc

QMCPACK miniapp: a simplified real space QMC code for algorithm development, performance portability testing, and computer science experiments

Extracting StandAlone kernel #253

Open TApplencourt opened 4 years ago

TApplencourt commented 4 years ago

Hi,

I am opening this issue to discuss the possibility of extracting key miniQMC kernels into standalone files.

Having standalone kernels will help collaboration between QMCPACK and other ECP projects and vendors.
Such kernels will be easy to install, benchmark, and port to different programming models. This will greatly facilitate the early exploration and validation of new hardware, software, and programming models.

Regards, Thomas

markdewing commented 4 years ago

There is a little bit of work here: https://github.com/markdewing/qmc_kernels

The only kernels present are vector add (not really qmc-specific, but the simplest kernel) and 3D spline.

Possible additional kernels

prckent commented 4 years ago

The plan is to make an officially maintained QMCPACK repository, with splines and updates at first. The idea is that the kernels are clean, zero baggage, well documented, and accessible to non-experts for performance analysis, total refactoring, etc. We have much of the code, but which versions should @TApplencourt start from? I think reference CPU, CUDA, GPU offload, etc. would all be of interest. E.g., @PDoakORNL made fresh CUDA implementations in a fork of miniqmc...

TApplencourt commented 4 years ago

I can start with @markdewing's spline if you (i.e., the QMCPACK community) want.

If I understand correctly, this code handles {double, single} / {real, complex} data types and many more types of splines.

My recommendation is to start with the bare minimum functionality (one type only, for example) and to trim down the rest. It will make the porting and analysis easier.

prckent commented 4 years ago

Please take a careful look at the one in this repo (here, https://github.com/QMCPACK/miniqmc ). I am not sure which branch is best though; someone else will need to chime in. miniqmc knows how to set up various sizes of problems corresponding to NiO, i.e., it is realistic.

prckent commented 4 years ago

I would start with only single precision real. This is the "legacy CUDA" default in mainline and the one used in benchmarks.

TApplencourt commented 4 years ago

@markdewing does your implementation differ from the miniqmc one? I would prefer to start from yours, as it looks simpler. But if they are different, I can trim down the miniqmc one too.

In any case, I will use miniqmc to generate realistic problem sizes.

markdewing commented 4 years ago

I started from the miniqmc version.

For correctness checking, the driver prints a couple of values from the reference implementation and a couple of values from the non-reference version and the user has to compare them manually. This needs to be done better.
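The manual comparison described above could be automated along the following lines. This is only a minimal sketch, not the driver's actual code: the function name `compare_outputs`, the tolerance defaults, and the sample values are all assumptions made for illustration.

```python
import math


def compare_outputs(reference, candidate, rel_tol=1e-5, abs_tol=1e-6):
    """Return the indices where candidate disagrees with reference.

    The tolerances are hypothetical defaults, loosely chosen for
    single-precision results; NaN values are always reported.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    return [
        i
        for i, (r, c) in enumerate(zip(reference, candidate))
        if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Made-up numbers standing in for kernel outputs.
ref = [0.125, -3.5, 42.0]
print(compare_outputs(ref, [0.125, -3.5000001, 42.0]))  # no mismatches expected
print(compare_outputs(ref, [0.125, -3.6, 42.0]))        # index 1 is off
```

Returning the mismatching indices rather than a single pass/fail flag makes it easier to see which spline or grid point went wrong.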

The nx,ny,nz and nspline parameters for a few NiO problem sizes are:

a32-e384 is 112x66x66 with 144 splines
a64-e768 is 112x66x66 with 240 splines
a128-e1536 is 112x66x66 with 408 splines
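To get a feel for these sizes, the sketch below estimates the memory taken by the spline coefficients. It assumes single-precision real values (4 bytes each) and exactly one coefficient per grid point per spline; any padding or boundary coefficients that a real implementation stores are ignored, so treat the result as a lower bound. The names `NIO_PROBLEMS` and `coefficient_bytes` are hypothetical.

```python
# Grid dimensions (nx, ny, nz) and spline counts quoted in the comment above.
NIO_PROBLEMS = {
    "a32-e384": (112, 66, 66, 144),
    "a64-e768": (112, 66, 66, 240),
    "a128-e1536": (112, 66, 66, 408),
}

BYTES_PER_FLOAT = 4  # assumption: single-precision real coefficients


def coefficient_bytes(nx, ny, nz, nspline, bytes_per_value=BYTES_PER_FLOAT):
    """Lower-bound size of the coefficient table, ignoring any padding."""
    return nx * ny * nz * nspline * bytes_per_value


for name, dims in NIO_PROBLEMS.items():
    print(f"{name}: {coefficient_bytes(*dims) / 2**20:.1f} MiB")
```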

PDoakORNL commented 4 years ago

It would be quite easy to take these https://github.com/PDoakORNL/miniqmc/tree/one_code/src/Numerics/Spline2/test and make a standalone repo with "my" spline kernel. Should I do that?

prckent commented 4 years ago

It looks like Peter's code has CPU, CUDA and Kokkos already. Peter - are/were these all working? It might well be better for Thomas to start with these since they look like a more comprehensive starting point.

prckent commented 4 years ago

@markdewing Those spline counts look very strange to me, but perhaps I misunderstand? a32-e384 = 32 atoms and 384 electrons, so 192 electrons per spin = 192 splines. The others should be multiples of this number.
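The counting argument above can be written down as a small helper. This is a sketch under the assumptions stated in the comment: the problem name follows the `a<atoms>-e<electrons>` pattern, and the electrons split evenly over two spins with one orbital per electron. The function names are hypothetical.

```python
import re


def parse_problem_name(name):
    """Split a name such as 'a32-e384' into (atoms, electrons)."""
    match = re.fullmatch(r"a(\d+)-e(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized problem name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def orbitals_per_spin(name):
    """Assume an even split of electrons over two spin channels."""
    _, electrons = parse_problem_name(name)
    if electrons % 2:
        raise ValueError("expected an even electron count")
    return electrons // 2


for name in ("a32-e384", "a64-e768", "a128-e1536"):
    print(name, orbitals_per_spin(name))
```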

Thomas: The grid size corresponds to the primitive cell, i.e. we assume we are doing tiling for the larger cells and running bulk systems, as we do for the ECP and CORAL benchmarks.

PDoakORNL commented 4 years ago

Yes, but I should probably merge to the main repo again. The onecode in my current branch is the current state. The Kokkos version had been dropped at that point, so I don't believe it works anymore. I started to look at extracting just the batched/blocked spline eval yesterday; I think it could be made fairly compact, especially if some of the variants are deleted or templated.

markdewing commented 4 years ago

@prckent I took the numbers from QMCPACK. My understanding is that the splines are complex, and depending on the k-point, some of the values are converted to two orbitals, and some are not (in assign_v). Is this correct? Maybe this is not necessary for a kernel: using real splines, with the number of splines equal to the number of SPOs, is sufficient.

prckent commented 4 years ago

Yes, that explains the difference.

e.g. For the a32-e384 performance test we can see this on the line "NumDistinctOrbitals 144 numOrbs = 192" https://cdash.qmcpack.org/CDash/testDetails.php?test=7697041&build=108519
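The two numbers on that line can be reconciled with a little arithmetic. The sketch below assumes, as described earlier in the thread, that each distinct complex spline yields either one or two orbitals, so `single + doubled = NumDistinctOrbitals` and `single + 2 * doubled = numOrbs`. The function name is hypothetical.

```python
def split_spline_counts(num_distinct, num_orbs):
    """Return (single, doubled): splines giving one orbital vs. two.

    Assumes every distinct spline contributes either one or two orbitals.
    """
    doubled = num_orbs - num_distinct
    single = num_distinct - doubled
    if doubled < 0 or single < 0:
        raise ValueError("counts inconsistent with one or two orbitals per spline")
    return single, doubled


single, doubled = split_spline_counts(144, 192)
print(single, doubled)  # 96 48
```

Under that assumption, the a32-e384 case has 96 splines used once and 48 used twice, which gives 96 + 2 * 48 = 192 orbitals from 144 splines.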

TApplencourt commented 4 years ago

It took longer than expected[*], but Kevin and I made some progress on extracting the inner vgh-float kernel. It's really preliminary, but you can find it here: https://github.com/TApplencourt/nanoQMC.

May I ask the people on this thread for a review? I'm not sure if we initialize the input correctly.

Do you know of a sanity check I can run on the output to verify we aren't doing anything stupid? (The norm should be 1, or something like that...) Right now, the Hessian and gradient values look suspiciously large...

The next step is to create more robust testing, then put the outer_loop back, and then port it to multiple programming languages.
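One generic check that does not depend on the physics is to compare the kernel's gradient with central finite differences of its own values. The sketch below illustrates the idea on a smooth analytic function standing in for the spline evaluation; `central_gradient`, `smooth_function`, the step size, and the tolerance are all assumptions, and nothing here comes from nanoQMC itself.

```python
import math


def central_gradient(f, point, h=1e-4):
    """Approximate the gradient of f at point with central differences."""
    grad = []
    for axis in range(len(point)):
        plus, minus = list(point), list(point)
        plus[axis] += h
        minus[axis] -= h
        grad.append((f(plus) - f(minus)) / (2.0 * h))
    return grad


def smooth_function(p):
    """Stand-in for a kernel's value output (hypothetical test function)."""
    x, y, z = p
    return math.sin(x) * math.cos(y) * math.exp(-z)


def analytic_gradient(p):
    """Stand-in for a kernel's gradient output."""
    x, y, z = p
    e = math.exp(-z)
    return [
        math.cos(x) * math.cos(y) * e,
        -math.sin(x) * math.sin(y) * e,
        -math.sin(x) * math.cos(y) * e,
    ]


point = [0.3, 0.7, 0.2]
numeric = central_gradient(smooth_function, point)
exact = analytic_gradient(point)
print(max(abs(n - a) for n, a in zip(numeric, exact)))
```

The same pattern extends to the Hessian by differencing the gradient. If the kernel works in grid units rather than physical units, the two sides need the same convention before they are compared.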


[*] I would like to be able to say that it is because I work from home and have to take care of my young child. But as far as I know, I don't have a toddler...