Vectorized algorithms not implemented for 1D and 2D geometry (even with the non-adaptive method)

michaeltouati commented 3 years ago

  Dear developpers,

Do you plan to add vectorization of SMILEI for 1D and 2D Cartesian geometry soon? Since it is already implemented for simulation using a 3D Cartesian geometry, I think it wouldn't take a lot of time to extend it to the simpler 1D and 2D Cartesian geometry cases. Waiting for your answer, Best wishes, Michaël

mccoys commented 3 years ago

In fact, vectorization is available in 2D, but not in the adaptive mode. This would require some more investigations. Summary:

In 1D: there is no vectorization available yet.
In 2D: vectorization is available but not in adaptive mode.
In 3D: vectorization is available.
In AM: there is no vectorization available yet.

michaeltouati commented 3 years ago

I'm unlucky. I was interested in 1D Cartesian expensive simulations on a desktop using the threadripper AMD technology (64 cores). I asked the question to know if the developers who implemented the 2D and 3D Cartesian case has planned to do it in 1D fastly (week time scale) or not. Depending on if yes or not, I'm eventually willing to do it (months time scale).

beck-llr commented 3 years ago

There is a reason why it has not been done yet. We estimate that the performance improvement would probably not be significant. We could be wrong of course.

We are currently working on vectorization for AM geometry but there is no schedule for vectorization in 1D for the moment.

michaeltouati commented 3 years ago

  Re-bonjour,

"There is a reason why it has not been done yet. We estimate that the performance improvement would probably not be significant. We could be wrong of course."

The way I see it, it should accelerate every loops even in 1D Cartesian geometry!

"We are currently working on vectorization for AM geometry but there is no schedule for vectorization in 1D for the moment."

Ok. I'll have a look about what has been done in 2D Cartesian case and I'll try to extend it to the 1D one. Can you tell me the source files that should be modified accordingly? I would earn a lot of time.

En te remerciant, Best regards, Michaël

MickaelGrech commented 3 years ago

Ciao Michaël, Just to briefly come back to what @beck-llr wrote, it is not so much that one would expect a small relative gain in 1D, the relative gain could indeed be large in 1D, in particular as one can usually consider very large number of particles per cell. It's more that the absolute gain in CPU time would not be so large at the end for most cases. Indeed, most 1D simulations can usually run really fast, if not on your workstation than just getting to a medium size cluster/supercomputer. When you have finite manpower, you need to make choices :D here, we chose to spent more CPU time & less human/dev time. This being said, are you sure it's a smart move to get involved in this vectorization work?

michaeltouati commented 3 years ago

Coucou Micka, J'espère que tu vas bien.

Yes, it would be wonderful if it is implemented in 1D.

I'm running expensive 1D-3V PIC simulations of TNSA using SMILEI resolving the few nm of rear side layer of Aluminum oxide contaminated by water vapor and hydrocarbures diffusion inside it.

I've already run the simulations with laser pulses from 30 fs FWHM to few hundreds of fs FWHM interacting with a 10 microns thick Aluminum target on tesseract of DIRAC supercomputing center in UK.

I wanted to run the less expensive simulations of 3 microns thick targets with the same pulse durations as mentioned before on my nice computer that has the AMD threadripper CPU (64 cores).

The vectorization would make it more or less ~4 times faster the simulations in both cases according to my experience using OSIRIS using AVX or AVX2 and GNU/OpenMPI compilers concerning other kind of simulations (mainly relativistic collisionless shocks).

I fully understand the problem of human resources. That is why I proposed myself to implement it if it was not plan on the short time scale. I would however earn time if you tell me the source files that should be modified (1D Yee scheme, 1D Esirkepov, 1D current deposit and 1D interpolations of fields and related input deck parameters management). I can basically copy / past what has already been done for the 2D Cartesian case. For sure, it won't be with the "adaptative mode" in a first time such as the already implemented 2D Cartesian case.

Wishing you a good afternoon, Best wishes, Michaël

xxirii commented 3 years ago

Hi Michael,

I can assist you in this work. First I would like to mention few things:

1D vectorization has never been tried because we assumed the gain would be low and 1D simulations are never expensive. Obviously the last assumption is wrong if you have a tremendous number of particles, long simulations or/and do parametric studies (ensemble runs).
Current vectorization has been tuned for Intel compilers and it is very efficient compiled with the Intel compiler. Recent work on ARM and AMD processor shows that the code needs some small modification to run with full vectoization efficiency hen compiled with LLVM or GNU. On AMD processor, Smilei performance (compiled with Intel) is really good but we can see that the difference between scalar and vector modes is now as significant as on Intel processors. This is partly the consequence of the smaller vector register size (AVX2).
You will surely not be able to gain a speedup of 4. First because some parts of the code can not be vectorized and then because the interpolator and projector operators have specific memory access patterns that slow down vecto efficiency.
Note that vectorization induces sorting of the particle. This sorting induces an overhead that has to be compensated by vectorization.

Where you need to work:

The main operators that you have to vectorize are the interpolator and the projector. The pusher is already vectorized for all geometries. Maxwell solver is partially vectorized and quick enough.

You will have to work in the directories src/Interpolator (https://github.com/SmileiPIC/Smilei/tree/master/src/Interpolator) and src/Projector (https://github.com/SmileiPIC/Smilei/tree/master/src/Projector).

If we look at the Interpolator directory, you will see that there is a class per geometry (1D, 2D 3D...), order (2 and 4) and level of optimization (scalar or vecto). For instance Interpolator3D4Order.cpp is the class for the 3D order 4 interpolation and Interpolator3D4OrderV.cpp the same but vectorized.

You will therefore have to create a class called Interpolator1D2OrderV.cpp (at order 2). You can then use the vectorized algorithms in 2D and 3D to create a 1D version. You can also use our paper on vectorization to assist you.

Then to be able to use this class, you have to modify the factory: InterpolatorFactory.h. The factory determines according to the user input the correct functor to call. For instance, if you look at 3D:

        else if( ( params.geometry == "3Dcartesian" ) && ( params.interpolation_order == 2 ) ) {
            if( !vectorization ) {
                Interp = new Interpolator3D2Order( params, patch );
            }
#ifdef _VECTO
            else {
                Interp = new Interpolator3D2OrderV( params, patch );
            }
#endif

you can see that there is a if-condition specific to vectorization, you have to do the same for 1D.

The projector works the same way.

michaeltouati commented 3 years ago

Hello Xxirii,

Thank you very much for all these important pieces of information. It is very clear and, without it, it would have been impossible for me to implement it.

Do you confirm https://doi.org/10.1016/j.cpc.2019.05.001 is the paper that you mentioned?

xxirii commented 3 years ago

Hi,

Do not hesitate to ask any question. Here or in Element. I confirm this is the paper I was mentioning.

michaeltouati commented 3 years ago

Merci beaucoup

xxirii commented 3 years ago

@michaeltouati

For your information, I will test in the next few days the possibility of 1D vectorized operators. I will let you know about the efficiency.

xxirii commented 3 years ago

So, I have tried a vectorized version of the interpolator. You can switch the content of the function Interpolator1D2Order::fieldsWrapper by the following:

void Interpolator1D2Order::fieldsWrapper( ElectroMagn *EMfields, Particles &particles,
                                          SmileiMPI *smpi, int *istart, int *iend, int ithread, int ipart_ref )
{
    std::vector<double> *Epart = &( smpi->dynamics_Epart[ithread] );
    std::vector<double> *Bpart = &( smpi->dynamics_Bpart[ithread] );
    int *iold = &( smpi->dynamics_iold[ithread][0] );
    double *delta = &( smpi->dynamics_deltaold[ithread][0] );

    int nparts = particles.size();

    double * __restrict__ position_x = particles.getPtrPosition(0);

    double * __restrict__ Epart_x= &( smpi->dynamics_Epart[ithread][0*nparts] );
    double * __restrict__ Epart_y= &( smpi->dynamics_Epart[ithread][1*nparts] );
    double * __restrict__ Epart_z= &( smpi->dynamics_Epart[ithread][2*nparts] );

    double * __restrict__ Bpart_x= &( smpi->dynamics_Bpart[ithread][0*nparts] );
    double * __restrict__ Bpart_y= &( smpi->dynamics_Bpart[ithread][1*nparts] );
    double * __restrict__ Bpart_z= &( smpi->dynamics_Bpart[ithread][2*nparts] );

    double * __restrict__ Ex = EMfields->Ex_->data();
    double * __restrict__ Ey = EMfields->Ey_->data();
    double * __restrict__ Ez = EMfields->Ez_->data();
    double * __restrict__ Bx = EMfields->Bx_m->data();
    double * __restrict__ By = EMfields->By_m->data();
    double * __restrict__ Bz = EMfields->Bz_m->data();

    int idx;  //dual index computed
    int ipx;  //prim index computed
    double xjn;
    double xjmxi2;

    double coeffd[3];
    double coeffp[3];

    #pragma omp simd private(xjn, xjmxi2, coeffd, coeffp, idx, ipx)
    for( int ipart=*istart ; ipart<*iend; ipart++ ) {
        //Interpolation on current particle

        // Particle position (in units of the spatial-step)
        xjn = position_x[ipart]*dx_inv_;

        // Dual
        idx      = round( xjn+0.5 );      // index of the central point
        xjmxi  = xjn - ( double )idx +0.5; // normalized distance to the central node
        xjmxi2 = xjmxi*xjmxi;            // square of the normalized distance to the central node

        // 2nd order interpolation on 3 nodes
        coeffd[0] = 0.5 * ( xjmxi2-xjmxi+0.25 );
        coeffd[1] = ( 0.75-xjmxi2 );
        coeffd[2] = 0.5 * ( xjmxi2+xjmxi+0.25 );

        idx -= index_domain_begin;

        // Primal
        ipx      = round( xjn );    // index of the central point
        xjmxi  = xjn -( double )ipx; // normalized distance to the central node
        xjmxi2 = xjmxi*xjmxi;   // square of the normalized distance to the central node

        // 2nd order interpolation on 3 nodes
        coeffp[0] = 0.5 * ( xjmxi2-xjmxi+0.25 );
        coeffp[1] = ( 0.75-xjmxi2 );
        coeffp[2] = 0.5 * ( xjmxi2+xjmxi+0.25 );

        ipx -= index_domain_begin;

        // // Interpolate the fields from the Dual grid : Ex, By, Bz
        Epart_x[ipart] = coeffd[0] * Ex[idx-1]   + coeffd[1] * Ex[idx]   + coeffd[2] * Ex[idx+1];
        Bpart_y[ipart] = coeffd[0] * By[idx-1]   + coeffd[1] * By[idx]   + coeffd[2] * By[idx+1];
        Bpart_z[ipart] = coeffd[0] * Bz[idx-1]   + coeffd[1] * Bz[idx]   + coeffd[2] * Bz[idx+1];

        // Interpolate the fields from the Primal grid : Ey, Ez, Bx
        Epart_y[ipart] = coeffp[0] * Ey[ipx-1]   + coeffp[1] * Ey[ipx]   + coeffp[2] * Ey[ipx+1];
        Epart_z[ipart] = coeffp[0] * Ez[ipx-1]   + coeffp[1] * Ez[ipx]   + coeffp[2] * Ez[ipx+1];
        Bpart_x[ipart] = coeffp[0] * Bx[ipx-1]   + coeffp[1] * Bx[ipx]   + coeffp[2] * Bx[ipx+1];

        //Buffering of iol and delta
        iold[ipart] = ipx;
        delta[ipart] = xjmxi;
    }

}

Note that this version is supposed to update the scalar version and you don't have to activate the vectorization to use it.

I did some tests using the Intel compiler. You have a substantial gain between 1 and 128 particles per cell. The speed-up is then reduced for larger numbers. This is not the typical behavior but here we do not use the cell sorting as it is usually the case in vecto. I suspect the particles to generate more cache misses when loading the fields from the grids a they get mixed. The performance results may depend on the patch size and the case. Smaller patch will increase the patch coherency but increase the time spent in particle synchronization.

interpolator_time_per_particle_per_socket

The code is also written to provide good performance with GNU and LLVM compilers. But I did not test them yet.

mccoys commented 3 years ago

Did we ever try such large numbers of particles per cell in 2D or 3D?

michaeltouati commented 3 years ago

Thank you very much @xxirii. I will test it this week and/or next week using GNU and OpenMPI compilers together with AVX and AVX2 vectorization compiling options on my threadripper desktop.

I agree with you, it should depend on the case, the number of patches and the decomposition domain strategy, especially if no particle sorting is performed. What was the test case that you were simulating to perform this computational time scaling per particle per iteration per socket. If you were simulating a periodic collisionless plasma at Maxwell-Boltzmann equilibrium, it should have avoid this re-increase of computational time.

I agree with you @mccoys : in 2D, when resolving the Debye length to simulate a collisionless plasma, a number of macroparticles per cell higher than 128 is quite scarce so in 3D...

However, when simulating collisional plasmas using the multiple binary Coulomb collision Monte Carlo module, it is quite common to need more than 128 macroparticles per cell.

For my part, I recently had to consider more than 128 macroparticles per cell even for collisionless plasmas when simulating relativistic collisionless shocks on very large time scales in order to limit the numerical heating.

Thanking you a lot again, Best wishes, Michaël

xxirii commented 3 years ago

@mccoys : Yes, I did. In this case the time per particle in vectorized mode stabilizes or slightly increases but the difference with the scalar mode stays high.

@michaeltouati I performed this test using a homogeneous well-balanced Maxwellian plasma. I always use such a case for preliminary performance tests.

mccoys commented 2 years ago

@xxirii Is this issue partly fixed now?

xxirii commented 2 years ago

It is now really an issue but a need for features. It is partially addressed with extension of the vectorization in 2D and vectorization of the interpolator in 1D. I think we can close it and keep it in our roadmap.

SmileiPIC / Smilei

Vectorized algorithms not implemented for 1D and 2D geometry (even with the non-adaptive method) #428