Add preconditioning mass to Clover operators

martin-ueding commented 7 years ago

Ideally one has M^\dagger M + \rho^2 where \rho is the preconditioning mass. The solvers in QPhiX apply M and then M^\dagger in two steps. This \rho should not live within the solver but rather in the operator. The mathematical equivalent would be implement M + i \rho in the operator. That is rather easy:

Add another ctor argument to the clover odd-odd-operator. Give it a default value of 0.0
If that mass is non-zero, use some AXPY routines to apply this + i \rho.

kostrzewa commented 7 years ago

Sorry, I made a mistake. The shift needs a factor of gamma5 to work, since M is not hermitian.

 (M+i\rho\gamma_5)^dag (M+i\rho\gamma_5)
= (M^dag - i\rho\gamma_5) (M+ i\rho\gamma_5)
= M^dag M + i\rho M^dag \gamma_5 - i\rho \gamma_5 M + \rho^2
= M^dag M + i\rho \gamma_5 M \gamma_5 \gamma_5 - i\rho \gamma_5 M + \rho^2
= M^dag M + \rho^2

Do you agree?

martin-ueding commented 7 years ago

Here is the same thing rendered:

dagger

It looks good, so I think we need the \gamma_5. That means that it is not a BLAS operation but one would have to write a site-loop with a transformation. Also possible, will be just a bit more work. And I already see that I will try to refactor the BLAS stuff such that it is less code duplication.

bjoo commented 7 years ago

Hi Bartosz, Do you mean to the regular clover Op? Do you want to change the interface to the linop (ie always add it and have it zero, or add a branch?

Best, B

On Jun 14, 2017, at 9:18 AM, Bartosz Kostrzewa notifications@github.com wrote:

Sorry, I made a mistake. The shift needs a factor of gamma5 to work, since M is not hermitian.

(M+i\rho\gamma_5)^dag (M+i\rho\gamma_5) = (M^dag - i\rho\gamma_5) (M+ i\rho\gamma_5) = M^dag M + i\rho M^dag \gamma_5 - i\rho \gamma_5 M + \rho^2 = M^dag M + i\rho \gamma_5 M \gamma_5 \gamma_5 - i\rho \gamma_5 M + \rho^2 = M^dag M + \rho^2

Do you agree?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

Dr Balint Joo High Performance Computational Scientist Jefferson Lab 12000 Jefferson Ave, Suite 3, MS 12B2, Room F217, Newport News, VA 23606, USA Tel: +1-757-269-5339, Fax: +1-757-269-5427 email: bjoo@jlab.org

martin-ueding commented 7 years ago

This is what I had in mind:

  EvenOddCloverOperator(SU3MatrixBlock *u_[2],
                        CloverBlock *clov_,
                        CloverBlock *invclov_,
                        Geometry<FT, veclen, soalen, compress12> *geom_,
                        double t_boundary,
                        double aniso_coeff_s,
                        double aniso_coeff_t,
                        double const prec_mass = 0)

  inline void operator()(FourSpinorBlock *res,
                         const FourSpinorBlock *in,
                         int isign,
                         int target_cb = 1) const override
  {
    double beta = (double)0.25;
    int other_cb = 1 - target_cb;
    D->dslash(tmp, in, u[other_cb], invclov, isign, other_cb);
    D->dslashAChiMinusBDPsi(
        res, tmp, in, u[target_cb], clov, beta, isign, target_cb);
    if (prec_mass != 0.0) {
        // Add `i * prec_mass * gamma_5`
    }
  }

kostrzewa commented 7 years ago

Hi Balint, yes, we would change both clover and tmclover for this, but with \rho set to zero as a default parameter, so client code which doesn't use it will not be affected. There won't be any impact on performance either since the conditional is on the outside of any site loops. We use this sort of shift to precondition the clover and tmclover HMC because this allows you to compute the clover term once and re-use it.

kostrzewa commented 7 years ago

Of course, this will break with BiCGstab, but it should just not be used there.

bjoo commented 7 years ago

Yes, I think I have such a term on the regular clover op in Chroma. It would be useful to do Hasenbusch style tricks in the original way, rather than changing the mass, so this is a useful addition.

What I was wondering is whether we should fold it into the dslashAchiMinusBDPsi operator generically?

If you add it on as in the code snippet above, one will have to re-read the 'result' from memory/cache.

If it was folded into the 'A' term one could add it site by site. Is it not just an addition of i rho psi for the upper spin components and -irho psi for the lower ones? The annoyance is that then we will probably want to do it even if the factor is 0, to avoid branching and of course we would need to change the current existing achiMinusBDPsi kernel(s)

kostrzewa commented 7 years ago

If you add it on as in the code snippet above, one will have to re-read the 'result' from memory/cache. If it was folded into the 'A' term one could add it site by site. Is it not just an addition of i rho psi for the upper spin components and -irho psi for the lower ones? The annoyance is that then we will probably want to do it even if the factor is 0, to avoid branching and of course we would need to change the current existing achiMinusBDPsi kernel(s)

Indeed, we wanted to avoid affecting performance for people who don't use it. For us it would be better to fold it into achiMinusBDPsi because it would be faster. On second thought, however, given current arithmetic intensities, maybe it would be better to really actually just do that? Adding zero (or i\rho -i\rho) will have minimal impact I guess.

bjoo commented 7 years ago

Hi Bartosz, Yes, that is what I was thinking.. But it involves a bit of extra lifting… We will now need to pass rho down to the kernels, and change the various calling interfaces, etc. If someone wants to undertake this, I would be happy if they started the work rooted in the cmake-build branch. This seems to build nicely with the whole new infrastructure (if Python3 and Jinja are available) so we should think about converging on it.

Best, B

On Jun 14, 2017, at 9:43 AM, Bartosz Kostrzewa notifications@github.com wrote:

If you add it on as in the code snippet above, one will have to re-read the 'result' from memory/cache. If it was folded into the 'A' term one could add it site by site. Is it not just an addition of i rho psi for the upper spin components and -irho psi for the lower ones? The annoyance is that then we will probably want to do it even if the factor is 0, to avoid branching and of course we would need to change the current existing achiMinusBDPsi kernel(s)

Indeed, we wanted to avoid affecting performance for people who don't use it. For us it would be better to fold it into achiMinusBDPsi because it would be faster. On second thought, however, given current arithmetic intensities, maybe it would be better to actually change the kernels? Adding zero (or i\rho -i\rho) will have minimal impact I guess.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

Dr Balint Joo High Performance Computational Scientist Jefferson Lab 12000 Jefferson Ave, Suite 3, MS 12B2, Room F217, Newport News, VA 23606, USA Tel: +1-757-269-5339, Fax: +1-757-269-5427 email: bjoo@jlab.org

martin-ueding commented 7 years ago

Sure, baking that into the kernel would make sense from a performance point of view. The additional operations might not hurt in the kernel at all since our arithmetic intensity is still low.

We could of course add more kernels, effectively doubling the number of achimbdpsi kernels that we have. With the new build system, this is just a matter of compilation time, not memory; we could think about this.

The impact is a bit lower than the twisted boundary conditions, I guess. We could also make some measurements and see whether our arithmetic intensity is so low that we could get this addition for free. Then the kernels would be changed but their number would not increase.

So we can approach this in different ways:

Add it into the odd-odd operator with if (prec_mass != 0.0) and suffer from additional memory accesses. Implementation will be relatively easy and the kernels do not have to be changed.
Add it into the kernel and run those operations all the time, even if that is + 0. The arithmetic intensity is probably low enough that it works.
Generate new achimbdpsi kernels and and call the desired kernel to not waste any operations in the kernel.

I guess I will implement that eventually. And using the new build system for kernel work makes a lot of sense because the amount of code that needs to be changed will be much less. If the take the first approach, it does not matter really, I believe.

bjoo commented 7 years ago

One can do it incrementally I suppose:

I think the extra arithmetic is very minor: it is essentially 2 FMAs per spin-color componen or 48 FLOPs per site

Regarding the intensity: the combined number of operations on the site is something like 552 (for the Clover) and 1320 for the Dslash +another 2x12 FMAs for combining the two with a scale factor, so something like 1920 FLOPs. This is about a 2.5% increase in FLOPS and no change in bytes. So I think if we are memory bound we shouldn’t see any effect: we are far from the balance point. If anything this should help :) — unless it causes some register spillage or other problem.

A bigger impact would be that we’d need to change the codegen interfaces (take now the new rho parameter). probably of the type FT, and then when calling we’d need to make sure that it is set. My convention has been that users specify these as doubles, and we cast them to the ‘ArtihmeticType’ (AT) using the rep<> template prior to the call.

Best, B

On Jun 14, 2017, at 10:03 AM, Martin Ueding notifications@github.com wrote:

Sure, baking that into the kernel would make sense from a performance point of view. The additional operations might not hurt in the kernel at all since our arithmetic intensity is still low.

We could of course add more kernels, effectively doubling the number of achimbdpsi kernels that we have. With the new build system, this is just a matter of compilation time, not memory; we could think about this.

The impact is a bit lower than the twisted boundary conditions, I guess. We could also make some measurements and see whether our arithmetic intensity is so low that we could get this addition for free. Then the kernels would be changed but their number would not increase.

So we can approach this in different ways:

• Add it into the odd-odd operator with if (prec_mass != 0.0) and suffer from additional memory accesses. Implementation will be relatively easy and the kernels do not have to be changed. • Add it into the kernel and run those operations all the time, even if that is + 0. The arithmetic intensity is probably low enough that it works. • Generate new achimbdpsi kernels and and call the desired kernel to not waste any operations in the kernel. I guess I will implement that eventually. And using the new build system for kernel work makes a lot of sense because the amount of code that needs to be changed will be much less. If the take the first approach, it does not matter really, I believe.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

Dr Balint Joo High Performance Computational Scientist Jefferson Lab 12000 Jefferson Ave, Suite 3, MS 12B2, Room F217, Newport News, VA 23606, USA Tel: +1-757-269-5339, Fax: +1-757-269-5427 email: bjoo@jlab.org

martin-ueding commented 7 years ago

Changing the interfaces is not that big of a deal, one just has to do this in two to three Jinja templates, the call sites in the private function of the Dslash classes and the constructors of the ClovDslash and LinearOperator classes. That should be doable pretty quickly.

The user should supply those as double and we only use ArithmeticType at the very end. Using FT is a pain because one cannot use the half precision type without the intrinsics.

Okay, so we want the implementation that is baked into the kernel and does the addition in every case, even if it is zero? That sounds like a fast (developer time) and fast (runtime) option.

bjoo commented 7 years ago

Sounds good. Martin will you undertake this? Best, B

On Jun 14, 2017, at 11:17 AM, Martin Ueding notifications@github.com wrote:

Changing the interfaces is not that big of a deal, one just has to do this in two to three Jinja templates, the call sites in the private function of the Dslash classes and the constructors of the ClovDslash and LinearOperator classes. That should be doable pretty quickly.

The user should supply those as double and we only use ArithmeticType at the very end. Using FT is a pain because one cannot use the half precision type without the intrinsics.

Okay, so we want the implementation that is baked into the kernel and does the addition in every case, even if it is zero? That sounds like a fast (developer time) and fast (runtime) option.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

Dr Balint Joo High Performance Computational Scientist Jefferson Lab 12000 Jefferson Ave, Suite 3, MS 12B2, Room F217, Newport News, VA 23606, USA Tel: +1-757-269-5339, Fax: +1-757-269-5427 email: bjoo@jlab.org

martin-ueding commented 7 years ago

This should be completed. I have tested it with a point source and it seems to correctly add it. The automated tests perform \rho = 0 and that works out.

Perhaps it would be nice to have a check against QDP++ by also adding i \rho \gamma_5 \chi to the result. I presume that is rather easy with QDP++ spinors and operator overloading. @bjoo How can I get an i and a \gamma_5 from QDP++ in the QPhiX tests?

JeffersonLab / qphix