ORNL / cpp-proposals-pub

Collaborating on papers for the ISO C++ committee - public repo

Type concerns about scaled(a, m) (conversion/mixed precision; composition of scaled()) #249

Open PhilMiller opened 2 years ago

PhilMiller commented 2 years ago

Conversation with @mhoemmen where I initially brought this up:

Phil Miller 6:36 PM There’s an interesting quirk to the folding of nested scaled applications - the product of the scaling factors will be computed in a different type than the underlying right-associated multiplication against the ElementType would produce. I think this is mostly fine and good, but I can come up with at least one example where this can lead to bad behavior. Consider mdspan<double,…> m; scaled(1 << 20, scaled(1 << 20, m)) - the ProductScalingFactor is int, and its value will overflow, yielding UB, rather than scaling by 2^40 as it would appear it should, and readily could if the calculation were in double. Even unsigned 1u << 20 in both instances would produce wrong results, though not UB.

Mark Hoemmen 8:44 PM [snip] We definitely need to give those proxy references more attention in the proposal. I would ideally like those expressions to behave as if they were typed out by hand in a for loop.

Phil Miller 10:35 PM I think the wording as written gets you the behavior “as if they were typed out by hand in a for loop” - the multiplication will be left-associated, and so the product of all the scaling factors will be formed in whatever type they promote/convert to before they touch the element type. It’s not unreasonable that multiplication by a stack of integer factors that ultimately overflow an int should be something for users to avoid. It feels weird to me, I think, because the scaled(b, scaled(a, m)) expression seems to naturally correspond to a right-associated explicit parenthesization of the same - e.g. (b * (a * m)).

I just thought of another case that would potentially suffer from this, just barely less contrived. Consider a painfully naive LU factorization that operates element-by-element rather than in blocks. With each value d along the diagonal, the remainder of the matrix will be scaled by its reciprocal 1/d. The product of those reciprocals could go subnormal or underflow, even if the values being scaled would be fine if actually divided step-wise. This doesn’t involve any type conversions, just the quirk of evaluation order.

The commonality between these is that multiple layers of scaled may be composed in settings where the product of the scaling factors runs up against numeric representation limits, where the value being scaled would not.

Mark Hoemmen 7:26 AM "Multiple layers of scaled " is perhaps the key phrase here. The expression templates were Christian's idea, to reduce the number of parameters that BLAS functions have to take, and to make it possible to express alpha * x without needing users to supply a 1.0. It's a good idea, but it's caused a bit of divergence from the BLAS (P1673R8 has a discussion about this with regards to Triangle). The idea was always that scaled was a way to express setting a scaling factor, and that multiple instances of scaled in an expression are an antipattern. I'm more pessimistic about users' willingness to respect design intent. It would be reasonable to expect scaled(b, scaled(a, m)) to parenthesize like b * (a * m[k]).

Phil Miller 9:25 AM In any case, I think even with misuse, actual problems resulting from it will be rare. Lots of things have to go wrong together for it to work out badly.

Mark Hoemmen 9:26 AM C++ Standard Library design is a bit more like nuclear power plant control system design than library design

Phil Miller 9:30 AM Eh, there are plenty of places where it’s perfectly well accepted that misuse can lead to things going wrong. NPP design review doesn’t really accept “results in undefined behavior”.

Mark Hoemmen 9:31 AM True -- on the other hand, I would not like scaled(b, scaled(a, x)) to be UB ;-)

Phil Miller 9:32 AM The examples I’ve come up with have all been pretty extreme corner cases. The common cases of a few static factors or reasonable numbers from input data should be fine. And it’s all limited by only mattering when the left-association breaks but the right-association would be fine. Something to think about, but my inclination is that it should probably be noted as a hazard, and left as is, because it is an important optimization.

PhilMiller commented 2 years ago

I think the ideal outcome of this would be to fold the scaling factors from right to left, with types as if the fold were initialized by element_type{1.0} -- i.e. its multiplicative identity. For cases where element_type is an arithmetic type, we could just do that.

I don't think the standard specifies a way to ask for the multiplicative identity of an arbitrary user-defined number type. As you've noted elsewhere, you'd like to avoid adding type traits for random numeric bits and bobs to the standard as part of specifying std::linalg.

It would be workable but somewhat unprincipled to convert the right-most scaling factor to decltype(scaling_factor * declval<ElementType>()), and then proceed from there - the conversion could hypothetically entail something weird, rather than preserving the value as faithfully as possible.

PhilMiller commented 2 years ago

Conversely, how cautious are users expected to be in not inadvertently widening their calculations?

If one had mdspan<float, ...> m, and said scaled(2.0, m), rather than scaled(2.0f, m), that's going to come out as double, and all of the subsequent arithmetic will run at half the throughput.

That seems like a big and quite likely pitfall.

@crtrott

crtrott commented 2 years ago

Note this only became a problem due to the attempt to undo multiple nested scalings; if we didn't try to undo it, you are back to the () case. Furthermore, we can fix this by just defining the ProductScalingFactor differently: decltype(alpha * (x.scaling_factor() * declval<ElementType>())) - done. This is not that hard ...

Regarding the widening: the design very explicitly added ScalingFactor as a way to change the type you do the calculation in. Say you store your vector in bhalf or something like that, but now you want to scale it out; that potentially requires widening.

If we really don't want that (and we can ask LEWG on this, I am not really set in stone on this part) we can just get rid of the ScalingFactor template parameter everywhere, and say the ScalingFactor we store and use to compute is ElementType.

mhoemmen commented 2 years ago

If one had mdspan<float, ...> m, and said scaled(2.0, m), ....

  1. Some BLAS libraries support mixed-precision computations, so using double might even improve accuracy without reducing performance too much.
  2. Implementations could notice while unpacking scaled(2.0, A) that 2.0 fits without rounding in float (or that 2.0 is a commonly encountered constant) and dispatch to an all-float BLAS optimized routine.
  3. P1673 might allow (2) for all scalars of type double, not just values that fit in float. See [linalg.reqs.flpt] (though determining this would require a careful reading of "logarithmically stable").

I would prefer not to override existing behavior of C++ arithmetic types. If the user gives us double * float, we think they should get what the language gives them. I don't want to have to look into some custom traits hierarchy to figure out the value type of scaled(2.0, A) for a matrix A of float.

PhilMiller commented 2 years ago

My concern is about users getting upgraded/mixed precision inadvertently, because the parent interface potentially makes it too easy to do so.

For contrast, the classic DGEMM and SGEMM take their scalar coefficients in the same type as the underlying matrices, in both the Fortran and C APIs. Following established practice in that respect would suggest that the element type of a scaled accessor should match the element type of whatever it's scaling, and disregard the type of the scaling factor. In real arithmetic, I think that's likely to be fine, but if one wanted to scale a real matrix by a complex factor (why, I don't know), it wouldn't be.

Inspired by NumPy, would it make sense to offer an as_cast_to_type accessor that could explicitly provide conversion for the intentional promotion case, as from real to complex, or int to float, and otherwise limit the type impact of scaling? EDIT: I think such an accessor should be produced by a function analogous to scaled called converted<T>

PhilMiller commented 2 years ago

The mixed-precision BLAS specifies all scalar arguments in double precision for the mixed-precision cases, and in precision matching the vectors/matrices in the mixed real/complex cases. In no case does the type of the scalar parameters drive the precision of the computation.

PhilMiller commented 2 years ago

@mhoemmen regarding implementation freedom, I think the type impact of applying a scaled accessor is definitely something that should be specified and consistent across platforms, rather than something that implementations will decide arbitrarily. Otherwise, users who want a particular behavior will be unable to spell that intention in a portable fashion.

PhilMiller commented 2 years ago

All of my input here is at least partially colored by recent work I've been doing on implementing mixed-precision configurations of an application, where some parts are single, and other parts are double, and conversions happen at relatively carefully chosen boundaries. One thing that was pointed out to me is that float/double conversion is actually not especially cheap on Nvidia GPUs, contrary to my assumptions.

mhoemmen commented 2 years ago

Consider mdspan<double,…> m; scaled(1<<20, scaled(1 << 20, m)) - the ProductScalingFactor is int, and its value will overflow, yielding UB, rather than scaling by 2^40 as it would appear it should, and readily could if the calculation were in double.

This will be fixed in R9. scaled_scalar will do the product immediately, and the resulting value type will be decltype((1 << 20) * declval<decltype((1 << 20) * declval<double>())>()) - that is, double.