Mixed precision functionality

nbeams commented 3 years ago

This issue is for discussions related to adding mixed precision capabilities to libCEED, including:

Which user-facing functions should have mixed precision capability?
How should the libCEED interface support this (what will the user need to do)?
What combinations should be available (e.g. mix-and-match input and output precisions, ability to decouple input/output precision from precision of computations, ...)
Potential for backend specializations, and how this will fit with the interface changes

An idea from @jedbrown was to handle precision conversions through the CeedVector objects, which could be modified to store multiple pointers for different precisions, along with a tag to indicate which precision that vector's data currently uses.

jedbrown commented 3 years ago

I think qfunction variation is the hardest part. Do we need to support different arguments to the qfunction having different precisions or can they be homogeneous (all arguments in matching precision)? Let's defer this part for later.

As a first step, I would add functions like CeedVectorGetArray_f32(CeedVector, CeedMemType, float **) and have the implementation convert as needed. We can use C11 _Generic so C11 users don't need to specify these types and C++ overloading to do the same for C++ users.

We can have a macro CEED_SCALAR_F32 that users can define before including ceed.h so that C99 callers get suitable CeedScalar and appropriate specialization. Perhaps better is to provide #include <ceed/ceed_f32.h> that sets the macro and includes ceed/ceed.h). This would be for callers who want to choose precision at compile time.

We then add a CeedElemRestrictionSetScalarType(CeedElemRestriction, CeedScalarType evec_type) that converts if necessary (when provided with an L-vector). This will require visiting some backends for the cross-type packing, but I don't think will be a lot of code.

I kinda think CeedBasis public interfaces (setting matrices and doing linear algebra) could just stay double precision all the time, and find it more compelling to possibly add __float128 than to support direct specification in single precision (they can be converted as needed behind the scenes).

We can have CeedVectorGetArrayUntyped(CeedVector, CeedMemType, CeedScalarType, void *) to make life easier for dynamic callers who prefer to cast somewhere else.

nbeams commented 3 years ago

I think qfunction variation is the hardest part. Do we need to support different arguments to the qfunction having different precisions or can they be homogeneous (all arguments in matching precision)? Let's defer this part for later.

Just to make sure I'm clear: are you referring to the user-provided per-point QFunction, the libCEED QFunction interface, or both?

Regarding CeedBasis, when users can tolerate lower precision for their application(s), they could potentially get some nice speedup from single precision CeedBasisApply in the non-tensor case. Specifically, for the MAGMA backend, with the GEMM/batch GEMM computations, the ability to use SGEMM instead of DGEMM could be useful. I don't know if this is considered too narrow of a use case (non-tensor basis and no need for double-precision accuracy in the basis computations)...also, I guess CeedBasisApply is different than the basis creation functions since it doesn't take any CeedScalar arguments, only CeedVectors. If CeedVector knows about its type, a lot of the "overloading" could be handled by the backends inside the backend-specific implementations. I guess this goes back to the question of whether we want to be able to specify computation precision, or have it deduced from the types of the CeedVectors? I'd argue there would likely be interest in being able to do computations in double while input/output are single, but also to be able to do everything in single. Edit: was the intention that _f[something] would be added to routines to specify computation precision, but not necessarily input/output?

jedbrown commented 3 years ago

CeedBasis has several interfaces that take CeedScalar *, and I think there's no harm in making those all double (perhaps with complex variants at some point). It can be converted internally. We have several right now

Outer solver/Krylov vector storage (PETSc, MFEM, etc.) which gets placed into a CeedVector as an L-vector.
E-vector representation (ephemeral)
CeedBasisApply
Q-vectors processed by user-provided QFunction

We could make CeedElemRestriction have no opinion about types, in which case we would convert as needed. Then L-vectors, CeedBasis, and CeedQFunction would have a designated precision. The precision of E-vectors would be inferred to match the compute precision of CeedBasis (converting as needed to/from the L-vector precision).

A simplification would be to make CeedBasis precision always match the QFunction precision.

To make this incremental, I think it will be simpler to start with the L-vectors and get that merged, then work inward.

nbeams commented 3 years ago

I'm not sure I understand what your list is intended to represent. Cases where we use CeedScalar * directly as a type? CeedBasisApply only takes CeedVectors for input/output. E-Vectors are also represented as CeedVectors. So I must be missing something.

I like the idea of the element restriction being able to infer the output (or input, in transpose mode) type, but I'm not sure how it would work in practice. The CeedElemRestriction doesn't know about the CeedBasis or CeedQFunction, so it can't know which precision they use unless we tell it. I think your initial suggestion of a way to set the precision of the element restriction may be necessary to keep the current separation between/independence of CeedOperator "pieces" (element restriction, basis, qfunction).

jedbrown commented 3 years ago

The list included internal representations (E-vectors), and I intended it to identify each place where precision can be varied in G^T B^T D B G. I think it's a significant simplification in the code if we can make the precision of B match the precision of the Q-function D -- it still decouples L-vector precision from compute precision. Note that each (active or passive, input or output) field provided to a CeedOperator could have a different precision. This does restrict algorithmic space somewhat, but I don't know how serious it is. On CPUs, I think it's not a big deal because E-vectors only ever exist in cache. It may be a lot more significant for GPU kernels that need to store complete E-vectors.
Backends can always fuse CeedElemRestriction into whatever else they do. The CPU backends, for example, never have an actual E-vector, just a bit of workspace for one element (/serial) or a batch (/blocked). That said, we have two vectors, so if those vectors have preferred precision set, then all the necessary information is there.
```
int CeedElemRestrictionApply(CeedElemRestriction rstr,
CeedTransposeMode t_mode, CeedVector u, CeedVector ru, CeedRequest *request);
```
The CeedOperator backend gets to choose how to represent the E-vector data so even if it just calls CeedElemRestrictionApply, that function knows what precisions to work with.

nbeams commented 3 years ago

Right, I guess this raises another important question: how much do we want to limit the multiprecision functionality to things that happen inside a CeedOperatorApply? I guess if CeedVector provides a function for converting to other precisions, that would cover any cases where a user (or library) wants to do a stand-alone CeedElemRestrictionApply for something and have the output in a different precision -- should such a situation arise. Or we could specify that you should set the precision of the input/output vectors ahead of time, as suggested.

Though even if we are only considering the case where the Operator is setting the preferred element restriction output precision based on all the information it has, the function that lets you set the element restriction precision might still make more sense than relying on the input/output CeedVectors having their precisions set. I'm thinking this would be more consistent across backends, including those that never set up E-Vectors as CeedVector objects, like cuda-gen -- which does still get information from the element restrictions while building the code string, despite not creating E-Vectors (or using CeedElemRestrictionApply).

jedbrown commented 3 years ago

I figured CeedVectorGetArray() and the like would be able to mirror data to any requested precision, but CeedElemRestrictionApply would inspect the available precisions of the input vector and the preferred precision of the output vector and apply without needing to convert the input.

I'd think cuda-gen could still propagate the desired (ephemeral) E-vector precisions from the Basis. One thing I'd like to facilitate is having a residual in double precision and Jacobian action in single, or the Jacobian in double and a preconditioner in single.

nbeams commented 3 years ago

I'd think cuda-gen could still propagate the desired (ephemeral) E-vector precisions from the Basis.

Yes, it could, I just thought it might be preferable to have more consistency across the CUDA/HIP backends, such that they could all get/set this information in the same way. But cuda(hip)-gen is so different from the other backends already that perhaps that's not really a selling point.

jedbrown commented 3 years ago

Do you think it'll be hard to handle consistently if CeedElemRestriction is precision-agnostic (uses its arguments rather than set as state)?

My concern is that when we wire up several different operators, we'd like for the data (integers) of a CeedElemRestriction to be shared between different precisions. That's easier if it's just the same object, rather than needing to reference count the backing storage.

nbeams commented 3 years ago

Do you think it'll be hard to handle consistently if CeedElemRestriction is precision-agnostic (uses its arguments rather than set as state)?

I've been thinking about this, and I haven't yet come up with any problems (not that that means there couldn't be any, of course). I was initially concerned about CeedElemRestrictionCreateVector, since it calls CeedVectorCreate, so we couldn't set the precision ahead of time. But since CeedVectorCreate doesn't allocate any memory, we could just have a function for setting the precision of a CeedVector which is meant to be called after CeedVectorCreate. In fact, maybe this is what you always had in mind, since it would avoid breaking the current CeedVectorCreate interface by adding a precision argument?

jedbrown commented 3 years ago

Yeah, I don't see harm in a separate CeedVectorSetPrecision(). I also envision being able to convert one way and the other even after memory has been allocated or there are values present (perhaps with a CeedVectorFlush() to free any internal arrays there were mirroring to a different precision).

nbeams commented 3 years ago

Okay, I'm trying to get a clear picture here of what a "CeedVector-centric" approach to multiprecision could look like, based on discussions so far.

Some potential "general principles":

CeedVectors store information about their current type. The default precision set by CeedVectorCreate could remain CeedScalar (probably double), but it can be changed by calling CeedVectorSetPrecision.
The Ceed[Get/Set/Take/Restore]Array functions remain as-is (handling data of type CeedScalar), as well as CeedSetValue.
Ceed[Get/Set/Take/Restore]Array_f32 functions are added to specifically handle raw float * data. (Perhaps also f64 in the case that someone wants to set CeedScalar to single but use double in specific cases?) These would mostly be intended for users wanting to directly interact with CeedVectors in different precisions.
All of the Get/Set/Take functions perform type conversions if the precision of the vector and the data do not match.
Functions like CeedVectorNorm, CeedVectorScale, etc. will use different implementations based on the current precision of the vector.

Internally in the library, in backends, etc. we of course use these Get/Set functions. These are often in places that we would like to be able to handle different types, e.g. within in an Operator, where we have set an E- or Q-Vector to use some precision determined by its Basis or QFunction objects. To keep the code as general as possible, could we use the suggested Ceed[Get/Set/Take]ArrayUntyped in many (it might end up being most?) places internal to the library, rather than any of the specific _f[prec] versions, or the default version (using CeedScalar)? Or would we likely cause some horrible bug-prone mess? Of course, a more general issue is that if we let E-Vectors have different precisions, that could be tricky for many backends, which use an array of type CeedScalar for the E-data...(and the Vector[Get/Set] functions are used to access this data directly). I think this is probably the trickiest issue to handle...

For the "sub"-operators, we could potentially treat them differently:

Element restrictions deduce which implementation of Apply to use based on input/output types. To change their behavior, you must set the precision of the input/output CeedVectors.
Basis objects and QFunctions can have their own precision set, declaring that you want their computations to be performed in a specific precision, no matter the input/output types. Incorrect input/output types could result in an error or an automatic conversion. (Is this too dangerous?)
CeedOperators will use the precision information of their basis and qfunction objects to set the precision for any internal/temporary vectors (E- or Q-Vectors), which will determine the behavior of their element restrictions.

Does this follow more or less what you were thinking?

jedbrown commented 3 years ago

Yeah, that's about right. I think CeedVectorGetArray (and relatives) can default to CeedVectorGetArray_f64 (or whatever matches CeedScalar) and we can use C11 _Generic to make it automatically match types for anyone using a C11 compiler (and similar with templates for C++), so most callers would not need to know about the actual specializations.

I think the implementation becomes much simpler of Basis and QFunction always see homogeneous types. This saves needing to convert within those compute kernels and limits the number of variants that need to be written/maintained/compiled.

nbeams commented 3 years ago

Yeah, that's about right. I think CeedVectorGetArray (and relatives) can default to CeedVectorGetArray_f64 (or whatever matches CeedScalar) and we can use C11 _Generic to make it automatically match types for anyone using a C11 compiler (and similar with templates for C++), so most callers would not need to know about the actual specializations.

I'm not sure I follow. Wouldn't this mean we have to have three interfaces for CeedVector: one for C++, one for C11, and another for C99 (and perhaps that C99 users won't have multiprecision available to them)? How would this work with GetArray usage inside other library routines, like CeedOperatorApply?

I think the implementation becomes much simpler of Basis and QFunction always see homogeneous types. This saves needing to convert within those compute kernels and limits the number of variants that need to be written/maintained/compiled.

By homogeneous types, do you mean just that input/output vectors are all the same type, or that input/output vectors also match the precision of the computation, as well?

jedbrown commented 3 years ago

For vectors, I mean that we can make CeedVectorGetArray automatically use the precision matching the provided pointer when the caller is C11 or C++, but that C99 callers will either need to call the precision-specific functions like CeedVectorGetArray_f32 or declare the default precision for the compilation unit.

By homogeneous types, I mean that the input and output vectors would all have the same precision. A QFunction is free to convert internally (promoting or demoting) as it likes, though the cost of doing so may be nontrivial.

nbeams commented 3 years ago

But we'd still have to have separate interfaces, correct? And a way for the user to specify at compile time whether to compile the C11 or C++ interface? And I'm still not sure how that would work within the library itself (other library functions that call Get/SetArray).

jedbrown commented 3 years ago

We'd need to implement both the _f32 and _f64 variants (also so they can be bound by other languages), but C11 _Generic works something like

#define CeedVectorGetArray(ceed, mtype, x) _Generic((x), \
    double **: CeedVectorGetArray_f64, \
    float **: CeedVectorGetArray_f32 \
  )(ceed, mtype, x)

and we'd provide inline overloads for C++. Whether to provide these within #include <ceed/ceed.h> can depend on __cplusplus and __STDC__.

It's true that we can't call these within the library (so long as we don't want a hard dependency on C11), but I think it's okay because we can be type-agnostic except for the numerical kernels, which need to specialize anyway.

nbeams commented 3 years ago

Oh! I think "inline" was the crucial thing I was missing for C++. Otherwise, I couldn't figure out how we could use __cplusplus if we want to consider the type of compiler being called for the code using the library rather than while compiling the library itself.

nbeams commented 3 years ago

This will make things more complicated, but I've been thinking a lot more about where the precision conversion should take place and whether or not we should limit the input/output types of the basis to match -- strictly from the GPU point of view (this is all different from CPU, as mentioned above). This all mostly applies to the non-fused backends.

First, some ground rules: we'll assume we're memory bound, since we want to optimize performance for the tensor case. Thus, our main goal is reducing the high precision read/write actions to main memory, which will matter more than the lower-precision computation in terms of speedup.

We will also consider two main cases of mixed-precision computation:

The "high-low case", where most computation and data propagation is happening in the higher precision, as well as the ultimate input/output L-vectors, but one or two sub-operators have a lower computation precision.
The "low-high case", where the vectors are propagated between kernels in the lower precision, but computation happens in a higher precision, to gain the additional memory bandwidth but not lose all of the accuracy of a full-low-precision computation.

For clarity, we'll always use double and float to represent the high and low precisions.

Let's consider the two types of input vectors to a CeedOperator: active and passive.

Passive inputs. In the common use case of a passive input which has an identity restriction and no basis action (e.g., qdata produced by the same backend or with the same stride as the current backend), one conversion at the beginning (through a CeedVector function which performs proper copying and/or casting, depending on if we are keeping the data in both precisions or only one), and storage of the data in the new precision to be used with every application of the operator afterward, doesn't seem like a bad choice.

However, this won't necessarily be the best choice in at least two cases: one, when the input data can change between applications of the operator. Then, with every application, we pay the cost to read from double, convert to float, write to float. When it's time to use the data -- likely in the QFunction -- we read again in float. Whereas, if we didn't officially convert the "whole" vector, but had the QFunction kernel itself convert local data to float, then we only pay one double read cost at the time of QFunction application. (Again, this only applies if we can't just use the vector we already converted to float, because the data changed.)

The second case is the "low-high" case, where the input to the QFunction will be float, but we want to convert it locally to double for the QFunction computation. If the CeedVector is already in single precision, then we don't want to convert to double "up front" when applying the operator, but rather read from float and convert inside the QFunction kernel.

This is assuming that we are still allowed to have different computation and input/output precisions for the QFunction, even if we specify that all inputs and outputs should be the same precision.

Active inputs. Here, the best guideline for memory movement optimization seems to be to have the kernels handle conversion "locally," especially since in general, the vectors being propagated between kernels are internal only (E-Vectors and Q-Vectors). Furthermore, the conversion should be handled at the output of the previous kernel.

Let's consider the case where everything will be in double precision, except the QFunction. The output of the basis kernel for the active input will be an input for the QFunction, so it needs to be in float. If we apply the basis kernel as usual, call a conversion kernel/function (through the libCEED API), then call the QFunction kernel, we have a write in double + read in double/write in float + read in float. If the QFunction handles the conversion locally into workspace data, we still have write in double + read in float. If, however, the internal Q-vector created for the output of this input's E-vector already has its data allocated for float, and the basis kernel performs the write to float at its last step (converting from local double workspace that was used in the computation), then we only have write in float + read in float. Of course, if we want the basis and QFunction to be done in single precision, then we'd want the write to float happening on the output of the element restriction.

Also consider the "low-high" case: here, we never want double precision CeedVectors to exist internally, to avoid all read/writes to main memory in double precision. Instead, each kernel can read in float, use a local double workspace, then convert back and write in float.

This setup comes with several drawbacks, like the proliferation of kernels (compute precision + input precision + output precision) and needing extra local workspace memory. It also means a lot of code changes in every backend. In the MAGMA backend, we have tensor basis kernels with a lot of template parameters, so we're building a lot of kernels already... However, I'm not sure how we can see much gain on GPUs without letting the kernels handle the conversion locally, which requires doing it in the kernels themselves, unless I am missing something.

All of this raises the question, I suppose, on how much this functionality matters for the non-fused GPU backends, since it's all considered in terms of the tensor basis case, where one would generally want to use cuda-gen, if possible. This could also be an argument for limiting the scope of what's allowed in terms of mix and match, as @jedbrown mentioned previously (e.g., basis and QFunction computation precision must match), which gets us back to the case of the element restriction handling the conversion (though it would still need to be inside the kernel). I do think that the "low-high" case might be especially worth pursuing on GPUs, however. For example, in cuda-gen, due to being memory bound, if we could read/write L-vectors in single precision (requiring the user to provide single precision vectors) but do everything else in double, can we achieve similar speedup as single but with less error?

jedbrown commented 3 years ago

Thanks for these detailed thoughts. My opinion is that conversion should happen in restriction and that when a vector (active or passive) does not match the compute precision, the restriction (or transpose) should be explicit even if it's otherwise the identity. We could have an interface to make a CeedOperator coerce its passive inputs so that you can fill it up using whatever precision is convenient and convert so that two copies are not persisted. I consider this a minor optimization.

I think low-high is most important for speed with simple qfunctions. For very expensive qfunctions, there may be benefit of lower precision (but numerical stability is usually not well thought out for very expensive qfunctions). Also, global vectors of double may be necessary in some cases to maintain orthogonality in Krylov methods, though there is some interesting work (including Kasia and Steve's) to improve accuracy with reduced precision Krylov basis.

If we were to decouple Basis and QFunction precision as a libCEED feature, we'd want to be able to do it automatically, not by forcing the conversion into the user's QFunction. But for the specific experiments I'd like to run to test/demonstrate benefits of stable numerics in QFunctions, I'm happy to manage the code that converts inside the QFunction. (That is, QFunction author is entirely responsible for conversions, and libCEED doesn't need to do anything.)

I think ability to convert in Restriction is tractable without code/template explosions and would open up significant research and production options. So I would accept that scope and only consider further extensions if/when a compelling case is mode for them.

nbeams commented 3 years ago

One potential issue with always doing the conversion in the restriction is that minimized memory movement will depend on the mode: for high-low, it's better to convert in the element restriction (write to float, then read from float for basis), but for low-high, it'd be better to wait until the basis application and do a local conversion (write to float, read from float + convert to double in registers vs. write to double, read from double).

That said, I have gone ahead and created two branches where I am playing around with conversion in the restriction for cuda-gen -- not in a way we'd want to merge, but to compare performance. The two versions are low-high and high-low, with everything after the conversion in the restriction happening in the same precision (and with basis data stored in the computation precision, i.e., the "second" precision). I think the exact location of the conversion to higher precision matters less for cuda-gen than the non-fused CUDA backends due to how the element restrictions work (reading into registers, not doing entire elem restriction as a separate kernel). The one snag here is that technically, not quite all computation happens in the "second" precision, due to the atomicAdd calls in the transpose element restriction needing to match the output (L-vector) precision. The addition/reduction of E-vector into L-vector isn't much computation compared to the basis or QFunction, but it is still a different situation than the input element restrictions, which just convert the data.

jedbrown commented 3 years ago

Good points. I'd say the conversion is logically part of the restriction, though an implementation could choose to associate/fuse it differently. I think before we put too much thought into optimizing this pattern, we'll want some real-world tests. On Gauss-Lobatto nodes to Gauss quadrature points, the condition number of the basis interpolation matrix is less than 3 and the basis derivative matrix is around 10. So this part gives up about one digit of accuracy to perform in single precision versus double. I ultimately think we may consider methods that use fixed-point 32-bit integers for some of the intermediate representations, but I think it's premature until we have use cases showing where the extra bits are useful. And I think associating with restriction is entirely reasonable for these tests and performance with cuda-gen won't care.

nbeams commented 3 years ago

After some informal discussions on Slack, I have the following proposal for mixed-precision "rules" for CeedVectors:

For backends with mixed-precision usage, CeedVectors backend data structs will no longer contain CeedScalar * arrays, but rather structs which can hold pointers to arrays in different precisions.

A suggestion for what this struct might be from @jedbrown was:

struct CeedScalarArray { 
size_t length;
double *double_data;
float *float_data;
< maybe other precisions if we add them in the future >
}

The CeedVector struct itself will have a new member for its preferred precision. This member will be a CeedScalarType.
When a vector is accessed for read-only and a precision other than the preferred precision is indicated, a new data array will be allocated in the CeedScalarArray if necessary, and the data from the array corresponding to the preferred precision will be used to copy/convert to the requested precision. (Theoretically, then if we allowed more precisions in the future, a second read-only access in a third precision would still copy from whatever the preferred precision array is.) We may or may not need a way to keep track of which data arrays are currently allocated inside the struct or the vector.
When a vector is accessed for read-write access, the CeedScalarArray will "collapse" down to only the requested access precision. Any other arrays that may be allocated will be deallocated. The vector's preferred precision will be updated to match the requested precision. (Would we want two versions of this function, such that the default version will copy/convert from the preferred precision into the new array, but another version of the function will just allocate new memory to save time when we are just going to overwrite the data anyway?)

However, I realize this will cause problems, perhaps, with memory management. Right now we have a clear distinction in the backend data structs between arrays that have been allocated by libCEED vs given as a pointer, to make sure and deallocate as necessary when the vector is destroyed. A memory management system like the one above could lead to a CeedScalarArray containing a pointer to double data owned elsewhere and a pointer to float data that was allocated by libCEED, for example. So I guess we could keep the array and array_allocated setup in the backend data, but where data in array could be used to create data in array_allocated upon read access request in a different precision? Could there be any problems there that I'm missing?

jedbrown commented 3 years ago

I think this is a good model. Yes, the array_allocated would be per-precision. I don't think eager deallocation is necessary since I think it usually won't change the peak memory use.

jeremylt commented 2 years ago

Rough code for what I was thinking of.

typedef enum {
  CEED_PRECISION_HALF = 0,
  CEED_PRECISION_SINGLE = 1,
  CEED_PRECISION_DOUBLE = 2,
  CEED_PRECISION_DOUBLE_DOUBLE = 3,
} CeedPrecisionType;

...

char *data_ptrs[4];

...

double *double_ptr = data_ptrs[CEED_PRECISION_DOUBLE];

jeremylt commented 2 years ago

// Set two pointers of  user arrays to use
CeedVectorSetArray(v, CEED_MEM_HOST, CEED_PRECISION_HALF, CEED_USE_POINTER, &h_ptr);
CeedVectorSetArray(v, CEED_MEM_HOST, CEED_PRECISION_DOUBLE, CEED_USE_POINTER, &d_ptr);

// Copy to half precision
// Would contain data from d_ptr converted to half
CeedVectorTakeArray(v, CEED_MEM_HOST, CEED_PRECISION_HALF, &h_ptr);

nbeams commented 2 years ago

Did you want to create a separate CeedPrecisionType when we already have CeedScalarType? Do you think they should be separate?

Would a data_ptrs array take the place of each of array/array_allocated (or array/array_borrowed/array_owned as proposed in #853) in the backend's private data for the Vector?

I was thinking something like

(inside backend data)
CeedScalarArray array;
CeedScalarArray array_borrowed;
CeedScalarArray array_owned;

so the double_ptr would be array.double_data.

One reason I thought it would be better to stick with something like the CeedScalarArray object is that we could use it cleanly in, e.g., edata arrays inside Operators (in case not all E-Vectors have to have the same precision, or just so we don't have to have separate pointers for every precision the E-Vectors may use, even if they're all the same). However, we could probably work around this if something like the data_ptrs setup is preferred.

(And maybe we should pick a different name for CeedScalarArray -- like CeedData or something more generic?)

jeremylt commented 2 years ago

Oh, I was just scratching down some ideas. I wouldn't replicate two enums for basically the same thing, I just didn't remember the enum names offhand.

I don't think we would want to pass around an object of different precision pointers divorced from their validity/staleness information, right? There is a lot of state management logic, and I'd think we would want to hide that inside of the CeedVector so that logic doesn't have to leak into other parts of the interface.

Do we want to adopt the convention that all QFunction data will be in the same precision? That would simplify our internal needs.

jedbrown commented 2 years ago

I think it's reasonable for QFunctions to use homogeneous precision because

conversion isn't free, so not great to have occur in an inner loop that needs to vectorize
conversion is clutter in user code and implicit promotion rules may be surprising and not match between languages
it'll be error-prone to ensure safety between the source of a QFunction and creating the object

jeremylt commented 2 years ago

We should probably expect the user to explicitly set the precision of the context data?

jedbrown commented 2 years ago

The context is just bytes. It can contain integers, strings, whatever. I was thinking of the input and output args.

jeremylt commented 2 years ago

Right. It just occurred to me though that we have examples of using CeedScalar in context struct definitions all over the place, but that won't work if the user has context doubles but wants to run experiments with using Operator and QFunction evaluation in single precision. The codegen backends would reasonably expect to recycle the same source for any precision operators, right?

nbeams commented 2 years ago

I don't think we would want to pass around an object of different precision pointers divorced from their validity/staleness information, right? There is a lot of state management logic, and I'd think we would want to hide that inside of the CeedVector so that logic doesn't have to leak into other parts of the interface.

Do you mean re: my comment about edata? I just meant that would be a way to allow the operator backend data to have arrays that could work for any E-Vector precision. We'd probably want to manage it such that each edata entry only ever has one precision that it actually uses, though. Unless that (edata inside an Operator) is the place to use something low-level like data_ptr (an array of them).

jeremylt commented 2 years ago

If I request a single pointer, then the CeedVector's state management logic knows exactly what I what I have access to and if I am reading or writing. But if we hand back multiple pointers, then how does the CeedVector's state management logic know which one I ultimately changed and which one is valid?

I suppose we would be NULLing out the other pointers if we request write access, so it would be obvious which pointer is valid in that case. But if we request read access we would have to explicitly request each precision that needs to be synced to make sure that we have valid data, so I'm not sure how the struct helps?

I guess I'm not seeing what passing this struct around gives us that we wouldn't get from just using an interface with a void * and requested the target precision? It feels like I could write some bugs that would take me a while to find if I had more pointers that I could mix up.

nbeams commented 2 years ago

But if we hand back multiple pointers, then how does the CeedVector's state management logic know which one I ultimately changed and which one is valid?

Hand back to what, where?

Sorry, I was previously experimenting with an interface passing CeedScalarArray, but now I am talking about an interface which uses void * in the Get/Set, but with the CeedVector (and Operator, in edata) is using the struct to store its data internally.

Though, I do like the idea of being able to access the proper precision from a variable rather than a name...(meaning the value given in [ ] could be a variable of the precision type)

nbeams commented 2 years ago

I made some updates to the document based on our discussions today: https://github.com/CEED/libCEED/blob/mixed-precision-design/libceed-mixed-design-decisions.md

I plan to use this branch for further tests or experiments in implementation.

Also, anyone else should feel free to add/edit thoughts in the document on that branch. It's still a very rough draft of ideas and things we've talked about thus far.

jeremylt commented 2 years ago

Are you thinking of this edata

typedef struct {
  bool is_identity_qf, is_identity_restr_op;
  CeedVector
  *e_vecs;   /* E-vectors needed to apply operator (input followed by outputs) */
  CeedScalar **e_data;
  uint64_t *input_state;   /* State counter of inputs */
  CeedVector *e_vecs_in;   /* Input E-vectors needed to apply operator */
  CeedVector *e_vecs_out;  /* Output E-vectors needed to apply operator */
  CeedVector *q_vecs_in;   /* Input Q-vectors needed to apply operator */
  CeedVector *q_vecs_out;  /* Output Q-vectors needed to apply operator */
  CeedInt    num_e_vecs_in;
  CeedInt    num_e_vecs_out;
  CeedInt    qf_num_active_in, qf_num_active_out;
  CeedVector *qf_active_in;
} CeedOperator_Ref;

That probably shouldn't have been put into this struct in the first place. It is a legacy from a hacky approach I had in the Active/Passive refactor years ago before we added reference counting on CeedVector access. Really it should be eleminated.

nbeams commented 2 years ago

To be clear, @jeremylt, was your suggestion of data_ptrs for use of things like E-vector storage, or also for what we should do inside CeedVector itself, rather than having the CeedScalarArray structs?

jeremylt commented 2 years ago

At least in my mind it makes sense to leave the data wrapped in CeedVector objects as much as possible and only handle raw pointers when we must. It isn't clear to me where this CeedScalarArray would help us?

nbeams commented 2 years ago

Well, two reasons, I guess. Originally, I thought we may have a case where the user or backend functions would want to interact directly with the multiprecision data "thing," whatever that was, in a way that would work for the other languages (not just relying on void *). Stylistically, I thought it was nice to have something that seemed less--obscure, maybe? esoteric?--than going with an array of char * to store pointers to floating point data. I know it's a perfectly valid C strategy, but it seems kind of... old-fashioned? Especially for developers who are not coming from a C background.

But being able to access the data through a variable (a value from the type enum) as you showed above would be very helpful for the purposes of type-agnostic backend code. So, I do see that as an advantage to char * as you have proposed.

jeremylt commented 2 years ago

I don't think we should ever allow a user to have more than one pointer at a time for write access. That breaks our knowledge of what data is valid.

Internally, I could see it being useful to arrange how we store the 12 pointers we would need to track inside of a vector in some consistent way or if we wanted to repeat this idea inside of basis data for interp and grad matrices, but I really think that data should only pass outside of the CeedVector in a single pointer to a single array at a time.

For our other languages, I would expect us to wrap the C API in whatever is ideomatic for that language so the user doesn't have to touch the C at all. For Rust we have to be explicit about the types so I would expect separate interfaces for getting a float or double array. I don't know about Python, but I'd expect using overloading on array types in Julia.

nbeams commented 2 years ago

Yeah, I was thinking it might be more useful in the SetArray context (can set multiple data pointers at once if multiple precisions are valid or will be used) but since we're going with "one-per call" and void * that doesn't really matter anymore.

Internally, I could see it being useful to arrange how we store the 12 pointers we would need to track inside of a vector in some consistent way or if we wanted to repeat this idea inside of basis data for interp and grad matrices

I was indeed thinking that we could use whatever storage is in CeedVector inside other components, like basis objects or operators. But we could use the same type of storage across backend objects no matter what type of storage it is (struct or pointer array). And of course, if it's only inside the backend data and implementations, I guess different backends could do different things, though consistency across backends would probably be better.

jedbrown commented 2 years ago

Just a small note that it can be a struct containing an array, but shouldn't be a raw array (less type checking/harder to extend).

nbeams commented 2 years ago

@jedbrown do you mean you prefer sticking with something like CeedScalarArray with members for each currently-possible precision type? Or using a char * array as @jeremylt suggested, but wrapped in a struct? Like

struct CeedScalarArray { 
  size_t length;  //this may or may not stay? maybe just the ptr array?
  char *values[CEED_NUM_PRECISIONS];
}

then if we have a member array of type CeedScalarArray inside the backend data of our CeedVector, we access the precision we want through (float *) array.values[CEED_SCALAR_FP32] (or whatever)

Or some other option?

jedbrown commented 2 years ago

Like this is good in my opinion, I just want it inside a struct (like you suggest) rather than a raw array of pointers. Open question of whether these should be char* or void*. The latter can't participate in pointer arithmetic, which might be good.

char *p = array.values[scalar_type];
p + 100 * bytes_per_scalar; // valid

void *q = array.values[scalar_type];
q + 100 * bytes_per_scalar; // invalid!

void* is appropriate for functions like memcpy. I think I lean toward void* so that doing pointer arithmetic becomes a more deliberate act.

nbeams commented 2 years ago

That's a good point about pointer arithmetic. Maybe an array of void * inside a struct would be better.

nbeams commented 2 years ago

Based on our discussion yesterday (and previous discussion), I decided to make up some schematics to hopefully clearly show my interpretation of some options that have been proposed, and the pros/cons.

In the following schematics:

Solid line arrows indicate that the operator input/output (L-Vector) is in double precision; dotted line arrows = float
The color of the arrow indicates the current precision at that point in the computation (dark blue = double, light blue = float)
There are 4 restrictions based on precision combinations (really, there are 8, since each could also be strided/offset)
The QFunctions are attempting to represent that we have "two" QFunctions, really: the libCEED QFunction kernel/function, and the user QFunction which is the function describing operator physics and which matches the signature of the function pointer type defined in libCEED.

Let's assume we can set the following precision options at an Operator level in libCEED:
- precision of each input/output. This will not be a property of the Vector itself, but rather an indication to the Operator of which precision it should use to interact with the L-Vector. The multiprecision Vector framework we are working out in #948 will ensure that any necessary conversion is done by the Vector, and the Operator can assume that the Vector will have valid data in the precision that has been set.
- Computation precision, either aimed at QFunction with Basis precisions determined by a combination of input/output precisions and QFunction precision (discussed more later), or uniform for all Basis and QFunction computations.

If we start with the simpler option for computation precision, that all Basis operations match the QFunction precision, this is what would happen to float and double inputs/outputs with float computation precision:

libceed-mixed-precision-drawing-flt

And this would be the double precision computation case:

libceed-mixed-precision-drawing-dbl-one

However, as mentioned in the discussion, the basis computations are generally very well-conditioned, and don't need higher precision for the sake of accuracy if the operator input is already in lower precision anyway. My experience is mostly in the GPU backends, where something like this could potentially have a benefit (especially if all inputs/outputs are float, but we want to do QFunction in double due to conditioning concerns):

libceed-mixed-precision-drawing-dbl-two

In this, case the little cvt squares indicate that the libCEED QFunction would handle local conversion to/from the higher precision, as part of what libCEED does "around" the user-defined QFunction. In the GPU backends, this would look like adding a conversion to the device functions which read the Q-Vectors into registers in the correct ordering, and shouldn't really cost much of anything. I'm not sure how easy this would be to implement in the CPU backends (especially without the benefit of JIT).

However, is this too confusing/surprising to have these different options based on configuration combinations of input/output precisions and computation precision? It would also be "asymmetric," as the float computation precision case would never worry about this (we'd always want to convert to float as soon as possible anyway).

CEED / libCEED

Mixed precision functionality #778