nbeams opened this issue 3 years ago
I think qfunction variation is the hardest part. Do we need to support different arguments to the qfunction having different precisions or can they be homogeneous (all arguments in matching precision)? Let's defer this part for later.
As a first step, I would add functions like CeedVectorGetArray_f32(CeedVector, CeedMemType, float **) and have the implementation convert as needed. We can use C11 _Generic so C11 users don't need to specify these types, and C++ overloading to do the same for C++ users. We can have a macro CEED_SCALAR_F32 that users can define before including ceed.h so that C99 callers get a suitable CeedScalar and appropriate specialization. (Perhaps better is to provide #include <ceed/ceed_f32.h> that sets the macro and includes ceed/ceed.h.) This would be for callers who want to choose precision at compile time.
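For concreteness, a minimal sketch of what such a ceed/ceed_f32.h convenience header could look like (hypothetical; neither the header nor the CEED_SCALAR_F32 macro exists yet):

#ifndef CEED_F32_H
#define CEED_F32_H
/* Request single-precision CeedScalar before pulling in the main header */
#define CEED_SCALAR_F32
#include <ceed/ceed.h>
#endif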
We then add a CeedElemRestrictionSetScalarType(CeedElemRestriction, CeedScalarType evec_type) that converts if necessary (when provided with an L-vector). This will require visiting some backends for the cross-type packing, but I don't think it will be a lot of code.

I kinda think CeedBasis public interfaces (setting matrices and doing linear algebra) could just stay double precision all the time, and find it more compelling to possibly add __float128 than to support direct specification in single precision (they can be converted as needed behind the scenes).

We can have CeedVectorGetArrayUntyped(CeedVector, CeedMemType, CeedScalarType, void *) to make life easier for dynamic callers who prefer to cast somewhere else.
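For illustration, a dynamic caller might wrap the proposed untyped accessor like this (CeedVectorGetArrayUntyped and CEED_SCALAR_FP32 are names from this proposal, not existing API):

/* Request host access to x's data viewed as 32-bit floats and cast at the call site */
static float *GetSinglePrecisionView(CeedVector x) {
  void *p = NULL;
  CeedVectorGetArrayUntyped(x, CEED_MEM_HOST, CEED_SCALAR_FP32, &p);
  return (float *)p;
}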
I think qfunction variation is the hardest part. Do we need to support different arguments to the qfunction having different precisions or can they be homogeneous (all arguments in matching precision)? Let's defer this part for later.
Just to make sure I'm clear: are you referring to the user-provided per-point QFunction, the libCEED QFunction interface, or both?
Regarding CeedBasis, when users can tolerate lower precision for their application(s), they could potentially get some nice speedup from single precision CeedBasisApply in the non-tensor case. Specifically, for the MAGMA backend, with the GEMM/batch GEMM computations, the ability to use SGEMM instead of DGEMM could be useful. I don't know if this is considered too narrow of a use case (non-tensor basis and no need for double-precision accuracy in the basis computations)... Also, I guess CeedBasisApply is different than the basis creation functions since it doesn't take any CeedScalar arguments, only CeedVectors. If CeedVector knows about its type, a lot of the "overloading" could be handled by the backends inside the backend-specific implementations. I guess this goes back to the question of whether we want to be able to specify computation precision, or have it deduced from the types of the CeedVectors? I'd argue there would likely be interest in being able to do computations in double while input/output are single, but also to be able to do everything in single.

Edit: was the intention that _f[something] would be added to routines to specify computation precision, but not necessarily input/output?
CeedBasis has several interfaces that take CeedScalar *, and I think there's no harm in making those all double (perhaps with complex variants at some point). It can be converted internally. We have several right now -- e.g., the CeedVector as an L-vector.

We could make CeedElemRestriction have no opinion about types, in which case we would convert as needed. Then L-vectors, CeedBasis, and CeedQFunction would have a designated precision. The precision of E-vectors would be inferred to match the compute precision of CeedBasis (converting as needed to/from the L-vector precision).
A simplification would be to make CeedBasis precision always match the QFunction precision.

To make this incremental, I think it will be simpler to start with the L-vectors and get that merged, then work inward.
I'm not sure I understand what your list is intended to represent. Cases where we use CeedScalar * directly as a type? CeedBasisApply only takes CeedVectors for input/output. E-Vectors are also represented as CeedVectors. So I must be missing something.

I like the idea of the element restriction being able to infer the output (or input, in transpose mode) type, but I'm not sure how it would work in practice. The CeedElemRestriction doesn't know about the CeedBasis or CeedQFunction, so it can't know which precision they use unless we tell it. I think your initial suggestion of a way to set the precision of the element restriction may be necessary to keep the current separation between/independence of CeedOperator "pieces" (element restriction, basis, qfunction).
The list included internal representations (E-vectors), and I intended it to identify each place where precision can be varied in G^T B^T D B G. I think it's a significant simplification in the code if we can make the precision of B match the precision of the Q-function D -- it still decouples L-vector precision from compute precision. Note that each (active or passive, input or output) field provided to a CeedOperator could have a different precision. This does restrict algorithmic space somewhat, but I don't know how serious it is. On CPUs, I think it's not a big deal because E-vectors only ever exist in cache. It may be a lot more significant for GPU kernels that need to store complete E-vectors.

Backends can always fuse CeedElemRestriction into whatever else they do. The CPU backends, for example, never have an actual E-vector, just a bit of workspace for one element (/serial) or a batch (/blocked). That said, we have two vectors, so if those vectors have preferred precision set, then all the necessary information is there:
int CeedElemRestrictionApply(CeedElemRestriction rstr,
CeedTransposeMode t_mode, CeedVector u, CeedVector ru, CeedRequest *request);
The CeedOperator backend gets to choose how to represent the E-vector data, so even if it just calls CeedElemRestrictionApply, that function knows what precisions to work with.
Right, I guess this raises another important question: how much do we want to limit the multiprecision functionality to things that happen inside a CeedOperatorApply? I guess if CeedVector provides a function for converting to other precisions, that would cover any cases where a user (or library) wants to do a stand-alone CeedElemRestrictionApply for something and have the output in a different precision -- should such a situation arise. Or we could specify that you should set the precision of the input/output vectors ahead of time, as suggested.

Though even if we are only considering the case where the Operator is setting the preferred element restriction output precision based on all the information it has, the function that lets you set the element restriction precision might still make more sense than relying on the input/output CeedVectors having their precisions set. I'm thinking this would be more consistent across backends, including those that never set up E-Vectors as CeedVector objects, like cuda-gen -- which does still get information from the element restrictions while building the code string, despite not creating E-Vectors (or using CeedElemRestrictionApply).
I figured CeedVectorGetArray() and the like would be able to mirror data to any requested precision, but CeedElemRestrictionApply would inspect the available precisions of the input vector and the preferred precision of the output vector and apply without needing to convert the input.

I'd think cuda-gen could still propagate the desired (ephemeral) E-vector precisions from the Basis. One thing I'd like to facilitate is having a residual in double precision and Jacobian action in single, or the Jacobian in double and a preconditioner in single.
I'd think cuda-gen could still propagate the desired (ephemeral) E-vector precisions from the Basis.
Yes, it could, I just thought it might be preferable to have more consistency across the CUDA/HIP backends, such that they could all get/set this information in the same way. But cuda(hip)-gen is so different from the other backends already that perhaps that's not really a selling point.
Do you think it'll be hard to handle consistently if CeedElemRestriction is precision-agnostic (uses its arguments rather than set as state)?
My concern is that when we wire up several different operators, we'd like for the data (integers) of a CeedElemRestriction to be shared between different precisions. That's easier if it's just the same object, rather than needing to reference count the backing storage.
Do you think it'll be hard to handle consistently if CeedElemRestriction is precision-agnostic (uses its arguments rather than set as state)?
I've been thinking about this, and I haven't yet come up with any problems (not that that means there couldn't be any, of course). I was initially concerned about CeedElemRestrictionCreateVector, since it calls CeedVectorCreate, so we couldn't set the precision ahead of time. But since CeedVectorCreate doesn't allocate any memory, we could just have a function for setting the precision of a CeedVector which is meant to be called after CeedVectorCreate. In fact, maybe this is what you always had in mind, since it would avoid breaking the current CeedVectorCreate interface by adding a precision argument?
Yeah, I don't see harm in a separate CeedVectorSetPrecision(). I also envision being able to convert one way and the other even after memory has been allocated or there are values present (perhaps with a CeedVectorFlush() to free any internal arrays that were mirroring to a different precision).
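As a rough usage sketch of these proposed functions (CeedVectorSetPrecision and CeedVectorFlush do not exist yet, and CEED_SCALAR_FP32 is assumed to be a CeedScalarType value):

#include <ceed/ceed.h>

int Example(Ceed ceed) {
  CeedVector x;
  CeedVectorCreate(ceed, 100, &x);
  /* Proposed: choose the storage precision after creation, before memory is allocated */
  CeedVectorSetPrecision(x, CEED_SCALAR_FP32);
  CeedVectorSetValue(x, 1.0); /* values would now be held internally as float */
  /* ... use x; later, free any internal arrays mirrored to other precisions */
  CeedVectorFlush(x);
  return CeedVectorDestroy(&x);
}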
Okay, I'm trying to get a clear picture here of what a "CeedVector-centric" approach to multiprecision could look like, based on discussions so far.

Some potential "general principles":

- CeedVectors store information about their current type. The default precision set by CeedVectorCreate could remain CeedScalar (probably double), but it can be changed by calling CeedVectorSetPrecision.
- Ceed[Get/Set/Take/Restore]Array functions remain as-is (handling data of type CeedScalar), as well as CeedVectorSetValue.
- Ceed[Get/Set/Take/Restore]Array_f32 functions are added to specifically handle raw float * data. (Perhaps also f64, in the case that someone wants to set CeedScalar to single but use double in specific cases?) These would mostly be intended for users wanting to directly interact with CeedVectors in different precisions.
- Get/Set/Take functions perform type conversions if the precision of the vector and the data do not match.
- CeedVectorNorm, CeedVectorScale, etc. will use different implementations based on the current precision of the vector.
- Internally in the library, in backends, etc., we of course use these Get/Set functions. These are often in places where we would like to be able to handle different types, e.g. within an Operator, where we have set an E- or Q-Vector to use some precision determined by its Basis or QFunction objects. To keep the code as general as possible, could we use the suggested Ceed[Get/Set/Take]ArrayUntyped in many (it might end up being most?) places internal to the library, rather than any of the specific _f[prec] versions or the default version (using CeedScalar)? Or would we likely cause some horrible bug-prone mess?
- Of course, a more general issue is that if we let E-Vectors have different precisions, that could be tricky for many backends, which use an array of type CeedScalar for the E-data (and the Vector[Get/Set] functions are used to access this data directly). I think this is probably the trickiest issue to handle...
For the "sub"-operators, we could potentially treat them differently:
Apply
to use based on input/output types. To change their behavior, you must set the precision of the input/output CeedVector
s.CeedOperator
s will use the precision information of their basis and qfunction objects to set the precision for any internal/temporary vectors (E-
or Q-Vector
s), which will determine the behavior of their element restrictions. Does this follow more or less what you were thinking?
Yeah, that's about right. I think CeedVectorGetArray (and relatives) can default to CeedVectorGetArray_f64 (or whatever matches CeedScalar) and we can use C11 _Generic to make it automatically match types for anyone using a C11 compiler (and similar with templates for C++), so most callers would not need to know about the actual specializations.
I think the implementation becomes much simpler if Basis and QFunction always see homogeneous types. This saves needing to convert within those compute kernels and limits the number of variants that need to be written/maintained/compiled.
Yeah, that's about right. I think CeedVectorGetArray (and relatives) can default to CeedVectorGetArray_f64 (or whatever matches CeedScalar) and we can use C11 _Generic to make it automatically match types for anyone using a C11 compiler (and similar with templates for C++), so most callers would not need to know about the actual specializations.
I'm not sure I follow. Wouldn't this mean we have to have three interfaces for CeedVector: one for C++, one for C11, and another for C99 (and perhaps that C99 users won't have multiprecision available to them)? How would this work with GetArray usage inside other library routines, like CeedOperatorApply?
I think the implementation becomes much simpler if Basis and QFunction always see homogeneous types. This saves needing to convert within those compute kernels and limits the number of variants that need to be written/maintained/compiled.
By homogeneous types, do you mean just that input/output vectors are all the same type, or that input/output vectors also match the precision of the computation?
For vectors, I mean that we can make CeedVectorGetArray automatically use the precision matching the provided pointer when the caller is C11 or C++, but that C99 callers will either need to call the precision-specific functions like CeedVectorGetArray_f32 or declare the default precision for the compilation unit.

By homogeneous types, I mean that the input and output vectors would all have the same precision. A QFunction is free to convert internally (promoting or demoting) as it likes, though the cost of doing so may be nontrivial.
But we'd still have to have separate interfaces, correct? And a way for the user to specify at compile time whether to compile the C11 or C++ interface? And I'm still not sure how that would work within the library itself (other library functions that call Get/SetArray).
We'd need to implement both the _f32 and _f64 variants (also so they can be bound by other languages), but C11 _Generic works something like
#define CeedVectorGetArray(ceed, mtype, x) _Generic((x), \
double **: CeedVectorGetArray_f64, \
float **: CeedVectorGetArray_f32 \
)(ceed, mtype, x)
and we'd provide inline overloads for C++. Whether to provide these within #include <ceed/ceed.h> can depend on __cplusplus and __STDC__.
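To make the dispatch concrete, here is a sketch of how ceed.h could branch (the guard logic is illustrative only; it assumes CeedVectorGetArray_f32/_f64 exist):

#if defined(__cplusplus)
/* C++ callers get inline overloads */
static inline int CeedVectorGetArray(CeedVector vec, CeedMemType mem_type, double **array) {
  return CeedVectorGetArray_f64(vec, mem_type, array);
}
static inline int CeedVectorGetArray(CeedVector vec, CeedMemType mem_type, float **array) {
  return CeedVectorGetArray_f32(vec, mem_type, array);
}
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
/* C11 callers get type-generic dispatch on the pointer argument */
#define CeedVectorGetArray(vec, mem_type, array) _Generic((array), \
  double **: CeedVectorGetArray_f64,                               \
  float **: CeedVectorGetArray_f32)(vec, mem_type, array)
#else
/* C99 fallback: default to the precision matching CeedScalar */
#define CeedVectorGetArray CeedVectorGetArray_f64
#endif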
It's true that we can't call these within the library (so long as we don't want a hard dependency on C11), but I think it's okay because we can be type-agnostic except for the numerical kernels, which need to specialize anyway.
Oh! I think "inline" was the crucial thing I was missing for C++. Otherwise, I couldn't figure out how we could use __cplusplus if we want to consider the compiler used for the code that uses the library rather than the one compiling the library itself.
This will make things more complicated, but I've been thinking a lot more about where the precision conversion should take place and whether or not we should limit the input/output types of the basis to match -- strictly from the GPU point of view (this is all different from CPU, as mentioned above). This all mostly applies to the non-fused backends.
First, some ground rules: we'll assume we're memory bound, since we want to optimize performance for the tensor case. Thus, our main goal is reducing the high precision read/write actions to main memory, which will matter more than the lower-precision computation in terms of speedup.
We will also consider two main cases of mixed-precision computation:

- "high-low": input/output data in the higher precision, with (some of) the computation done in the lower precision
- "low-high": input/output data in the lower precision, with (some of) the computation done in the higher precision

For clarity, we'll always use double and float to represent the high and low precisions.
Let's consider the two types of input vectors to a CeedOperator: active and passive.

For passive inputs, an up-front conversion (through a CeedVector function which performs proper copying and/or casting, depending on whether we are keeping the data in both precisions or only one), with storage of the data in the new precision to be used with every application of the operator afterward, doesn't seem like a bad choice. However, this won't necessarily be the best choice in at least two cases: one, when the input data can change between applications of the operator. Then, with every application, we pay the cost to read from double, convert to float, write to float. When it's time to use the data -- likely in the QFunction -- we read again in float. Whereas, if we didn't officially convert the "whole" vector, but had the QFunction kernel itself convert local data to float, then we only pay one double read cost at the time of QFunction application. (Again, this only applies if we can't just use the vector we already converted to float, because the data changed.)
The second case is the "low-high" case, where the input to the QFunction will be float, but we want to convert it locally to double for the QFunction computation. If the CeedVector is already in single precision, then we don't want to convert to double "up front" when applying the operator, but rather read from float and convert inside the QFunction kernel. This is assuming that we are still allowed to have different computation and input/output precisions for the QFunction, even if we specify that all inputs and outputs should be the same precision.
Let's consider the case where everything will be in double precision, except the QFunction. The output of the basis kernel for the active input will be an input for the QFunction, so it needs to be in float. If we apply the basis kernel as usual, call a conversion kernel/function (through the libCEED API), then call the QFunction kernel, we have a write in double + read in double/write in float + read in float. If the QFunction handles the conversion locally into workspace data, we still have write in double + read in float. If, however, the internal Q-vector created for the output of this input's E-vector already has its data allocated for float, and the basis kernel performs the write to float at its last step (converting from local double workspace that was used in the computation), then we only have write in float + read in float. Of course, if we want the basis and QFunction to be done in single precision, then we'd want the write to float happening on the output of the element restriction.
Also consider the "low-high" case: here, we never want double precision CeedVectors to exist internally, to avoid all reads/writes to main memory in double precision. Instead, each kernel can read in float, use a local double workspace, then convert back and write in float.
This setup comes with several drawbacks, like the proliferation of kernels (compute precision + input precision + output precision) and needing extra local workspace memory. It also means a lot of code changes in every backend. In the MAGMA backend, we have tensor basis kernels with a lot of template parameters, so we're building a lot of kernels already... However, I'm not sure how we can see much gain on GPUs without letting the kernels handle the conversion locally, which requires doing it in the kernels themselves, unless I am missing something.
All of this raises the question, I suppose, on how much this functionality matters for the non-fused GPU backends, since it's all considered in terms of the tensor basis case, where one would generally want to use cuda-gen, if possible. This could also be an argument for limiting the scope of what's allowed in terms of mix and match, as @jedbrown mentioned previously (e.g., basis and QFunction computation precision must match), which gets us back to the case of the element restriction handling the conversion (though it would still need to be inside the kernel). I do think that the "low-high" case might be especially worth pursuing on GPUs, however. For example, in cuda-gen, due to being memory bound, if we could read/write L-vectors in single precision (requiring the user to provide single precision vectors) but do everything else in double, can we achieve similar speedup as single but with less error?
Thanks for these detailed thoughts. My opinion is that conversion should happen in the restriction, and that when a vector (active or passive) does not match the compute precision, the restriction (or transpose) should be explicit even if it's otherwise the identity. We could have an interface to make a CeedOperator coerce its passive inputs so that you can fill it up using whatever precision is convenient and convert so that two copies are not persisted. I consider this a minor optimization.
I think low-high is most important for speed with simple qfunctions. For very expensive qfunctions, there may be benefit to lower precision (but numerical stability is usually not well thought out for very expensive qfunctions). Also, global vectors of double may be necessary in some cases to maintain orthogonality in Krylov methods, though there is some interesting work (including Kasia and Steve's) to improve accuracy with a reduced-precision Krylov basis.
If we were to decouple Basis and QFunction precision as a libCEED feature, we'd want to be able to do it automatically, not by forcing the conversion into the user's QFunction. But for the specific experiments I'd like to run to test/demonstrate benefits of stable numerics in QFunctions, I'm happy to manage the code that converts inside the QFunction. (That is, QFunction author is entirely responsible for conversions, and libCEED doesn't need to do anything.)
I think the ability to convert in the Restriction is tractable without code/template explosions and would open up significant research and production options. So I would accept that scope and only consider further extensions if/when a compelling case is made for them.
One potential issue with always doing the conversion in the restriction is that the way to minimize memory movement depends on the mode: for high-low, it's better to convert in the element restriction (write to float, then read from float for basis), but for low-high, it'd be better to wait until the basis application and do a local conversion (write to float, read from float + convert to double in registers vs. write to double, read from double).
That said, I have gone ahead and created two branches where I am playing around with conversion in the restriction for cuda-gen -- not in a way we'd want to merge, but to compare performance. The two versions are low-high and high-low, with everything after the conversion in the restriction happening in the same precision (and with basis data stored in the computation precision, i.e., the "second" precision). I think the exact location of the conversion to higher precision matters less for cuda-gen than the non-fused CUDA backends due to how the element restrictions work (reading into registers, not doing entire elem restriction as a separate kernel). The one snag here is that technically, not quite all computation happens in the "second" precision, due to the atomicAdd calls in the transpose element restriction needing to match the output (L-vector) precision. The addition/reduction of E-vector into L-vector isn't much computation compared to the basis or QFunction, but it is still a different situation than the input element restrictions, which just convert the data.
Good points. I'd say the conversion is logically part of the restriction, though an implementation could choose to associate/fuse it differently. I think before we put too much thought into optimizing this pattern, we'll want some real-world tests. On Gauss-Lobatto nodes to Gauss quadrature points, the condition number of the basis interpolation matrix is less than 3 and the basis derivative matrix is around 10. So this part gives up about one digit of accuracy to perform in single precision versus double. I ultimately think we may consider methods that use fixed-point 32-bit integers for some of the intermediate representations, but I think it's premature until we have use cases showing where the extra bits are useful. And I think associating with restriction is entirely reasonable for these tests and performance with cuda-gen won't care.
After some informal discussions on Slack, I have the following proposal for mixed-precision "rules" for CeedVectors:

CeedVectors' backend data structs will no longer contain CeedScalar * arrays, but rather structs which can hold pointers to arrays in different precisions:

struct CeedScalarArray {
  size_t length;
  double *double_data;
  float *float_data;
  /* maybe other precisions if we add them in the future */
};
The CeedVector struct itself will have a new member for its preferred precision. This member will be a CeedScalarType.

A read-only access request in a precision other than the preferred one will allocate memory in the CeedScalarArray if necessary, and the data from the array corresponding to the preferred precision will be used to copy/convert to the requested precision. (Theoretically, then, if we allowed more precisions in the future, a second read-only access in a third precision would still copy from whatever the preferred precision array is.) We may or may not need a way to keep track of which data arrays are currently allocated inside the struct or the vector.

On a write access request in a different precision, the CeedScalarArray will "collapse" down to only the requested access precision. Any other arrays that may be allocated will be deallocated. The vector's preferred precision will be updated to match the requested precision. (Would we want two versions of this function, such that the default version will copy/convert from the preferred precision into the new array, but another version of the function will just allocate new memory to save time when we are just going to overwrite the data anyway?)

However, I realize this will cause problems, perhaps, with memory management. Right now we have a clear distinction in the backend data structs between arrays that have been allocated by libCEED vs. given as a pointer, to make sure we deallocate as necessary when the vector is destroyed. A memory management system like the one above could lead to a CeedScalarArray containing a pointer to double data owned elsewhere and a pointer to float data that was allocated by libCEED, for example. So I guess we could keep the array and array_allocated setup in the backend data, but where data in array could be used to create data in array_allocated upon a read access request in a different precision? Could there be any problems there that I'm missing?
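A minimal sketch of the lazy copy/convert step on a read request in a non-preferred precision, assuming the CeedScalarArray struct above with double as the preferred precision (allocation tracking and error handling omitted):

#include <stdlib.h>

static void SyncDoubleToFloat(struct CeedScalarArray *a) {
  /* Mirror the preferred (double) data into the float array, allocating lazily */
  if (!a->float_data) a->float_data = malloc(a->length * sizeof(float));
  for (size_t i = 0; i < a->length; i++) a->float_data[i] = (float)a->double_data[i];
}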
I think this is a good model. Yes, the array_allocated would be per-precision. I don't think eager deallocation is necessary since I think it usually won't change the peak memory use.
Rough code for what I was thinking of.
typedef enum {
CEED_PRECISION_HALF = 0,
CEED_PRECISION_SINGLE = 1,
CEED_PRECISION_DOUBLE = 2,
CEED_PRECISION_DOUBLE_DOUBLE = 3,
} CeedPrecisionType;
...
char *data_ptrs[4];
...
double *double_ptr = (double *)data_ptrs[CEED_PRECISION_DOUBLE];
// Set two pointers of user arrays to use
CeedVectorSetArray(v, CEED_MEM_HOST, CEED_PRECISION_HALF, CEED_USE_POINTER, &h_ptr);
CeedVectorSetArray(v, CEED_MEM_HOST, CEED_PRECISION_DOUBLE, CEED_USE_POINTER, &d_ptr);
// Copy to half precision
// Would contain data from d_ptr converted to half
CeedVectorTakeArray(v, CEED_MEM_HOST, CEED_PRECISION_HALF, &h_ptr);
Did you want to create a separate CeedPrecisionType when we already have CeedScalarType? Do you think they should be separate?

Would a data_ptrs array take the place of each of array/array_allocated (or array/array_borrowed/array_owned as proposed in #853) in the backend's private data for the Vector?
I was thinking something like
(inside backend data)
CeedScalarArray array;
CeedScalarArray array_borrowed;
CeedScalarArray array_owned;
so the double_ptr would be array.double_data.
One reason I thought it would be better to stick with something like the CeedScalarArray object is that we could use it cleanly in, e.g., edata arrays inside Operators (in case not all E-Vectors have to have the same precision, or just so we don't have to have separate pointers for every precision the E-Vectors may use, even if they're all the same). However, we could probably work around this if something like the data_ptrs setup is preferred.

(And maybe we should pick a different name for CeedScalarArray -- like CeedData or something more generic?)
Oh, I was just scratching down some ideas. I wouldn't replicate two enums for basically the same thing, I just didn't remember the enum names offhand.
I don't think we would want to pass around an object of different precision pointers divorced from their validity/staleness information, right? There is a lot of state management logic, and I'd think we would want to hide that inside of the CeedVector so that logic doesn't have to leak into other parts of the interface.
Do we want to adopt the convention that all QFunction data will be in the same precision? That would simplify our internal needs.
I think it's reasonable for QFunctions to use homogeneous precision, for the reasons discussed above (it avoids conversions inside the compute kernels and limits the number of variants to write/maintain/compile).
We should probably expect the user to explicitly set the precision of the context data?
The context is just bytes. It can contain integers, strings, whatever. I was thinking of the input and output args.
Right. It just occurred to me though that we have examples of using CeedScalar in context struct definitions all over the place, but that won't work if the user has context doubles but wants to run experiments with using Operator and QFunction evaluation in single precision. The codegen backends would reasonably expect to recycle the same source for any precision operators, right?
I don't think we would want to pass around an object of different precision pointers divorced from their validity/staleness information, right? There is a lot of state management logic, and I'd think we would want to hide that inside of the CeedVector so that logic doesn't have to leak into other parts of the interface.
Do you mean re: my comment about edata? I just meant that would be a way to allow the operator backend data to have arrays that could work for any E-Vector precision. We'd probably want to manage it such that each edata entry only ever has one precision that it actually uses, though.

Unless that (edata inside an Operator) is the place to use something low-level like data_ptr (an array of them).
If I request a single pointer, then the CeedVector's state management logic knows exactly what I have access to and whether I am reading or writing. But if we hand back multiple pointers, then how does the CeedVector's state management logic know which one I ultimately changed and which one is valid?
I suppose we would be NULLing out the other pointers if we request write access, so it would be obvious which pointer is valid in that case. But if we request read access we would have to explicitly request each precision that needs to be synced to make sure that we have valid data, so I'm not sure how the struct helps?

I guess I'm not seeing what passing this struct around gives us that we wouldn't get from just using an interface with a void * and requesting the target precision? It feels like I could write some bugs that would take me a while to find if I had more pointers that I could mix up.
But if we hand back multiple pointers, then how does the CeedVector's state management logic know which one I ultimately changed and which one is valid?
Hand back to what, where?
Sorry, I was previously experimenting with an interface passing CeedScalarArray, but now I am talking about an interface which uses void * in the Get/Set, with the CeedVector (and Operator, in edata) using the struct to store its data internally.

Though, I do like the idea of being able to access the proper precision from a variable rather than a name... (meaning the value given in [ ] could be a variable of the precision type)
I made some updates to the document based on our discussions today: https://github.com/CEED/libCEED/blob/mixed-precision-design/libceed-mixed-design-decisions.md
I plan to use this branch for further tests or experiments in implementation.
Also, anyone else should feel free to add/edit thoughts in the document on that branch. It's still a very rough draft of ideas and things we've talked about thus far.
Are you thinking of this edata?
typedef struct {
bool is_identity_qf, is_identity_restr_op;
CeedVector *e_vecs; /* E-vectors needed to apply operator (input followed by outputs) */
CeedScalar **e_data;
uint64_t *input_state; /* State counter of inputs */
CeedVector *e_vecs_in; /* Input E-vectors needed to apply operator */
CeedVector *e_vecs_out; /* Output E-vectors needed to apply operator */
CeedVector *q_vecs_in; /* Input Q-vectors needed to apply operator */
CeedVector *q_vecs_out; /* Output Q-vectors needed to apply operator */
CeedInt num_e_vecs_in;
CeedInt num_e_vecs_out;
CeedInt qf_num_active_in, qf_num_active_out;
CeedVector *qf_active_in;
} CeedOperator_Ref;
That probably shouldn't have been put into this struct in the first place. It is a legacy from a hacky approach I had in the Active/Passive refactor years ago, before we added reference counting on CeedVector access. Really, it should be eliminated.
To be clear, @jeremylt, was your suggestion of data_ptrs for use in things like E-vector storage, or also for what we should do inside CeedVector itself, rather than having the CeedScalarArray structs?
At least in my mind it makes sense to leave the data wrapped in CeedVector objects as much as possible and only handle raw pointers when we must. It isn't clear to me where this CeedScalarArray would help us?
Well, two reasons, I guess. Originally, I thought we may have a case where the user or backend functions would want to interact directly with the multiprecision data "thing," whatever that was, in a way that would work for the other languages (not just relying on void *). Stylistically, I thought it was nice to have something that seemed less--obscure, maybe? esoteric?--than going with an array of char * to store pointers to floating point data. I know it's a perfectly valid C strategy, but it seems kind of... old-fashioned? Especially for developers who are not coming from a C background.

But being able to access the data through a variable (a value from the type enum) as you showed above would be very helpful for the purposes of type-agnostic backend code. So, I do see that as an advantage to the char * approach you have proposed.
I don't think we should ever allow a user to have more than one pointer at a time for write access. That breaks our knowledge of what data is valid.
Internally, I could see it being useful to arrange how we store the 12 pointers we would need to track inside of a vector in some consistent way or if we wanted to repeat this idea inside of basis data for interp and grad matrices, but I really think that data should only pass outside of the CeedVector in a single pointer to a single array at a time.
For our other languages, I would expect us to wrap the C API in whatever is idiomatic for that language so the user doesn't have to touch the C at all. For Rust we have to be explicit about the types, so I would expect separate interfaces for getting a float or double array. I don't know about Python, but I'd expect using overloading on array types in Julia.
Yeah, I was thinking it might be more useful in the SetArray context (you can set multiple data pointers at once if multiple precisions are valid or will be used), but since we're going with "one per call" and void *, that doesn't really matter anymore.
Internally, I could see it being useful to arrange how we store the 12 pointers we would need to track inside of a vector in some consistent way or if we wanted to repeat this idea inside of basis data for interp and grad matrices
I was indeed thinking that we could use whatever storage is in CeedVector inside other components, like basis objects or operators. But we could use the same type of storage across backend objects no matter what type of storage it is (struct or pointer array). And of course, if it's only inside the backend data and implementations, I guess different backends could do different things, though consistency across backends would probably be better.
Just a small note that it can be a struct containing an array, but shouldn't be a raw array (less type checking/harder to extend).
@jedbrown do you mean you prefer sticking with something like CeedScalarArray with members for each currently-possible precision type? Or using a char * array as @jeremylt suggested, but wrapped in a struct? Like
struct CeedScalarArray {
size_t length; //this may or may not stay? maybe just the ptr array?
char *values[CEED_NUM_PRECISIONS];
};
then if we have a member array of type CeedScalarArray inside the backend data of our CeedVector, we access the precision we want through

(float *) array.values[CEED_SCALAR_FP32]

(or whatever). Or some other option?
Like this is good in my opinion, I just want it inside a struct (like you suggest) rather than a raw array of pointers. Open question of whether these should be char * or void *. The latter can't participate in pointer arithmetic, which might be good.
char *p = array.values[scalar_type];
p + 100 * bytes_per_scalar; // valid
void *q = array.values[scalar_type];
q + 100 * bytes_per_scalar; // invalid!
void * is appropriate for functions like memcpy. I think I lean toward void * so that doing pointer arithmetic becomes a more deliberate act.
That's a good point about pointer arithmetic. Maybe an array of void * inside a struct would be better.
Based on our discussion yesterday (and previous discussion), I decided to make up some schematics to hopefully clearly show my interpretation of some options that have been proposed, and the pros/cons.
In the following schematics, the QFunctions are attempting to represent that we have "two" QFunctions, really: the libCEED QFunction kernel/function, and the user QFunction, which is the function describing the operator physics and which matches the signature of the function pointer type defined in libCEED.
Let's assume we can set the following precision options at an Operator level in libCEED: the precision of the input/output vectors, and the computation precision (used by the Basis and QFunction).
If we start with the simpler option for computation precision, that all Basis operations match the QFunction precision, this is what would happen to float and double inputs/outputs with float computation precision:
And this would be the double precision computation case:
However, as mentioned in the discussion, the basis computations are generally very well-conditioned, and don't need higher precision for the sake of accuracy if the operator input is already in lower precision anyway. My experience is mostly in the GPU backends, where something like this could potentially have a benefit (especially if all inputs/outputs are float, but we want to do QFunction in double due to conditioning concerns):
In this case, the little cvt squares indicate that the libCEED QFunction would handle local conversion to/from the higher precision, as part of what libCEED does "around" the user-defined QFunction. In the GPU backends, this would look like adding a conversion to the device functions which read the Q-Vectors into registers in the correct ordering, and shouldn't really cost much of anything. I'm not sure how easy this would be to implement in the CPU backends (especially without the benefit of JIT).
However, is this too confusing/surprising to have these different options based on configuration combinations of input/output precisions and computation precision? It would also be "asymmetric," as the float computation precision case would never worry about this (we'd always want to convert to float as soon as possible anyway).
This issue is for discussions related to adding mixed precision capabilities to libCEED, including:
An idea from @jedbrown was to handle precision conversions through the CeedVector objects, which could be modified to store multiple pointers for different precisions, along with a tag to indicate which precision that vector's data currently uses.