SparseBLAS / spblas-reference

BSD 3-Clause "New" or "Revised" License

info type discussions for mem-based optimizations #1

Open spencerpatty opened 12 months ago

spencerpatty commented 12 months ago

https://github.com/SparseBLAS/spblas-reference/blob/093068008c1f3e031fb43133d1a66051aa04b80a/notes/spmv.hpp#L10-L24

I like this idea of having an info type that is directly associated with some matrix structure and which is filled with zero or more inspection-based optimizations (meaning it houses "stateful + read-only" optimizations). I wonder if it would be possible to have our multiply functions take in some hybrid matrix_obj object which consists of either a matrix_view on its own, or a matrix_view plus an associated matrix_info_t type -- used in some way like the following snippet

csr_view<T,I,O> A(...);
matrix_info_t A_info(...);
multiply_inspect( matrix_obj{A, A_info}, descriptor, x, y, /*backend stuff*/ );
multiply_execute( matrix_obj{A, A_info}, descriptor, x, y, /*backend stuff*/ );

or we might also skip the inspection at the cost of lower performance...

csr_view<T,I,O> A(...);
multiply_execute( matrix_obj{A}, descriptor, x, y, /*backend stuff*/ );

The benefit of this is that when we look at the sparse * sparse operation, we could have an A_info and a B_info that may contain useful (read-only stateful) information about A and B while creating C, and then there may be another multistage_info_t which is particular to the multi-stage operation itself (stateful + read/write data)

csr_view<T,I,O> A(...);
matrix_info_t A_info(...);

csr_view<T,I,O> B(...);
matrix_info_t B_info(...);

csr_view<T,I,O> C(...);

multiply_info_t mult_info{}; // C = A * B^T

multiply_inspect(matrix_obj{C}, matrix_obj{A, A_info}, transpose(matrix_obj{B, B_info}), desc, /*backend stuff*/ ); // fills A_info and/or B_info
multiply_execute_stage1( matrix_obj{C}, matrix_obj{A, A_info}, transpose(matrix_obj{B, B_info}), mult_info, /*backend stuff*/ ); // fills mult_info and C
multiply_execute_stage2( matrix_obj{C}, matrix_obj{A, A_info}, transpose(matrix_obj{B, B_info}), mult_info, /*backend stuff*/ ); // fills mult_info and C
multiply_execute_stage3( matrix_obj{C}, matrix_obj{A, A_info}, transpose(matrix_obj{B, B_info}), mult_info, /*backend stuff*/ ); // fills mult_info and C

mult_info might house the stateful + read/write optimizations pertaining to the multi-stage multiply process ... A_info and B_info might house the stateful + read-only optimizations about A and/or B ...

Does this idea make sense? Does anyone see any usability issues? Is it too ugly? I worry that we will have too many overloads if we have A and possibly A_info etc. separated as inputs ... And this approach allows us to distinguish between matrix inputs + info and operational info data ...

spencerpatty commented 12 months ago

@BenBrock is there a nice C++ way to define such a "matrix_obj" input that could pull together a view, a view + info, a matrix object, or other inputs that represent matrix data + optional data? :)

BenBrock commented 12 months ago

Yes, I think we could do this and still have pretty nice-looking syntax. I'm not sure off the top of my head what the best way to keep the complexity down is, but I'll think about it.

Just so I understand fully---the idea is to add a matrix-specific inspect in addition to the operation inspect? (Presumably if you do matrix-specific inspections, you'd also want to pass those to the operation inspect as well?)

csr_matrix<float> a(/* pointers to data, matrix shape, etc. */);
auto a_info = inspect(a, /* user-provided hints */);

auto m_info = multiply_inspect(bind_info{a, a_info}, /* b, c, etc. */);
multiply_execute(m_info, bind_info{a, a_info}, /* b, c, etc. */);

BenBrock commented 12 months ago

The matrix + info object is basically a view that also stuffs in the matrix info object. As long as we can take a view of the object (if it's not already a view), this should be relatively straightforward to implement. I will try to implement a mock-up.
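
As a very rough first sketch (not the mock-up itself -- bind_info and matrix_info_t here are just the placeholder names from this thread, not a settled interface), something like the following would let both bind_info{a} and bind_info{a, a_info} compile:

#include <utility>

struct matrix_info_t;  // opaque handle holding inspection-based optimizations

template <typename MatrixView>
class bind_info {
public:
  explicit bind_info(MatrixView m) : matrix_(std::move(m)) {}
  bind_info(MatrixView m, matrix_info_t& i) : matrix_(std::move(m)), info_(&i) {}

  MatrixView& matrix() { return matrix_; }
  bool has_info() const { return info_ != nullptr; }
  matrix_info_t& info() { return *info_; }  // precondition: has_info()

private:
  MatrixView matrix_;
  matrix_info_t* info_ = nullptr;  // non-owning; lifetime stays with the user
};

// Deduction guides so bind_info{a} and bind_info{a, a_info} both deduce MatrixView.
template <typename M> bind_info(M) -> bind_info<M>;
template <typename M> bind_info(M, matrix_info_t&) -> bind_info<M>;

The algorithms could then accept either a bare view or a bind_info<V> and branch on has_info().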

spencerpatty commented 12 months ago

Yes, I think we could do this and still have pretty nice-looking syntax. I'm not sure off the top of my head what the best way to keep the complexity down is, but I'll think about it.

Just so I understand fully---the idea is to add a matrix-specific inspect in addition to the operation inspect? (Presumably if you do matrix-specific inspections, you'd also want to pass those to the operation inspect as well?)

csr_matrix<float> a(/* pointers to data, matrix shape, etc. */);
auto a_info = inspect(a, /* user-provided hints */);

auto m_info = multiply_inspect(bind_info{a, a_info}, /* b, c, etc. */);
multiply_execute(m_info, bind_info{a, a_info}, /* b, c, etc. */);

Yes exactly, with the caveat that a_info (stateful + read-only optimizations based on the A matrix) could still be filled from multiple calls to _inspect() ...

csr_matrix<float> a(/* pointers to data, matrix shape, etc. */);
matrix_info_t a_info{}; // initialized
multiply_inspect(a_info, a, x, y, /* user-provided hints */); // hints for the multiply operation with matrix a are added to a_info here
multiply2_inspect(a_info, a, x, y, /* user-provided hints */); // hints for the multiply2 operation with matrix a are added to a_info here
//etc

for (/* one or more loops */) {
  multiply_execute(bind_info{a, a_info}, x, y, /* user-provided hints */);
  multiply2_execute(bind_info{a, a_info}, x, y, /* user-provided hints */);
  //etc
}

This might come in handy for the multi-stage APIs, which could have stateful + read-only data for one or more sparse matrices and which also have stateful + read/write data to persist through the multiple stages...

csr_matrix<float> a(/* pointers to data, matrix shape, etc. */);
matrix_info_t a_info{}; // initialized

csr_matrix<float> b(/* pointers to data, matrix shape, etc. */);
matrix_info_t b_info{}; // initialized

csr_matrix<float> c(/* matrix shape, etc, but unfilled with data */);

spgemm_info_t m_info{}; // info handle for C = op(A) * op(B)

multiply_inspect(a_info, a, x, y, /* user-provided hints */); // hints for the multiply operation with matrix a are added to a_info here
multiply_inspect(m_info, bind_info{a, a_info}, bind_info{b, b_info}, c, /* user-provided hints */); // may add some things to a_info, b_info, or m_info

// do spmv operation:  y=A*x
multiply_execute(bind_info{a, a_info}, x, y, /* user-provided hints */); // do A only operation

// do spgemm operation:  C = A * B
multiply_execute_stage1(m_info, bind_info{a, a_info}, bind_info{b, b_info}, c, /* user-provided hints */);
multiply_execute_stage2(m_info, bind_info{a, a_info}, bind_info{b, b_info}, c, /* user-provided hints */);
// extract sizes from m_info for c
// allocate c data arrays and fill into c csr_matrix<float>
multiply_execute_stage3(m_info, bind_info{a, a_info}, bind_info{b, b_info}, c, /* user-provided hints */);

BenBrock commented 12 months ago

Yes exactly, with the caveat that a_info (stateful + read-only optimizations based on the A matrix) could still be filled from multiple calls to _inspect() ...

csr_matrix<float> a(/* pointers to data, matrix shape, etc. */);
matrix_info_t a_info{}; // initialized
multiply_inspect(a_info, a, x, y, /* user-provided hints */); // hints for the multiply operation with matrix a are added to a_info here
multiply2_inspect(a_info, a, x, y, /* user-provided hints */); // hints for the multiply2 operation with matrix a are added to a_info here
//etc

for (/* one or more loops */) {
  multiply_execute(bind_info{a, a_info}, x, y, /* user-provided hints */);
  multiply2_execute(bind_info{a, a_info}, x, y, /* user-provided hints */);
  //etc
}

This might come in handy for the multi-stage APIs, which could have stateful + read-only data for one or more sparse matrices and which also have stateful + read/write data to persist through the multiple stages...

I'm a little confused by this code example. Is a_info meant to hold general information about the operation, or information specifically about a? In my mind, we could have two kinds of infos:

1) operation_info_t: this comes from multiply_inspect and holds generic information for the operation. (This could contain supplemental data/copies of one or more matrices.)
2) matrix_info_t: this would be created in a separate operation, unless we bound it to a particular matrix and passed it into multiply_inspect.

I'm assuming we have the following API:

// You can provide a pre-existing operation_info_t, in which case the new info will be attached to it.
template <matrix A, vector B, vector C>
void multiply_inspect(operation_info_t& info, A&& a, B&& b, C&& c);

// Or you can just get a new info.
template <matrix A, vector B, vector C>
operation_info_t multiply_inspect(A&& a, B&& b, C&& c);

// And then you pass in the operation info when executing.
template <matrix A, vector B, vector C>
void multiply_execute(operation_info_t info, A&& a, B&& b, C&& c);

Are there things that are easy to do with matrix_info_t that are difficult to do with operation_info_t? In theory you could store anything you want inside operation_info_t.

spencerpatty commented 12 months ago

I guess I am thinking that there are things that pertain to a single matrix (possibly multiple optimizations for different operations, but all pertaining to a particular matrix) -- these are the things that the <operation>_inspect() API would fill.

There are other things which pertain to a particular operation that involves more than one matrix (think C = A * B SpGEMM); we could either associate those things with the C (output) matrix or with the SpGEMM operation itself ...

I am ok with either, actually, so maybe the way to merge these ideas is to have all matrices have an associated info type which contains things that might pertain to the creation of the matrix or its use in subsequent operations ... Alternatively, we have matrix_info_t be for the use of a matrix in subsequent operations and a multistage_operation_info_t for things which pertain to passing internal state through a multi-stage operation ...
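
To make the split I am imagining a bit more concrete, roughly (all names here are placeholders, not a proposal):

// Placeholder declarations only, to illustrate the intended split.

// Tied to a single matrix. Filled by <operation>_inspect() calls and treated
// as read-only by the execute calls, so it can be shared across the different
// operations (SpMV, SpMM, TRSV, ...) that use the same matrix.
class matrix_info_t { /* opaque: analysis results, alternate layouts, ... */ };

// Tied to a single multi-stage operation (e.g. SpGEMM or a format conversion).
// Read/write: each stage may update it (sizes, work estimates, intermediate
// buffers) before the next stage consumes it.
class multistage_operation_info_t { /* opaque: per-operation scratch state */ };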

chrwarm commented 12 months ago

I like this idea of separating out matrix information (matrix_info_t), which can potentially be reused in multiple different operations, from operation-specific information (operation_info_t), which is tied to a particular combination of, say, "multiply" matrix operands.

Are there things that are easy to do with matrix_info_t that are difficult to do with operation_info_t? In theory you could store anything you want inside operation_info_t.

The current approach in Arm PL is to hide everything behind one opaque matrix type, which is simple, but the problem comes when trying to reuse the same matrix object in different contexts - some of the data structures/optimizations created for SpMM will be different to those for triangular solve, for example, hence the user would have to create two matrices, re-optimize, or live without optimizations and stick to a non-owning view.

We had discussed 3 different types of matrix object, which are nicely summarized in notes/matrices.hpp as 1) a "view", 2) an opaque object, and 3) a known-format. I'm wondering if these matrix_info_t and operation_info_t types would replace the opaque matrix type and be things that come out of the inspect/optimize call.

Also, I'm not sure I see the need for supporting known-format objects (e.g. a transparent CSR type). To simplify the interface, could we require users to create a "view" of their own arrays, and then optionally create these associated opaque matrix_info_t and operation_info_t types? In my mind that simplifies the interface, and covers users who only want libraries to work with their own data (views only) and those who are happy for libraries to allocate (view + matrix_info_t and/or operation_info_t).

spencerpatty commented 12 months ago

Yeah, I am starting to see the matrix_view + optional "opaque handle" matrix_info_t as a nice path forward. One or more stateful optimizations, including memory/data structures pertaining to the matrix_view object for one or more subsequent operations, could be stored in the "opaque handle" matrix_info_t object. In fact, a user could have more than one matrix_info_t object matching a particular matrix_view's data, for certain operations that happen together, and the user has complete control over the lifetime of those "opaque handle" matrix_info_t objects.

Of course there are other types of stateful data that don't pertain to a single particular matrix but to a multi-phase operation (think SpGEMM or format conversions here); these might need their own "opaque handle" <operation>_multiphase_info_t object to house such data through the multiple phases (and allow possible reuse for subsequent calls to a similarly sparsity-patterned operation, possibly with different data values ...)

The binding operation in the APIs that we were discussing at the beginning of this issue might be a convenient way to clearly tie such an "opaque handle" matrix_info_t to a matrix_view in that operation...

Also, I'm not sure I see the need for supporting known-format objects (e.g. a transparent CSR type). To simplify the interface, could we require users to create a "view" of their own arrays, and then optionally create these associated opaque matrix_info_t and operation_info_t types? In my mind that simplifies the interface, and covers users who only want libraries to work with their own data (views only) and those who are happy for libraries to allocate (view + matrix_info_t and/or operation_info_t).

^^ @chrwarm I am not sure I understand what you mean by "users creating their own views"? I was thinking the matrix_view was a lightweight container for user-provided raw data in some known format for which we have kernels implemented inside our library, but we also have the opportunity to analyze such a matrix_view for a particular operation in one of the <operation>_inspect() APIs and store away extra data which may lead to better kernel execution strategies or better performance with other kernels implemented inside. Is that similar to what you were thinking?
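
To be concrete about what I mean by "lightweight container", I am picturing something roughly like this (purely illustrative -- not the actual csr_view in the reference implementation):

#include <cstddef>

// Illustrative only: a non-owning wrapper over user-provided CSR arrays.
template <typename T, typename I, typename O>
struct csr_view_sketch {
  O* rowptr;           // length nrows + 1
  I* colind;           // length nnz
  T* values;           // length nnz
  std::size_t nrows;
  std::size_t ncols;
  std::size_t nnz;
};

// No allocation and no copies: the user keeps ownership of rowptr/colind/values,
// and any library-built acceleration structures would live in matrix_info_t instead.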

spencerpatty commented 12 months ago

Perhaps matrix_info_t is not a good name -- maybe matrix_opts_t is better for what I am envisioning? And <operation>_handle_t for the multiphasic one?

I reread your initial thoughts, @chrwarm, and I think I am starting to see what you are envisioning. Let me see if I can restate it in my own words (and see if I got it :) ) -- the user has data in some format as a matrix_view; they provide it to an <operation>_inspect(), and the things extracted that are generally reusable across multiple operations pertaining to that matrix are stored into a matrix_info_t object, while the things that are particular to that specific operation (possibly involving multiple matrices, or a different representation of the matrix?) would be stored in an operation_info_t object. So at the end, a user might have:

matrix_view + matrix_info_t + one or more operation_info_t objects; and for an operation, each matrix input might have matrix_view + matrix_info_t, and then there is an operation_info_t as well to be passed in?

If this is what you meant, then I think we are actually pretty close to talking about the same things. If not, then I may need some clarifications :)

chrwarm commented 12 months ago

@spencerpatty yes, what you describe in your last post is a good description.

I think the matrix_info_t and operation_info_t objects can be optional in the workflow - i.e. users could have only a matrix_view, and if they don't want any more memory to be allocated, they would just call the execute (e.g. multiply) function.

csr_view A(...);
csr_view B(...);
csr_view C(...); // for now assume we have some C to accumulate into
matrix_info_t c_info = multiply_inspect(C, A, B); // info captured about new sparsity structure of C - this info goes into c_info

// We could then offer 3 possibilities, probably giving increasing performance:

// 1: non-allocating multiplication using user's arrays via the views of A, B, C. User would have to export result from {C, c_info} 
multiply(bind{C, c_info}, A, B); // "vanilla" CSR kernel here

// OR:
// 2: we have extra information/structures for the matrices, but not operation specific information
auto a_info = matrix_inspect(A, /* hints here */); // store operation-agnostic structures related to A into a_info
auto b_info = matrix_inspect(B, ...);
multiply(bind{C, c_info}, bind{A, a_info}, bind{B, b_info});

// OR:
// 3: full optimization: we have matrix and operation information to use
auto op_info = multiply_inspect(bind{C, c_info}, bind{A, a_info}, bind{B, b_info}); // optimization specific to these operands
multiply(op_info, bind{C, c_info}, bind{A, a_info}, bind{B, b_info});

spencerpatty commented 11 months ago

I think the matrix_info_t and operation_info_t objects can be optional in the workflow - i.e. users could have only a matrix_view, and if they don't want any more memory to be allocated, they would just call the execute (e.g. multiply) function.

@chrwarm Generally I agree, but for the low-level kernels, the operation_info_t (for the multiphase APIs) will likely still be necessary, in my opinion.

I wanted to circle back on matrix_info_t use. I guess in your current implementation of the Arm APIs, you only have a single operation's optimization active at a time, right? Whereas in Intel's inspector-executor APIs, we may have multiple operation optimizations active at a time (based on the hints).

Coming from the perspective of possibly having multiple operations with active optimizations, I was thinking all of the operation optimizations requested for a given matrix could go together into the matrix_info_t object and be filled in by each call to <xyz>_inspect() when it is passed as an argument. Note the possible exception of optimizations for a multi-phase execution that involves more than one matrix (again thinking SpGEMM or conversions), which might be a small subset of the operations supported and which would have a particular <operation>_info_t object, like spgemm_info_t or format_converter_info_t, that is also part of the API calls...

In this case, the matrix_info_t is definitely optional for improved performance, whereas the others might not be...

With your Arm design, I guess I would still think of the matrix_info_t as housing any matrix-specific optimizations, and you could pass it in to each of the <xyz>_inspect() APIs similarly, but in your case, on doing so, you could remove one optimization and replace it with another, similar to what I think you were describing as your current pattern. Does that make sense?

chrwarm commented 11 months ago

@spencerpatty I think I understand what you're heading towards.

We do only keep one representation of the matrix, based on the last-optimized operation, at the moment in Arm PL. We could start to save the union of all structures created by different optimize/inspect calls, but I have a slight concern about keeping those up to date. E.g. if the user updates values in the original matrix, what happens then - do we update all our data structures under the matrix_info_t object, or do we invalidate it? It could become messy. On the other hand, passing back those optimizations in the form of operation_info_t (as I suggested) only moves the burden of managing them onto the user.

On balance, I think I'm in favour of having a smaller interface and hiding as much as possible behind matrix_info_t except for the multi-stage stuff, as you suggest above. This fits well with what we have (not that that should be a big constraint!), and I guess the number of operations that will be active with a particular matrix will be small too, so maybe the burden of managing multiple optimizations will be small.

It would be good if you could draft up an example workflow of the interface in action for the different SpGEMM use cases as you currently see them.

spencerpatty commented 11 months ago

Yes, I think the burden of managing optimizations when values might be changed will likely always be somewhat shared with the user (unless we put everything in the matrix_view itself, which we are trying to move away from), but I could envision an update_matrix_values() API which takes in a new array of values, a matrix_view, and zero or more matrix_info_t objects associated with that matrix_view, and makes the change in the matrix_view and in any passed-in matrix_info_t objects. The user then has complete control over which, if any, of the matrix_info_t objects they want to update to the new values ...
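
Something along these lines is what I have in mind (the signature and the update_values/refresh_values members are hypothetical):

#include <span>

// Hypothetical sketch: the user chooses exactly which matrix_info_t objects
// get refreshed along with the view; any not passed in are left untouched.
template <typename T, typename MatrixView, typename... Infos>
void update_matrix_values(MatrixView& a, std::span<const T> new_values,
                          Infos&... infos) {
  a.update_values(new_values);     // hypothetical: repoint/copy the value array
  (infos.refresh_values(a), ...);  // hypothetical: re-derive value-dependent data
}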

I want to point out that the whole goal of having the matrix_info_t separate from the matrix_view is to allow the user complete control over the lifetime of optimizations. With this model, the user could choose to put certain optimizations in one matrix_info_t and others in a separate one (for instance, if an iterative solver is going to be using forward and backward triangular solve and SpMV, a common set for the Conjugate Gradient algorithm, those could go into one matrix_info_t; elsewhere, where SpMM and SpTRSM operations are more commonly used together, those could be optimized for in a separate matrix_info_t object or objects).
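
As a usage sketch of that grouping (function names follow the placeholders used earlier in this thread, not a settled API):

csr_view<float, int, int> A(/* user-provided arrays */);

// One info for the CG-style group: forward/backward triangular solve + SpMV.
matrix_info_t cg_info{};
triangular_solve_inspect(cg_info, A, /* lower, hints */);
triangular_solve_inspect(cg_info, A, /* upper, hints */);
multiply_inspect(cg_info, A, x, y, /* hints */);

// A separate info for a different usage pattern (e.g. SpMM + SpTRSM),
// created, used, and destroyed independently of cg_info.
matrix_info_t blocked_info{};
// spmm_inspect(blocked_info, A, ...); sptrsm_inspect(blocked_info, A, ...); etc.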

I will definitely try to prototype the SpGEMM use case I am envisioning and share it here...

spencerpatty commented 11 months ago

I discussed much of this with @BenBrock offline and I believe that we will have some version of the SpGEMM use case presented in our next meeting ... if not, I am happy to provide it after that point

chrwarm commented 11 months ago

@BenBrock, @spencerpatty thanks for the slides yesterday, it's now really clear and I like this direction a lot, but it's worth exploring possible limitations as well I think...

We seem to be settling on views, which are nice, and I appreciate the benefits, but are there cases where they don't work well? I can think of 2 possible objections users might have.

  1. Large matrices. I mentioned this yesterday. If there are use cases where an application is running close to the memory limits of a system, then users are likely going to want optimized operations on such large matrices by allowing the library to take a copy and optimize/change format, allowing users to free the original arrays. Using views without the inspect/optimize call would forgo library optimization; alternatively, making the inspect/optimize call would likely lead to running out of memory, because the user's original arrays cannot be freed before the view object is destroyed.

  2. Might some users who want optimization find managing at least 2 data structures for a single matrix burdensome or confusing? Subjective, but it's a possibility, especially if someone just wants to do a single one-matrix operation (e.g. SpMV) with their matrix.

In both cases a possible solution is to allow users to control ownership of data via the constructor of the initial object with e.g. a bool.

spmatrix_t a1(row_ptr, col_indx, vals /*, false*/); // effectively, a view
spmatrix_t a2(row_ptr, col_indx, vals, true); // library to take ownership; user can free row_ptr etc.

If users create as a1, then they are effectively choosing a view and we're back in business.

If they create as a2, the library has control over memory and users could free their input arrays. This addresses objection 1 above.

Creation as a2 doesn't preclude having multiple matrix info objects that contain optimization data for different contexts. If users choose to create info objects, the library may then choose to treat a2 as a simple handle, freeing up most of the data copied/moved in on creation, if information held in it is made redundant by what is in the info objects.

If objection 2 is a valid one, maybe we could overload the inspect functions so that they can either return separate info objects, as proposed, or embed/hide info objects inside a2, thereby simplifying what the user has to carry around.

// Create separate matrix infos
auto info = multiply_inspect(bind_info{a1, a1_info}, bind_info{b, b_info}, c);

// Or just hide a2_info in a2 (throwing an error if a2 already has an info in it?)
auto info = multiply_inspect(a2, b, c);

Just some thoughts. After writing all this down, maybe overloading adds even more confusion!? And maybe those objections can be swept aside - it would be nice if they could be!

spencerpatty commented 11 months ago

@chrwarm thank you for thinking about these things -- the limitations. I absolutely agree with your assessment of them. :)

In particular, case 1 is going to be challenging from all directions -- both for users to do operations with their matrix and for libraries to try to optimize it in some way to make things faster. It is something we should consider in our designs. In some sense, if it is really that close to the edge, maybe it is worth having multi-device solutions as an extension exactly for cases that don't fit well?

Case 2 is more tricky -- it is well covered by the full-blown object design (not views) where control of the object is given over to the library. Maybe that could come later as an extension as well? Is there any reason we wouldn't want to have views and objects be supported in the final solution even if we start with views? I can see the objects really optimizing for simplicity -- single calls and the library does everything ... but maybe top performance isn't quite as necessary, so allocation overheads are ok? In this case wouldn't we still want the inspect/execute paradigm? Or would it be ok to have a single execute function which could do optimizations on the fly?

Your possible workaround suggestions are actually interesting as a way to have the view objects morph between views and full-blown objects ... it is definitely worth thinking about :)

chrwarm commented 11 months ago

Hi Spencer,

There is a lot to think about, we may have to prioritise some of these issues and pick out use cases that will not be well supported initially...

Is there any reason we wouldn't want to have views and objects be supported in the final solution even if we start with views?

No, I think starting with views is a good approach actually.

In this case wouldn't we still want the inspect/execute paradigm? Or would it be ok to have a single execute function which could do optimizations on the fly?

Inspect/execute still makes sense with objects - that's what we have now, right?

Chris