GraphBLAS / graphblas-api-c

Other
7 stars 3 forks source link

export of a matrix in full format when entries are missing: should be an error #42

Closed DrTimothyAldenDavis closed 2 years ago

DrTimothyAldenDavis commented 3 years ago

Scott mentions:

[Scott: How can I export a sparse GraphBLAS matrix to a dense format without specifying what the unstored value is?]

You can't, and we shouldn't state anywhere what the implicit value is. That notion doesn't appear in GraphBLAS at all, except perhaps as the monoid identity value but even with that, the identity value is not tied to a matrix or vector, but to an operator. I would argue that attempting to export a matrix with any entries missing leads to an error, and the export fails.

BenBrock commented 3 years ago

I would like to avoid, if at all possible, a situation where a program is able to export fine with a particular dataset, but upon changing the dataset, suddenly starts crashing. Here's my preferred way of dealing with this:

Modify the dense format so that it's not just a dense array, but a dense array accompanied by an array of bits indicating which locations in the matrix contain stored values. This way, you can always export to any format, no matter what your data happens to be.

More formally:

Redefine the GrB_DENSE_ROW_FORMAT format. Here, the values array points to an array of size number of columns times number of rows, where element i,j is located at index i*ncols + j. indices will be an array of size ncols*nrows / (sizeof(GrB_Index)*8) (in C arithmetic). The nth bit of indices indicates whether the correspond location (i=n/ncols,j=n%ncols) contains a stored value. That is to say the C expression indices[(i*ncols + j) / (sizeof(GrB_Index)*8)] & (GrB_Index(0x1) << ((i*ncols + j) % sizeof(GrB_Index)*8)) indicates if there is an element stored at location i,j in the matrix. If the expression evaluates to true, there is an element stored at location i,j, and if it evaluates to false, there is no element at that location. The value of values[i*ncols + j] is undefined for locations i,j that do not contain stored values. The indptr array is still unused.

GrB_DENSE_COL_FORMAT is the same, just swap i*ncols + j for i + j*nrows.

My rationale for changing the dense row format is that a matrix that is stored as dense internally must be accompanied by a bitmap anyway, or else some other way to indicate which matrix locations do not contain stored values, otherwise you'd have to switch matrix formats entirely as soon as a single element is deleted, and you couldn't use dense storage for mostly dense matrices.

This change would apply both to import and to export, meaning that users would need to provide the bit array on input as well. We discussed making the bit array optional. For input, I think this is simple enough: if the user passes in NULL for indices, we can assume that every element in the array values corresponds to a stored element. I think this is relatively error-free, since if you have a dense matrix yourself, you're likely to have something else stored in every location. Otherwise, you'd have some other data structure (e.g. a bit array) along with it, and would likely think to provide that information to GraphBLAS as well.

For output, we could do the same thing (if users provide NULL for indices, don't write anything to it), but that is somewhat more dangerous. It opens the door up for programs that are correct on certain datasets (when everything is dense), but then become incorrect if you change the dataset. Personally, I'd say require the bitsets for both import and export, but I'll allow others to comment.

In terms of the cost of copying the bit array, it's pretty minimal, particularly in the case where you didn't need it (you're just memset'ing a bunch of ones). Even so, in the worst case it adds 1/8 to memory usage (if your elements have a size of one byte) and probably adds more like 1/32 or 1/64. Small constant factors here.

If you're staunchly opposed to adding a bit array to the dense formats, I'd personally be in favor of removing dense formats entirely, at least for GraphBLAS 2.0. COO is easy enough to use, and only adds a (somewhat larger) constant factor of memory usage anyway.

DrTimothyAldenDavis commented 3 years ago

It could be done, but it's not really a dense matrix format. I have this format already, as the GxB_BITMAP format. The pattern array is not actually a bit matrix, but an array of bools.

If a matrix is truly dense then the bitmap (bool or bit) is a lot of overhead and you don't want to import/export it. Dense vectors are very common and they should be very fast and simple; no pattern or indices array for truly dense vectors/matrices when all entries are present.

It's easy for a user application to determine if all entries are present (GrB_Matrix_nvals would equal nrows*ncols). I don't see this as a "crash" if an attempt is made to export a matrix as dense but one or more entries are not in the pattern.

DrTimothyAldenDavis commented 3 years ago

Another point: we already have a precedence for a method "failing" if an entry is missing. When passing a GrB_Scalar to many methods, the scalar must have an entry. If it doesn't have one, an error GrB_EMPTY_OBJECT is returned.

The error is returned for the same reason an export of a matrix in DENSE_ROW/COL format fails: an entry is missing so the action cannot be taken.

So I see no reason for a DENSE_ROW/COL export to work only if all entries are present.

A dense matrix or vector is a very simple data structure for a user to handle, unlike the BITMAP format which is much more complex. If suddenly we return the matrix/vector in BITMAP format instead of a DENSE format, the user code has to handle this case, which is likely unexpected. We should not return BITMAP if the user is not expecting it.

BenBrock commented 3 years ago

If a matrix is truly dense then the bitmap (bool or bit) is a lot of overhead and you don't want to import/export it. Dense vectors are very common and they should be very fast and simple; no pattern or indices array for truly dense vectors/matrices when all entries are present.

I would argue that the bitmap is not a lot of overhead. In the worst case, the dense matrix contains a type that is 1 byte in size. Adding an extra bit alongside that adds 1/8 the size in overhead. (1.125x memory usage). More likely, your matrix contains 4- or 8-byte types, for 1/32 or 1/64 the size in overhead (~1.032x or ~1.016x memory usage).

Also, in the truly dense case, this is a memset operation, not a memcpy, so it can be done more cheaply as well.

A dense matrix or vector is a very simple data structure for a user to handle, unlike the BITMAP format which is much more complex. If suddenly we return the matrix/vector in BITMAP format instead of a DENSE format, the user code has to handle this case, which is likely unexpected. We should not return BITMAP if the user is not expecting it.

I definitely think this is worth thinking about. If adding a bitmap makes things too complicated for users, we shouldn't do it. Here are my thoughts on this (assuming we don't make indices optional for either import or export):

When exporting, the bitmap is simple to work with. The expected size of indices will be output by GrB_Matrix_exportSize, and all the user needs to do is allocate a buffer that big. They can choose to ignore indices after exporting (possibly at their peril, if they don't check that the matrix was full).

So we add two extra lines of code to export, one to allocate a buffer for indices, and one to deallocate that buffer.

When importing, the user will now have to pass in a bitmap for indices. I will admit that the math for calculating the number of GrB_Index elements you need to have nrows*ncols bits can be slightly messy, but it is fairly straightforward. The number of elements needed for indices is equal to ceil(nrows*ncols / (sizeof(GrB_Index)*CHAR_BIT)). (CHAR_BIT, the number of bits in a byte, is almost universally eight, but let's be super general here.) So you need to allocate this many GrB_Index elements, then memset it.

Here's an example of what this might look like, defining a WORDS_PER_BITARRAY macro to make things slightly neater.

#define WORDS_PER_BITARRAY(wordsize, num_bits) (num_bits + wordsize*CHAR_BIT - 1) / (wordsize*CHAR_BIT)

// We need a bit for each element
size_t num_bits = nrows*ncols;
// The number of bits in each word.
size_t num_elem = WORDS_PER_BITARRAY(sizeof(GrB_Index), num_bits);

GrB_Index* indices = (GrB_Index *) malloc(sizeof(GrB_Index)*num_elem);
memset(indices, -1, num_elem*sizeof(GrB_Index));

I'll admit this is annoying, particularly if you're just filling it with all ones. I would propose allowing the user to pass in NULL for indices, at least for import.

DrTimothyAldenDavis commented 3 years ago

The bit array is trickier than a bool array, but both of them are very different than a dense matrix. If a user wants to write code for a dense matrix, it's very easy. Asking them to work on a bitmap is asking a lot, particularly if they are expecting only a truly dense array. They'd have to write code to handle both, which is difficult.

Supporting a bitmap is a good idea in general (although I really prefer one byte per entry not one bit, since bit fiddling is so tricky). HOwever, it should not be mixed at all with the GrBDENSE*_FORMAT. It should be its own format, GrB_BITMAP_ROW_FORMAT say. Then if some entries are missing, exporting in GrB_BITMAP_ROW_FORMAT works fine, and the user gets what they expect: a bitmap plus an array of values.

If the format is GrB_DENSE_ROW_FORMAT, the user should provide just one dense array of values on input, and expect to get just one dense array of values on output, with no bitmaps mixed in.

Changing formats like this on the user is awkward. They should get what they ask for, and if it cannot be done (like exporting a matrix with missing entries in GrB_DENSE_ROW_FORMAT) they should get an error ... because if they are doing that then they expect it to be truly dense, and missing entries is a bug. Punting to a bitmap is not a good idea in this case.

mcmillan03 commented 2 years ago

At this time we are not going to support a dense format. There are too many issues to resolve to meet the Version 2.0 deadline. Ben will remove the format and the issue will be put on the backlog.

DrTimothyAldenDavis commented 2 years ago

That sounds like a good plan.

DrTimothyAldenDavis commented 2 years ago

I've removed the GrBDENSE*_FORMAT from my latest drafts, v5.2.0.alpha16 and v6.0.0.alpha16. I left the code that implemented them commented out, in case they get added back later.

DrTimothyAldenDavis commented 2 years ago

I think this issue can be closed now, right? We could add an issue for the v2.1 spec, to support GrB_DENSE_ROW_FORMAT, GrB_DENSE_COL_FORMAT, and the bitmap variants as well.

BenBrock commented 2 years ago

Closing this issue, since dense formats have been removed from the spec (for now).