GraphBLAS / binsparse-specification

A cross-platform binary storage format for sparse data, particularly sparse matrices.
https://graphblas.org/binsparse-specification/
BSD 3-Clause "New" or "Revised" License
15 stars 4 forks source link

add reference output #39

Open willow-ahrens opened 10 months ago

willow-ahrens commented 10 months ago

This PR adds reference output for binsparse matrices and vectors @BenBrock @ivirshup. fixes #38

github-actions[bot] commented 10 months ago

Automated Review URLs

BenBrock commented 9 months ago

Hi @willow-ahrens, just took a look and ran into two issues with my parser:

1) My parser reads all the binsparse matrices as having shape [0, 0], which I don't think is correct. Not sure if this is a parsing issue or an error in the files. 2) There's no "version" key in the files, which my parser expects. Maybe we can add a "version" key equal to 0.1?

willow-ahrens commented 9 months ago

I think it should be fixed now!

BenBrock commented 4 months ago

I'm still not seeing a version in the HDF5, which is causing my parser to fail. Here's what I see in hdf5dump for the binsparse JSON in /examples/reference/b1_ss_COO/b1_ss_COO.bsp.h5:

   ATTRIBUTE "binsparse" {
      DATATYPE  H5T_STRING {
         STRSIZE 328;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "{
               "binsparse": {
                   "format": "COOR",
                   "fill": true,
                   "shape": [
                       7,
                       7
                   ],
                   "data_types": {
                       "indices_0": "int64",
                       "indices_1": "int64",
                       "values": "float64",
                       "fill_value": "float64"
                   },
                   "attrs": {}
               }
           }
           "
      }
   }

Could we add a version tag with value 1.0? We should also perhaps discuss how version checking should work at some point. (e.g. should a non-tensor parser try to accept everything in [1.0, 2.0) and fail upon encountering something unparsable, or instead only accept something in [1.0, highest_known_nontensor_version]?)

willow-ahrens commented 4 months ago

oops! I fixed Finch but I didn't regenerate the reference files. I have checked there is a version, it looks like it's currently set to 0.1. Should we set it to 1.0?

Regarding tensor vs. matrix formats, I think that tensors should be v1.0, but tensor parsers should normalize to the matrix format tags whenever possible. If it helps, we could define a key that tells us if there is tensor stuff.

Whether or not there are tensor formats feels orthogonal to 1.0 vs 2.0. We may one day want to upgrade matrix version from 1.0 to 2.0, but that shouldn't have to mean the upgraded version needs to support tensors.

BenBrock commented 4 months ago

I can see the version there now, but it's stored as a string, not a number ("1.0" vs. 1.0), which creates some problems for my parser. Could you update again?

That makes sense about adding a key for the tensor extensions. I added a note to the agenda for our next meeting about that.

willow-ahrens commented 4 months ago

Are we sure we want to store versions as numbers? you mean a floating point number?

DrTimothyAldenDavis commented 4 months ago

It would be best to have it as a string or as an array of 3 integers. Version 1.12 > 1.2 for example.

DrTimothyAldenDavis commented 4 months ago

And also a number can’t be 1.2.3.

BenBrock commented 4 months ago

Okay, so it sounds like we're going with a string version, where the string must satisfy the regex ^\d+.\d+$. [illustrative examples]

e.g., there's a major and minor version number only.

BenBrock commented 4 months ago

@willow-ahrens I can read the binsparse files successfully now, but I have a couple of issues with the Matrix Market files. Would it be possible to just use the raw files from SuiteSparse Matrix Collection? My parser doesn't work on the tensor object type you use in these files, which seems to swap the row and column index.

willow-ahrens commented 4 months ago

I'll try to fix my matrixmarket today

willow-ahrens commented 4 months ago

@BenBrock I investigated this, and updated the matrixmarket writer to use matrix when applicable. Regarding row and column swapping, these mtx files match the index ordering used in the original files, so I don't think that's the issue?

BenBrock commented 4 months ago

Thanks, I can read the Matrix Market files, although I do still have a little problem with the rows/columns, which appear to me to be swapped in the Binsparse files.

Here's b1_ss.mtx as an example:

%%MatrixMarket matrix coordinate real general
7 7 15
5 1 -0.03599942
6 1 -0.0176371
7 1 -0.007721779
1 2 1.0
2 2 -1.0
1 3 1.0
3 3 -1.0
1 4 1.0
4 4 -1.0
2 5 0.45
5 5 1.0
3 6 0.1
6 6 1.0
*4 7 0.45* <- Let's look at this stored value as an example.
7 7 1.0

There's a stored value at (4, 7) one-indexed. That's (3, 6) zero-indexed, so that's what we should see in Binsparse. If I h5dump the included b1_ss_COO.bsp.h5 to compare, I see the following indices_0 (row indices) and indices_1 (column indices):

   DATASET "indices_0" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SIMPLE { ( 15 ) / ( 15 ) }
      DATA {
      (0): 1, 2, 3, 1, 4, 2, 5, 3, *6*, 0, 4, 0, 5, 0, 6
      }
   }
   DATASET "indices_1" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SIMPLE { ( 15 ) / ( 15 ) }
      DATA {
      (0): 0, 0, 0, 1, 1, 2, 2, 3, *3*, 4, 4, 5, 5, 6, 6
      }
   }

I've highlighted our indices with *, which are stored at indices_0[8] and indices_1[8]. For COOR, indices_0 holds row indices and indices_1 holds column indices, which means the Binsparse file stored (6, 3), not (3, 6) as it should be.

Not sure exactly where things are going wrong---maybe you're defaulting to COOC for COO instead of COOR?

willow-ahrens commented 4 months ago

Ah, I see. I think I caught it, it was on my side, the writer for COO was mixing up the numbering of the levels.