DrTimothyAldenDavis / GraphBLAS

SuiteSparse:GraphBLAS: graph algorithms in the language of linear algebra. For production: (default) STABLE branch. Code development: ask me for the right branch before submitting a PR. video intro: https://youtu.be/Tj5y6d7FegI .
http://faculty.cse.tamu.edu/davis/GraphBLAS.html
Other
354 stars 62 forks source link

Set name of UDT when serializing #146

Open eriknw opened 2 years ago

eriknw commented 2 years ago

When using GxB_*_serialize, I would like the option to set the name of a user-defined type to be used in the output blob. This is for maximum portability.

A sufficient solution for me would be to have the ability to change the name of a UDT after it's been created. Maybe add GxB_Type_rename? This doesn't require changing the signature of the serialize functions,

p.s. What is GxB_serialize_free mentioned here? https://github.com/DrTimothyAldenDavis/GraphBLAS/blob/1bf1dde4671775f4cc7c6e80b92d94199c9ba33b/Include/GraphBLAS.h#L11281

DrTimothyAldenDavis commented 2 years ago

The reference to GxB_serialize_free is a mistake. It should just read "free(blob)", or more precisely, it should be freed by the same free method passed to GxB_Init. I'll fix the comment, and explain in more detail in the user guide.

Changing the name of a user-defined type would be fine for now, but it would destabilize a future version of GraphBLAS. I will be using that name to create GPU and CPU kernels with a jit, and if the name changes, the jitified kernel will break. So GxB_Type_rename is not a good idea.

eriknw commented 2 years ago

Understood, thanks.

There are a few reasons why being able to set the name of the UDT when serializing would be nice for us:

The last one can come up in python-graphblas if users register multiple UDTs that are actually the same underlying type but with different names, in which case we only ever register one with SuiteSparse:GraphBLAS. This was done for technical/usability reasons, so maybe we could work around those and create a new UDT for each name. Even so, there may be other reasons users would want to be able to control the serialization name of UDTs.

In the future, will UDT names need to be valid identifiers? I sure hope not. What would be the cost of changing the name in the future with JIT? Would things need to be rejitted? Why not give it an internal unique identifier for such a purpose (would type_defn work?) and let the name be user-facing and modifiable?

DrTimothyAldenDavis commented 2 years ago

The user-defined type names should be a valid identifier that can be used as in the string "wildtype" for this:

typedef struct
{
    float stuff [4][4] ;
    char whatstuff [64] ;
}
wildtype ;

For that type, the name should be the string 'wildtype'. I need a valid identifier to name the type if I am to create a valid kernel that uses the type internally.

So the type can't include any spaces, and can't start with a digit, for example. No semicolons, dots, or whatever.

Constructing my own internal typedef name would be really hard. I need a consistent name to communicate with the user, to return to GxB_Matrix_type for example, and to manage types in JIT'ed kernels and such.

eriknw commented 2 years ago

I see. My bad again for not reading the docs thoroughly.

I guess a more general goal would be to allow users to include extra metadata (perhaps as a string) when serializing an object, and to be able to retrieve this extra metadata when deserializing. This would give me a place--as part of your file format--to include a description of the datatype, which can be used to create the new type in Python if necessary.

eriknw commented 2 years ago

Requiring the name to be a valid identifier really is kind of a pain. In Python, making new UDTs is as easy as:

# Simple struct
>>> gb.dtypes.lookup_dtype({'x': int, 'y': float}).name
"{'x': INT64, 'y': FP64}"

# More complicated (datetime, fixed-length array, and fixed-length string)
>>> gb.dtypes.lookup_dtype({'x': np.datetime64, 'y': np.dtype((int, 3)), 'z': 'S5'}).name
"{'x': dtype('<M8'), 'y': INT64[3], 'z': dtype('S5')}"

This shows default names that we use. Being flexible with dtypes allows us to easily interoperate with other libraries in the Python ecosystem. It would be a burden to require users to give a name that's a valid identifier.

If we are to use the JIT with UDTs in a future version of SuiteSparse:GraphBLAS, I think we'll need to generate our own automatic names to give to SuiteSparse, such as "PythonUDT0", "PythonUDT1", etc. This seems non-ideal and makes it super-hard to rely on your serialization (and I like your serialization!).

DrTimothyAldenDavis commented 2 years ago

I'll keep this in mind. The first place where I'll introduce a JIT is for the CUDA kernels, for built-in types, and then user-defined types after that. The typedef, as an entire string, and a valid type identifier, will both be required:

typedef struct
{
    float stuff [4][4] ;
    char whatstuff [64] ;
}
wildtype ;
...
GrB_Type MyWildType ;
GxB_Type_new (&MyWildType, sizeof (wildtype), "wildtype",
"typedef struct { float stuff [4][4] ; char whatstuff [64] ; } wildtype ;") ;

where the "typedef" and trailing identifier "wildtype ;" might not be strictly required. I haven't written this yet but that is my plan. I will need both strings: the typedef (as a proper C typedef not a python typedef), and a valid identifier, so that I can inject them into a file, in C / C++, which I can then compile as a CUDA kernel or as a C-based OpenMP kernel.

eriknw commented 2 years ago

Thanks for the consideration. I'm looking forward to a JIT, although I'm a little paranoid about struct alignment.

imho, it may be reasonable to distinguish between a name, which is flexible, and an identity, which must be a valid identifier. The former is user-facing, the latter compiler-facing. The name can be the identifier by default.

For example, if one is naming an array dtype, it would be natural to name it something like "INT64[8]" or "int64_t[8, 16]". I've worked with teams that take naming semantics very seriously.

DrTimothyAldenDavis commented 2 years ago

That might be feasible. Currently, my GxB_Type_new method has 2 parameters: the type name (a valid identified) and the type definition (a string with the typedef in C syntax:

#define GxB_MAX_NAME_LEN 128
GrB_Info GxB_Type_new           // create a new named GraphBLAS type
(
    GrB_Type *type,             // handle of user type to create
    size_t sizeof_ctype,        // size = sizeof (ctype) of the C type
    const char *type_name,      // name of the type (max 128 characters)
    const char *type_defn       // typedef for the type (no max length)
) ;

Internally, I hold this object as follows:

struct GB_Type_opaque       // content of GrB_Type
{
    int64_t magic ;         // for detecting uninitialized objects
    size_t header_size ;    // size of the malloc'd block for this struct, or 0
    size_t size ;           // size of the type
    GB_Type_code code ;     // the type code
    char name [GxB_MAX_NAME_LEN] ;       // name of the type
    char *defn ;            // type definition (currently unused)
} ;

The Type->defn field is currently unused, but I'll use in the future GPU and CPU JITs. The Type->name is fixed in size, to 128 characters max (including the NULL terminator). The Type->name field is returned to the caller via GxB_Type_name, and it will also be used in the JITs. The Type->defn is unbounded in size (I malloc it and keep a copy). I do not yet have a method for returning the Type->defn but I will add that once I start using it. Perhaps GxB_Type_defn could be a method that returns the full definition of the type, as a string.

There are 2 possible solutions for your case. One approach would be to include a comment in the Type->defn, so the string passed in could be

"/* int64_t[8, 16] */ typedef struct { int64_t stuff ; int64_t morestuff [2] ; } myint8_16_type ;"

That would solve 2 problems: the same Type->defn could be parsed (by the python interface) to extract the user-preferred name (which need not be an identifier), and I can still use this Type->defn in my JIT'd kernels. It would not change my API. Would that work? You could even include more python-specific details in the C comments of the typedef string. There would be no limit to its size, since I just scan the string for its string length, and malloc accordingly. My JIT will just treat this string as a black-box, by just handing it over to the compiler after inserting it into the JIT'd C or CUDA kernel.

You could even pass in the following as a single string for my defn parameter to GxB_Type_new, with the exact python definition given to me as a C comment:

/*
gb.dtypes.lookup_dtype({'x': np.datetime64, 'y': np.dtype((int, 3)), 'z': 'S5'}).name
"{'x': dtype('<M8'), 'y': INT64[3], 'z': dtype('S5')}"
*/
typedef struct { datetime64 x ; int64_t y [3] ; double z ; } mypythontype42 ;

The other approach would be to add another parameter to GxB_Type_new, so I would have a type name (any string) a type identifier (a valid C identifier) and a type definition (a string with a valid C typedef). That would be an API-breaking change, however, since I already have GxB_Type_new defined in its current form, for my future JIT.

DrTimothyAldenDavis commented 2 years ago

Regarding struct alignment: I will simply be trusting the C compiler to do whatever it normally does for a struct. I can't write a compiler to parse the typedef so I have to trust the C compiler to define it, and also to compile the C functions given to GxB_BinaryOp_new, etc.

eriknw commented 2 years ago

Thanks again for your time. Yeah, both of those options probably work for me. Does or will your serialization include the type_defn string? If so, then, yes, both solutions work for me. (Also, does your serialization have a version number?).

If users don't give me a valid identifier to name a UDT, I'll just give it one myself.

DrTimothyAldenDavis commented 2 years ago

My serialization does not yet include the type_defn string, but I could add it. The blob does include a version number. I currently don't check it since any of my serialized blobs written by any of my GrB library versions can be read/written by any other GrB library version.

If I were to add the type_defn string, I could do so in an upper-ward compatible way, where (say) SS:GraphBLAS v7.2 would write the blob with the type_defn included. That blob could be read by any of my SS:GraphBLAS versions; they would just ignore the added type_defn. But v7.2.0 could then read back the type_defn.

I do include the type identifier, as an uncompressed string.

DrTimothyAldenDavis commented 1 year ago

The GrB_get/set method can set an arbitrary string as the GrB_NAME of a type. It can be the entire type definition if you like. I don't need the name for the JIT, just for get/set. Will this resolve the issue?

I've added both the matrix name and the GrB_ELTYPE_STRING (the GrB_NAME of the type of the matrix) to the serialized blob, in the draft v8.1.0.