Should the binary(De)Serializer be "unstructured" by default?

chapel-lang / chapel

a Productive Parallel Programming Language

https://chapel-lang.org

Other

1.78k stars 420 forks source link

Should the binary(De)Serializer be "unstructured" by default? #23712

Closed benharsh closed 10 months ago

benharsh commented 12 months ago

Currently the binarySerializer and binaryDeserializer types add some additional binary data for certain types to make it easier to read types, or to deal with errors. For example, classes begin with a single byte to indicate whether the value is nil, and strings are preceded by a variable-width integer value representing the string's length.

In the case where a user wishes to explicitly control the entirety of the binary output, these kinds of additional data may be undesirable. Furthermore, the only means of disabling this "structured" format is through an undocumented "_structured" argument to the (De)Serializers' initializers.

A few questions to consider:

1) Should we have a binary (De)Serializer that is explicitly unstructured, and allows/encourages users to write their binary data however they wish? 2) Should the existing binarySerializer and binaryDeserializer types fulfill such a purpose?

If not, should we create a different package module to support this use case? 3) If so, should such behavior be the default? 4) Should we stabilize the "_structured" argument, and rename to something more appropriate?

It should be noted that the existing "format" for binary(De)Serializer is quite simplistic and is very close to matching a hypothetical "unstructured" format. It might be worth considering making this change in the interest of simplicity, and spending more time developing a more robust binary format to satisfy the "easy to write, easy to read" out-of-the-box behavior that we were originally aiming for.

lydia-duncan commented 12 months ago

Without this byte, do we preserve the "nilability" knowledge of the type? E.g., if we wrote something that was nilable but not nil out and then read it back in, would we store the contents in a nilable instance of the class or a non-nilable instance?

I definitely think that having some stabilized way to control it sounds valuable. I think I'd have to think more about whether it should be a flag to the (de)serializer or a separate (de)serializer, though.

benharsh commented 12 months ago

Without this byte, do we preserve the "nilability" knowledge of the type? E.g., if we wrote something that was nilable but not nil out and then read it back in, would we store the contents in a nilable instance of the class or a non-nilable instance?

This byte isn't meant to represent the "nilable" aspect of the original type, but whether the value is "nil" or not. The result of reading a type depends on how the user wishes to read the type, which they indicate via the type or value they pass to fileReader.read.

lydia-duncan commented 12 months ago

Okay, cool, thanks for clarifying (and I think I understand why we don't want what I was asking about, too). My second paragraph still stands independent of that

mppf commented 11 months ago

IMO there are conflicting goals that probably need to be met by different binary serializer/deserializers:

general "if you wrote it you can read it back in"
no-frills "what you write is what you get"

Historically, when working on the pre-serializers binary format, I was more concerned with (1). Which is why it did things like output string lengths before a string. Otherwise, there is no way to read a string. (When would it stop?). However, in the current serializer/deserializer world, this kind of thing might fit better in a corresponding feature to "pickle".

In our I/O design discussions, I have been expecting that somebody who wanted such total control over the output wouldn't be using a serializer at all. That said, I don't think there's anything wrong with making a simplified binary serializer. I think it's much less clear if it's even possible to have a simplified binary deserializer (in particular - how can it ever read a string?).

psath commented 11 months ago

In our I/O design discussions, I have been expecting that somebody who wanted such total control over the output wouldn't be using a serializer at all. That said, I don't think there's anything wrong with making a simplified binary serializer. I think it's much less clear if it's even possible to have a simplified binary deserializer (in particular - how can it ever read a string?).

From what I can tell aren't the serialize/deserialize the formal replacement for readThis/writeThis though? In which case this change has introduced a regression to their previous no-frills behavior. (Unless you know about the _structured flag, you can't even read in data you generated yourself in Chapel 1.31.) Capturing all of a class's read/write semantics in these functions is a good design, people who want no-frills binary (and have had it in previous versions) shouldn't be locked out of that.

Describing the presence/absence or length of possibly-missing/dynamic-length fields (such as strings) as part of a fixed-sized user-defined binary header is pretty standard. That the user is responsible for self-managing flex-sized elements of their binary format is a normal assumption in the C-compatible world.

I agree that there does seem to be a conflict between "all such meta-descriptors must be manually user-defined" (the C-adjacent way that I'm expecting) and "the compiler/runtime may tag binary data with auto-generated descriptors" (I'm not a python programmer, but this sounds like my understanding of pickles). In the latter those cases need to be boldly documented in the respective serializer.

I can see a de novo Chapel application benefiting from (1), if it doesn't need file format compatibility with other languages (or they are ok with handling these auto-generated descriptors in the other languages). But (2) is a hard requirement for interop with Fortran/C/C++, where much of the HPC market still lives. So it may be the case that the two functionalities should be split, (1) can go into structuredBinary(De)Serializer and be the "recommended" approach, and (2) can go into rawBinary(De)Serializer for "interoperability and maximum control".

mppf commented 11 months ago

In which case this change has introduced a regression to their previous no-frills behavior. (Unless you know about the _structured flag, you can't even read in data you generated yourself in Chapel 1.31.)

I wouldn't describe it as a regression myself, since the previous behavior was not stable. But it is surely a behavior change and it's surely causing trouble for you. So we'd like to address the issue in some way.

But (2) is a hard requirement for interop with Fortran/C/C++, where much of the HPC market still lives.

I have been expecting that if you are working with a particular file format, you would either write your own myFileFormatSerializer/myFileFormatDeserializer (for reasonably general file formats like JSON, Toml, perhaps HDF5); or you would not use serializers at all. The I/O library has lots of functions that always do binary I/O (writeBinary writeBytes etc). Or, you would use the C interoperability mechanisms to work with an existing C library that can work with that file format.

Another option immediately available to you is to use low-level I/O calls like writeBinary / writeBytes etc within your own serialize function. (Mainly these are all the I/O calls other than write and writeln and writef("%?")). This will allow you to have complete control over how your types output at the expense of not allowing different serializers to work with them (and, in particular, if you just writeln your type, you will get a bunch of binary output). I suspect that this might be the most satisfying to you, in that you can keep the same structure you had before with writeThis / readThis and just switch to using these low-level calls in your functions there.

So it may be the case that the two functionalities should be split, (1) can go into structuredBinary(De)Serializer and be the "recommended" approach, and (2) can go into rawBinary(De)Serializer for "interoperability and maximum control".

I think that having the two functionalities split sounds like a good idea to me. What to call them, and what should be the behavior of the binarySerializer, I am not so sure of. I think it'd be nice to have the binary serializer/deserializer for "if you wrote it you can read it back in" be something we can evolve into handling class-based data types with cycles (and so it would not be stable for classes for a while). Meanwhile, the raw/simplistic/maximum control/no-frills serializer could have stable behavior soon. Perhaps we wouldn't offer corresponding deserializer, or perhaps it'd just throw if you ever try to read a string or bytes — (but, if it throws for those types, the only way to respond to that would be to write deserialize methods using readBinary etc to get string lengths and then the data type would no longer be able to be serialized and deserialized with different serializers/deserializers like JSON. So, IMO, that's not really better than asking people in this situation to use readBinary etc within their deserialization methods for their types.)

psath commented 11 months ago

I have been expecting that if you are working with a particular file format, you would either write your own myFileFormatSerializer/myFileFormatDeserializer (for reasonably general file formats like JSON, Toml, perhaps HDF5); or you would not use serializers at all. The I/O library has lots of functions that always do binary I/O (writeBinary writeBytes etc). Or, you would use the C interoperability mechanisms to work with an existing C library that can work with that file format.

If the answer to "can I use my existing data in Chapel?" is "not without writing C", then I have no incentive to leave C. If the answer is "not without understanding the language enough to write a custom (De)Serializer class" (which may be the better long-term option, sure), then you've cut off a swathe of potential new users with a hard roadblock.

Another option immediately available to you is to use low-level I/O calls like writeBinary / writeBytes etc within your own serialize function. (Mainly these are all the I/O calls other than write and writeln and writef("%?")). This will allow you to have complete control over how your types output at the expense of not allowing different serializers to work with them (and, in particular, if you just writeln your type, you will get a bunch of binary output). I suspect that this might be the most satisfying to you, in that you can keep the same structure you had before with writeThis / readThis and just switch to using these low-level calls in your functions there.

I can't even get into my class's custom deserialize function because the binaryDeserializer(endian=ioendian.little, _structured=true) throws a new runtime error expecting to read the nilable byte that does not exist in my 1.31-compatible file, before it hands control to my function. The problem is that it is expecting this byte before it yields to the user function.

mppf commented 11 months ago

If the answer to "can I use my existing data in Chapel?" is "not without writing C", then I have no incentive to leave C. If the answer is "not without understanding the language enough to write a custom (De)Serializer class" (which may be the better long-term option, sure), then you've cut off a swathe of potential new users with a hard roadblock.

To be clear, I was advocating for reusing an existing C library via interoperability features, rather than having to re-implement code to handle a file format. I was not advocating for writing C code to solve this problem.

I can't even get into my class's custom deserialize function because the binaryDeserializer(endian=ioendian.little, _structured=true) throws a new runtime error expecting to read the nilable byte that does not exist in my 1.31-compatible file, before it hands control to my function.

Good point. As a result, the strategy of using readBinary etc in your deserialization requires _structured=false, which is not documented or user facing. Which leads to the questions in the OP of this issue.

psath commented 11 months ago

To be clear, I was advocating for reusing an existing C library via interoperability features, rather than having to re-implement code to handle a file format. I was not advocating for writing C code to solve this problem.

Ahh, apologies for my misunderstanding. We are doing a test exercise in whole-code porting so we'd like to be able to port all the functionality into pure Chapel for an apples-to-apples productivity comparison. C library interop is an important feather in Chapel's cap for other use-cases though for sure!

kwaters4 commented 11 months ago

Should new users (New User: other language or Chapel < 1.32 user) be expected to know (by default) about the nil byte when they move to and work with (De)Serializers and know that there is an undocumented flag to work around Chapel's class nilability and other meta data? The underlying assumption being made is that binary data going forward will be written by the Chapel (De)Serializers or users must know about the "nilable byte", which seems unreasonable for newer users coming to the language. I would argue that most binary data would be created by other programs and then piped into the Chapel framework to start and maybe also in the long run, but that is from my prior experience.

I am on the fence here between having the _strucutured flag being named something more meaningful and documented or having separate methods for compatibility and newer users. I understand the desire for users to write their own frameworks, but looking at adoption from already established code bases this would be a hurdle and something that could be pushed down the road after an initial testing phase.

I do not have experience yet in the newer framework that was released in 1.32, but I wanted to understand this issue while the discussion was still fresh and make sure I understand the discussion at hand. Please let me know if I am off the mark here.

bradcray commented 11 months ago

Just summarizing what others have said to see if we're getting onto the same page:

I think we're hearing consistent feedback from @psath and @kwaters4 that there exists an important need for a "raw" binary serializer/deserializer like the no-frills option as Michael calls it, with the major goal of being able to read in binary data created in traditional languages or other no-frills serializers (like, arguably, Chapel 1.31 and earlier) with ease. This seems like something we should prioritize for 1.33 which is coming up fast (release date will likely be mid-December). @psath, would waiting for that release be a hardship for you in any way? (e.g., would you advocate for a point release or benefit from a patch between now and then?)

As Michael notes, what the "default" binarySerializer/Deserializer should do is less clear to me. Other than the fact that they involve longer identifiers, I like Paul's route of having more descriptive names like rawBinarySerializer and structuredBinarySerializer (or maybe rawBinSer and structBinSer?). Another potential option would be to have a required argument to binarySerializer() saying which style to use— @benharsh, since you're the IO framework expert, would such an approach (if we liked it from a UI standpoint) be annoying from the implementation/maintenance point of view (in terms of requiring disparate things to be mixed more closely in the code than ideal? Or could things be structured such that the option essentially branched to two distinct types, each in their own files, etc.?)

Kevin and Paul: As the users on this issue so far, your input on naming and ergonomics in how to choose between these options in code would be particularly valuable.

Paul's point about not being able to deal with the null byte prior to the user methods in the framework being called also stood out for me and seemed like an obvious argument for exposing the _structured flag in some way, though how that is done seems as though it's not as immediately pressing as the raw format support. Specifically, I'm wondering whether we'll want a variety of structured-style knobs for things other than just class nilability and that this is just the first we've run into? But again, it feels like maybe we have some time to wrestle with that question if the structured format is not as crucial for 1.33?

Michael's point about the structured ser/deser being related to the pickling notion we've discussed also resonated with me in the sense of supporting the theme of "dump this out so I can read it back in and get the exact same thing without worrying about anything". So I'd like to suggest that rather than thinking of there as being three binary formats—raw, structured (as it exists today), and pickled (as we might envision it)—maybe the right thing is to merge the pickled and structured formats? (since, arguably, a structured format that didn't de-duplicate class instances had multiple pointees would still be losing some important structure?).

psath commented 11 months ago

Just summarizing what others have said to see if we're getting onto the same page:

I think we're hearing consistent feedback from @psath and @kwaters4 that there exists an important need for a "raw" binary serializer/deserializer like the no-frills option as Michael calls it, with the major goal of being able to read in binary data created in traditional languages or other no-frills serializers (like, arguably, Chapel 1.31 and earlier) with ease. This seems like something we should prioritize for 1.33 which is coming up fast (release date will likely be mid-December). @psath, would waiting for that release be a hardship for you in any way? (e.g., would you advocate for a point release or benefit from a patch between now and then?)

Just a quick response to the above, I'll give the remainder more thought.

With _structured=false I am able to achieve functionally-equivalent (or at least from all on-disk appearances) behavior to pre-1.32. So that'll tide me over til 1.33. If there are important changes that you'd like checked out earlier, I'm set up to build from GitHub as well.

benharsh commented 11 months ago

Another potential option would be to have a required argument to binarySerializer() saying which style to use— @benharsh, since you're the IO framework expert, would such an approach (if we liked it from a UI standpoint) be annoying from the implementation/maintenance point of view

At the moment the difference between the "raw" and "structured" formats is small enough code-wise that it's not annoying from an implementation point of view. The greater that difference, the more challenging and error-prone it would be to maintain both formats within a single type. I suppose if that argument were a param value then we could use where clauses to mitigate those issues, but it's not clear that we should require making that decision at compile-time.

My current thinking is leaning towards two distinct types, which seem easier to reason about and work with despite the longer identifiers (which users can mitigate via import renaming or type aliases if necessary). For 1.33, a possible path could be to deprecate binary(De)Serializer in favor of two new distinct sets of types (e.g. rawBinarySerializer, structuredBinarySerializer). If we want to keep binary(De)Serializer then I think we should definitely standardize and document a structured argument.

mppf commented 11 months ago

Re naming, here are some ideas:

raw, simple, basic, minimal
structured, complete, full

kwaters4 commented 11 months ago

There may be some merit for something pertaining to a chapel format.

native and non-native prefixes.

benharsh commented 11 months ago

I think a fair amount of the discussion so far has concerned the nilable-byte preceding classes, but I wanted to check in on opinions regarding strings. With iokind.native the past behavior resulted in no leading bytes representing length, meaning they could not be read back in without some manual intervention on the user's part. I have been assuming that a "raw" binary serializer would also not include such leading bytes. Does that match everyone else's understanding?

Also, what would others like to have happen when trying to serialize or deserialize a string in this situation? I think there are a few possible ways to approach this:

Like iokind, either write the raw data, or read all of the file's remaining bytes into the string
Throw an error when trying to serialize or deserialize a string, directing users to the variants of readBinary/writeBinary that take a ``string
For simplicity, only throw when deserializing, but allow fileWriter.write to just serialize the raw string data.

Thoughts?

psath commented 11 months ago

Just getting a chance to catch up with the conversation

@kwaters4

There may be some merit for something pertaining to a chapel format.

native and non-native prefixes.

Do you mean native ~= chapel and non-native ~= C/Fortran/Unstructured? Or something else?

If the former, to me that causes a little confusion with the use of term native in the endianness context, which means "the machine's default". To me that would imply native ~= "lowest-level/no-hand-holding" i.e. (C/Fortran/Unstructured).

In which case I'd use the native and chapel monikers for today's _structured=false and =true, respectively.

Regarding pickle-like vs Chapel-structured: I don't really have a horse in that race. I'm going to continue with whatever the "raw" equivalent is for language interoperability.

I can definitely see the utility for new or Chapel-exclusive developers having a quick, easy, storage-dense binary-like format that reassembles without manual intervention and headers of a "raw" format. Don't do python so don't have an opinion on whether there should be two different structured options, or just one.

Regarding strings: I'm not currently doing any binary string IO, nor have I historically done much with any format other than ASCII (no UTF-16, etc.) in C-like languages, so I'll generally assume the existence of a 1-byte null terminator. Assuming we move things forward with separate rawBinSer and structBinSer I'd probably expect something like this:

Generally I'd assume anything from CTypes would behave like C (no leading bytes, presence of null-terminators, etc.)
read string via rawBinSer: No leading bytes. If no size is provided, read until first null terminator or EOF. If size is provided, read until the lesser of (size, first EOF), including and continuing beyond null terminators. In either case, update size with the actual number of characters read. Only throw unexpectedEOF if both a size is provided, and EOF is encountered before reading size bytes. If no size is provided, don't throw, even on empty strings.
write string via rawBinSer: Write the lesser of the string size or user-requested size without any leading descriptor. Only throw if the file cannot be extended for some filesystem / special-file reason. If the string contains null terminators, write them anyway, because the above can handle that case (assuming the user has specified size through some other mechanism).
read string via structBinSer: Assume the presence of leading size descriptor variable (and possibly a format descriptor, if Chapel ever supports beyond UTF-8). Throw something about "no string descriptor" if EOF is found before the descriptor. Read the minimum of (user provided size, on-disk size descriptor, EOF), not stopping at null terminators. Update the value of size with actual bytes read. Throw unexpectedEOF if an EOF is encountered before either user-provided or on-disk size. Do not throw if on-disk size descriptor is less than user-requested, they can/should check the returned size for this case.
write string via structBinSer: Write the minimum of the string's actual size, and the user-requested size as the on-disk descriptor. Then write that many bytes and update the size variable accordingly. Only throw if the file cannot be extended for some filesystem / special-file reason. If such a reason occurs, I would not expect the on-disk descriptor to be retroactively modified in between the throwing condition being met and control returning to the caller's catch. (Though it could be a nice hand-hold to do that and thus avoid unexpectedEOF on subsequent reads of that file, but would need to be documented so users know to expect that.)

benharsh commented 11 months ago

Historically using iokind has not written strings with a null terminator. Here's an example program:

use IO;

proc main() {
  var f = new file(1);
  var w = f.writer(iokind.native);
  var x = "hello";
  w.writeln(x);
}

With the following output when run with ./a.out | hexdump -C:

00000000  68 65 6c 6c 6f 0a                                 |hello.|
00000006

Reading strings with iokind by default ignored null terminators, and would read until the end of the file.

For complete iokind compatibility and increased user flexibility, we could continue forward with this behavior for the "raw" binary serializer in the case that there are users who do not want null bytes to terminate a string. For convenience, we could also add an optional argument to enable reading by null-terminating strings by default, e.g., new binarySerializer(nullTerminating=true).

I'll also add that this is for the default behavior of the read and write methods on fileReader and fileWriter, and the standard string type. We have other methods that can be more discerning about how much of a string to read in terms of size or a terminator.

bradcray commented 11 months ago

@psath : Thanks for the feedback here. I think a few tricky things about relying on NULL-termination as a default read behavior is that we should probably discuss handling of reads of bytes values in addition to string values, and those can have internal NULL bytes (so NULL-termination is not a good signifier for end-of-value). Even in the string case, relying on a NULL terminator makes me slightly uneasy since Chapel strings don't rely on NULL-termination for their in-memory format (currently, at least).

Both these facts make me more inclined to either support the naive behavior that binary iokind traditionally has, even though it's asymmetric and arguably naive (write string/bytes with no special size or EOF; read string/bytes through EOF). Or to simply throw up our hands and say that binary read/write of strings/bytes is not supported by the raw binary ser/deser type (or only read is not).

@benharsh : Paul has indicated that he isn't relying on binary read/write of strings. When you were working on updating tests and codes, did you find any other cases that were?

To Paul's point about CTypes, I could imagine having a special behavior for a read/write of c_ptr(c_char) to take NULL termination into account, but I know that people have balked at special treatment of that pointer type in other contexts for the assumption that reading/writing a char* would necessarily imply something string-y / NULL-terminated when nothing in C requires that to be the case other than convention and common usage.

benharsh commented 11 months ago

@benharsh : Paul has indicated that he isn't relying on binary read/write of strings. When you were working on updating tests and codes, did you find any other cases that were?

Sorry for the late response: There are some older CLBG benchmarks in the submitted directory that were relying on old iokind behavior to some degree. Otherwise the only codes I could find were artificial tests designed for binary IO and strings.

benharsh commented 10 months ago

After some internal discussion, we're planning on pursuing the following changes:

switch binarySerializer and binaryDeserializer to be "unstructured" by default
- Issue a compiler warning in 1.33 to indicate this behavior has changed, and add a config-param to disable the warning
introduce an unstable package module to provide the old "structured" behavior, so that we can make improvements over time
given that this is meant to be a purely "raw" binary (de)serializer, we will throw when trying to read string or bytes types since those types will not be readable in a useful way due to the lack of a length or terminator.
- writing such types will not throw an error
- we can add a nullTerminated=true optional argument in the future if desired
- context: we felt that the old "read to EOF" behavior from iokind was not useful in practice
- users can achieve the same behavior with existing methods on fileReader dedicated to binary IO if desired

bradcray commented 10 months ago

@benharsh : I think this can be closed now, thanks to https://github.com/chapel-lang/chapel/pull/23932 and friends. Does that seem right?