Drvi / ProtocolBuffers.jl


Should `bytes` be encoded as a `String` in julia, rather than a `Vector{UInt8}`? #14

Open NHDaly opened 2 years ago

NHDaly commented 2 years ago

I think that we should decode bytes fields into a String value in julia, not a Vector{UInt8}.

I might be wrong about this, and we should discuss it, but I think so. For a quick TL;DR, this is what C++ does, and I think julia is more like C++ than any of the other languages. (https://developers.google.com/protocol-buffers/docs/proto3#scalar)

Here's my understanding of the situation:

So I think that Julia strings are more similar to C++ strings than they are to strings in Java or the other languages. I don't think that loading the data into a Vector{UInt8} gives the user any real benefit, and it makes the bytes fields more cumbersome to work with.

What do you think? 😊 Is there a reason that you decided to go with Vector{UInt8} here? How would you feel about changing it?

CC: @bachdavi, @mbravenboer

Drvi commented 2 years ago

Hey @NHDaly, thanks for bringing this up. I went with Vector{UInt8} because there are use cases for bytes that don't involve any textual interpretation of the underlying data -- for example, in our RAI_Solver protos, there is a Node message representing an element of a linked list and each node has a bytes field called next containing the raw binary payload of the rest of the list.

I believe there is a semantic difference between strings and raw bytes, and Julia is not strictly neutral when it comes to String and Vector{UInt8}: e.g. indexing into a string must always land on a valid character boundary ("ěš"[1:2] fails), and mutability also comes to mind. Whether Vector{UInt8} or String is easier to work with depends on the actual semantics of the underlying data, imo.
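To make the indexing point concrete, here is a minimal sketch using only Base Julia. The variable names are just for illustration; the idea is that the same four bytes can be sliced freely as a Vector{UInt8}, while slicing the String at a byte offset inside a character throws.

```julia
s = "ěš"                       # two characters, each 2 bytes in UTF-8
bytes = Vector{UInt8}(s)       # independent copy of the underlying bytes

println(length(s))             # number of characters
println(ncodeunits(s))         # number of bytes
println(bytes[1:2])            # slicing a byte vector works at any offset

# Index 2 falls in the middle of "ě", so slicing the String throws.
slice_failed = try
    s[1:2]
    false
catch err
    err isa StringIndexError
end
println(slice_failed)
```

The byte vector is also mutable (bytes[1] = 0x00 is allowed), whereas a String is immutable.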

But to be honest, I primarily went with Vector{UInt8} based on intuition, not knowing that bytes were primarily designed as a wider string type for certain languages. I would be happier with the change if there was a way to signal whether the underlying data is text or not...

Any thoughts, @quinnj ?

quinnj commented 2 years ago

I agree with keeping them separate. While a Julia String allows storing arbitrary data, as @Drvi mentioned, it's certainly not ergonomic, as almost all string functions expect properly encoded UTF-8 and will fail or behave in unexpected ways on arbitrary data.
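As a small illustration of that point (a sketch with made-up payload bytes, Base Julia only): arbitrary binary data can be stored in a String and round-trips unchanged, but the result is not valid UTF-8, so text-oriented code has to be careful with it.

```julia
# Hypothetical binary payload; 0xff and 0xfe can never appear in valid UTF-8.
raw = UInt8[0xff, 0xfe, 0x00, 0x41]

s = String(copy(raw))          # copy so that `raw` stays usable afterwards

println(isvalid(s))            # is the content well-formed UTF-8?
println(ncodeunits(s))         # the byte count is preserved
println(Vector{UInt8}(s) == raw)  # the bytes round-trip unchanged
```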

One primary use I see for keeping them separate is compressed values; the TranscodingStreams.jl framework has operations defined directly on Vector{UInt8}, but not on String, to allow efficient in-memory compression/decompression.

It's also very efficient in Julia to call String(::Vector{UInt8}) as just the wrapper String object is created and the underlying data isn't copied.
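A minimal sketch of that conversion, assuming a bytes field that happens to hold text (the variable names are illustrative). One thing to keep in mind is that String(v) takes ownership of the vector and leaves it empty, so pass a copy if the bytes are still needed afterwards.

```julia
payload = UInt8[0x68, 0x65, 0x6c, 0x6c, 0x6f]   # the bytes of "hello"

# Keep `payload` intact by converting a copy.
text = String(copy(payload))
println(text)
println(length(payload))

# Converting the vector itself hands its contents to the String and
# truncates the vector to zero length.
owned = String(payload)
println(owned)
println(isempty(payload))

# Going the other way, codeunits gives a read-only byte view of a String.
view = codeunits(owned)
println(view[1] == 0x68)
```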

I think in C++ they're basically the same because C/C++ has never had a very strong notion of "strings" as distinct from char []/char *, i.e. byte arrays (since char is just a byte and not like Julia's Char).

I'm open to hearing other strong use cases for interpreting bytes as String, but given the efficiency of String(bytes) and the use cases for having bytes as Vector{UInt8}, I'm inclined to keep the distinction.

NHDaly commented 2 years ago

Thanks for the explanations. Yeah, that all makes sense. 🤔 Okay, yeah, I think those are compelling arguments to keep things how they currently are. 👍

I was just surprised to see this distinction, since I've always thought of Julia's strings as more like C++'s and less like Java's/Go's. But if you think it is a valuable distinction, then it's fine with me too. I think we can work with it either way. I just wanted to produce the most ergonomic code possible, so I wanted to check in about it with you.