Should `bytes` be encoded as a `String` in julia, rather than a `Vector{UInt8}`?

I think that we should decode bytes fields into a String value in julia, not a Vector{UInt8}.

I might be wrong about this, and we should discuss it, but I think so. For a quick TL;DR, this is what C++ does, and I think julia is more like C++ than any of the other languages. (https://developers.google.com/protocol-buffers/docs/proto3#scalar)

Here's my understanding of the situation:

Some languages, like Java, Go, and others, have an encoding requirement on their string type, such that it is illegal to store non-unicode values in the string. This means that your string cannot contain arbitrary byte values, such as \0 or \1, or \xff.
Julia and C++, on the other hand, have no such restriction. In julia and C++, a string is just a series of bytes. You as the user might know that a string value contains only unicode values, in which case it's safe to interpret it as such. But the language allows you to store arbitrary bytes in the string. (e.g. "hello\0world")
Since julia strings might contain arbitrary bytes, the julia string functions can all work with such data just fine. So e.g. displaying this string that was plucked from an image file is no issue:
```
julia> read("/Users/nathandaly/Downloads/whale-pod-corey-ford.jpg", String)[1:10]
"\xff\xd8\xff\xe0\0\x10JFIF"
```
My understanding is that, in order to accommodate these other languages, the Proto spec also distinguishes strings (guaranteed to be UTF-8 encoded) from bytes (an arbitrary stream of bytes).

But since C++ strings are arbitrary streams of bytes, in the C++ code both types decode into the same C++ type (https://developers.google.com/protocol-buffers/docs/proto3#scalar):

| .proto type | Notes | C++ type | Java type |
| --- | --- | --- | --- |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text, and cannot be longer than 2^32. | string | String |
| bytes | May contain any arbitrary sequence of bytes no longer than 2^32. | string | ByteString |

So I think that Julia strings are more similar to C++ strings than they are to Java's or the other languages' strings. I don't think that there's really any benefit to the user in loading the data into a Vector{UInt8}, and it makes the bytes fields more cumbersome.

What do you think? 😊 Is there a reason that you decided to go with Vector{UInt8} here? What do you think about changing it?

CC: @bachdavi, @mbravenboer

Hey @NHDaly, thanks for bringing this up. I went with Vector{UInt8} because there are use cases for bytes that don't involve any textual interpretation of the underlying data -- for example, in our RAI_Solver protos, there is a Node message representing an element of a linked list and each node has a bytes field called next containing the raw binary payload of the rest of the list.

I believe there is a semantic difference between strings and raw bytes, and Julia is not strictly neutral when it comes to Strings and Vector{UInt8}: e.g. indexing into a string must always land on a valid character boundary (`"ěš"[1:2]` fails), and mutability also comes to mind. It depends on the actual semantics of the underlying data whether Vector{UInt8} or String is easier to work with, imo.
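
A minimal sketch of that indexing difference, assuming a recent Julia 1.x session (the variable names are only for illustration): string indices are byte offsets that must land on character boundaries, while a Vector{UInt8} can be sliced anywhere.

```julia
s = "ěš"                      # two characters, each 2 bytes in UTF-8

ncodeunits(s)                 # 4 bytes in total
isvalid(s, 1), isvalid(s, 2)  # index 1 starts a character, index 2 does not

# Slicing the String at a non-boundary index throws a StringIndexError.
threw = try
    s[1:2]
    false
catch err
    err isa StringIndexError
end

# The same bytes as a Vector{UInt8} (this conversion copies) slice freely.
bytes = Vector{UInt8}(s)
first_char_bytes = bytes[1:2]  # the UTF-8 encoding of 'ě'
```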

But to be honest, I primarily went with Vector{UInt8} based on intuition, not knowing that bytes were primarily designed as a wider string type for certain languages. I would be happier with the change if there was a way to signal whether the underlying data is text or not...

Any thoughts, @quinnj ?

I agree with keeping them separate. While Julia's String allows storing arbitrary data, as @Drvi mentioned, it's certainly not ergonomic, as almost all string functions expect properly encoded UTF-8 and will fail or behave in unexpected ways on arbitrary data.
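
As a rough illustration of that point (a sketch only, reusing the JPEG header bytes from earlier in the thread), a String will hold the data, but it is not valid UTF-8 and byte-level work ends up going through `codeunits` anyway:

```julia
raw = "\xff\xd8\xff\xe0\0\x10JFIF"   # JPEG header bytes stored in a String

valid = isvalid(raw)     # false: the data is not valid UTF-8
nbytes = ncodeunits(raw) # all 10 bytes are stored regardless

c = raw[1]               # a Char wrapping the malformed byte 0xff
char_valid = isvalid(c)  # false: not a valid Unicode character

# Byte-level access goes through codeunits, a read-only view (no copy).
cu = codeunits(raw)
marker = cu[1:4]         # the leading marker bytes as UInt8 values
```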

One primary use I see for keeping them separate is compressed values; the TranscodingStreams.jl framework has operations defined directly on Vector{UInt8}, but not on String, to allow efficient in-memory compression/decompression.

It's also very efficient in Julia to call String(::Vector{UInt8}) as just the wrapper String object is created and the underlying data isn't copied.
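
A small sketch of that conversion in both directions, using only Base functions; the `payload` values are made up for illustration. One caveat worth knowing: because `String(v)` takes over the buffer instead of copying, the Base docs state that a Vector{UInt8} argument is emptied afterwards.

```julia
payload = UInt8[0x68, 0x65, 0x6c, 0x6c, 0x6f]   # hypothetical bytes field contents

# String(v) takes ownership of the buffer rather than copying it;
# per the Base docs, a Vector{UInt8} argument is emptied afterwards.
text = String(payload)

# Pass a copy when the byte vector is still needed afterwards.
payload2 = UInt8[0x68, 0x69]
text2 = String(copy(payload2))

# Going the other way: codeunits is a no-copy read-only view,
# Vector{UInt8}(s) makes an independent mutable copy.
view_bytes = codeunits(text)
owned_bytes = Vector{UInt8}(text)
```

So a caller who knows a given bytes field holds text can opt into a String cheaply, while callers with binary payloads keep the Vector{UInt8}.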

I think in C++ they're basically the same because C/C++ has never had a very strong notion of "strings" as distinct from char []/char *, i.e. byte arrays (since char is just a byte and not like Julia's Char).

I'm open to hearing other strong use-cases for interpreting bytes as String, but given the efficiency of String(bytes) and the use-cases for having bytes as Vector{UInt8}, I'm inclined to keep the distinction.

Thanks for the explanations. Yeah, that all makes sense. 🤔 Okay, yeah, I think those are compelling arguments to keep things how they currently are. 👍

I was just surprised to see this distinction, since I've always thought of julia's strings as more like C++'s and less like Java's/Go's. But if you think it is a valuable distinction, then it's fine with me too. I think we can work with it either way. I just wanted to produce the most ergonomic code possible, so I wanted to check in about it with you.

Drvi / ProtocolBuffers.jl

Should `bytes` be encoded as a `String` in julia, rather than a `Vector{UInt8}`? #14