Open NHDaly opened 2 years ago
Hey @NHDaly, thanks for bringing this up. I went with Vector{UInt8} because there are use cases for bytes that don't involve any textual interpretation of the underlying data. For example, in our RAI_Solver protos, there is a Node message representing an element of a linked list, and each node has a bytes field called next containing the raw binary payload of the rest of the list.
I believe there is a semantic difference between strings and raw bytes, and Julia is not strictly neutral when it comes to Strings and Vector{UInt8}. For example, string indices must fall on valid character boundaries ("ěš"[1:2] fails); mutability also comes to mind. Whether Vector{UInt8} or String is easier to work with depends on the actual semantics of the underlying data, imo.
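To make the indexing and mutability point concrete, here is a minimal sketch using only Base Julia. The variable names are illustrative and not taken from the package under discussion.

```julia
s = "ěš"                  # two characters, each encoded as two UTF-8 bytes
bytes = Vector{UInt8}(s)  # a mutable copy of the four underlying bytes

# Byte index 1 starts a character; byte index 2 falls inside one.
starts_char = isvalid(s, 1)
inside_char = isvalid(s, 2)

# Slicing the String at a non-boundary index throws a StringIndexError.
slice_failed = try
    s[1:2]
    false
catch err
    err isa StringIndexError
end

# The byte vector can be sliced at any position and mutated in place.
first_two = bytes[1:2]
bytes[1] = 0x00
```

The same four bytes are awkward to slice as a String but trivial to slice as a Vector{UInt8}, which is the asymmetry described above.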
But to be honest, I primarily went with Vector{UInt8} based on intuition, not knowing that bytes were primarily designed as a wider string type for certain languages. I would be happier with the change if there were a way to signal whether the underlying data is text or not...
Any thoughts, @quinnj ?
I agree with keeping them separate. While a Julia String allows storing arbitrary data, as @Drvi mentioned, it's certainly not ergonomic, as almost all string functions expect properly encoded UTF-8 and will fail or behave unexpectedly on arbitrary data.
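As a small illustration of that point, the sketch below (Base Julia only, with made-up values) shows that a String can hold arbitrary bytes, including an embedded NUL, but that such data may no longer be valid UTF-8.

```julia
# An embedded NUL byte is fine: the String stores its length explicitly.
text = "hello\0world"
text_len = ncodeunits(text)   # number of bytes, including the NUL

# Arbitrary bytes can be stored too, but the result is not valid UTF-8.
raw = "\xff\x00\x01"
raw_is_utf8 = isvalid(raw)
text_is_utf8 = isvalid(text)

# The stored bytes are still available through codeunits.
raw_bytes = collect(codeunits(raw))
```

Storage is not the problem; the friction comes from functions that assume well-formed UTF-8.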
One primary use I see for keeping them separate is compressed values; the TranscodingStreams.jl framework has operations defined directly on Vector{UInt8}, but not String, to allow efficient in-memory compression/decompression.
It's also very efficient in Julia to call String(::Vector{UInt8}), as just the wrapper String object is created and the underlying data isn't copied.
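A minimal sketch of that conversion, using only Base Julia and illustrative names. One consequence worth knowing, per the Base documentation for String, is that the source vector is truncated to zero length when its buffer is taken over, so copy it first if it is still needed.

```julia
# Bytes for the ASCII text "hello".
bytes = UInt8[0x68, 0x65, 0x6c, 0x6c, 0x6f]

# Keep a copy, since the conversion below empties the original vector.
kept = copy(bytes)

# String(bytes) takes over the vector's buffer instead of copying it.
s = String(bytes)

# Going back the other way: codeunits(s) is a read-only byte view,
# while Vector{UInt8}(s) allocates a fresh, mutable copy.
view_of_s = codeunits(s)
copy_of_s = Vector{UInt8}(s)
```

So a caller who receives a bytes field as a Vector{UInt8} can get a String cheaply when the payload is known to be text.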
I think in C++ they're basically the same because C/C++ has never had a very strong notion of "strings" as distinct from char []/char *, i.e. byte arrays (since char is just a byte and not like Julia's Char).
I'm open to hearing other strong use-cases for interpreting bytes as String, but given the efficiency of String(bytes) and the use-cases for having bytes as Vector{UInt8}, I'm inclined to keep the distinction.
Thanks for the explanations. Yeah, that all makes sense. 🤔 okay yeah i think those are compelling arguments to keep things how they currently are. 👍
I was just surprised to see this distinction, since i've always thought of julia's strings as more like C++'s and less like Java's/Go's. But if you think it is a valuable distinction, then it's fine with me too. I think we can work with it either way. I just wanted to produce the most ergonomic code possible, so i wanted to check in about it with you.
I think that we should decode bytes fields into a String value in Julia, not a Vector{UInt8}.
I might be wrong about this, and we should discuss it, but I think so. For a quick TL;DR, this is what C++ does, and I think Julia is more like C++ than any of the other languages. (https://developers.google.com/protocol-buffers/docs/proto3#scalar)
Here's my understanding of the situation:
\0 or \1, or \xff.
"hello\0world")
strings (guaranteed to be UTF8 encoded) from bytes (arbitrary stream of bytes).
So I think that Julia strings are more similar to C++ strings than they are to Java's or the other languages'. I don't think that there's really any benefit to the user to load the data into a Vector{UInt8}, and it makes the bytes fields more cumbersome.
What do you think? 😊 Is there a reason that you decided to go with Vector{UInt8} here? What do you think about changing it?
CC: @bachdavi, @mbravenboer