crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0

Literals for Slice(UInt8) #2886

Open asterite opened 8 years ago

asterite commented 8 years ago

We need a way to express binary data embedded in the data section of the program. We can do this right now for strings, but there's no way to create a non-UTF8 string with a string literal.

There are several ways we can fix this:

  1. We can add back the \x... escape to string literals, to add a byte with a specific hexadecimal value. Right now strings can hold non-UTF8 data, they just raise when using those strings as UTF-8 data (for example, iterating them), so it's strange that they can hold non-UTF8 data but one can't create them with a literal. From there, one could take a slice. This will also solve #2565 because inspecting a string with non-valid codepoints will output \x... for those values.
  2. We add a literal for something like a Slice(UInt8). It could just be Slice(UInt8), but these are not read-only. Or maybe they can be read-only and they can crash the program when written. One shouldn't write them, the same way as one doesn't get a slice from a string literal and writes to it. There was the idea of introducing const [...] for this, with which we could create static data for any kind of integer value.
  3. Other options...?

This doesn't have a big priority right now, but I'm leaving it here so there's a place to discuss this.
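As a small illustration of option 1, here is a sketch of what the `\x` escape allows in Crystal versions that accept it in string literals (the escape was added back later in this thread). The byte values are arbitrary and chosen only for the example.

```crystal
# A string literal with \x escapes can carry bytes that are not valid UTF-8.
str = "PK\x03\x04\xFF"

puts str.bytesize        # number of raw bytes, not characters
puts str.valid_encoding? # false, because 0xFF can never appear in UTF-8

# Taking a slice exposes the raw bytes without copying them.
bytes = str.to_slice
puts bytes == Bytes[0x50, 0x4B, 0x03, 0x04, 0xFF]
```

The slice returned by `String#to_slice` is read-only, which is the property option 2 asks for in a dedicated literal.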

jhass commented 8 years ago

I would tend towards 2 with something like https://github.com/crystal-lang/crystal/pull/2791#issuecomment-225299043 as the preferred alternative. Either way we need to make sure to not run into issues similar to #2485.

ysbaddaden commented 8 years ago

Same here: Slice(UInt8) is the de-facto type for binary data whereas String may only contain UTF-8 data. I don't think it's a good idea to push the idea that it's okay to put arbitrary bytes into a String.

> inspecting a string with non-valid codepoints will output \x... for those values

But I like that.

asterite commented 8 years ago

One issue we found the other day is that we needed to do a POST in the http client with binary data. We made it work by simply creating a String with that data and then invoking HTTP::Client.post. I think I like that, it's pretty convenient. Otherwise we'd need to add overloads or restrictions for Slice(UInt8), and HTTP::Request will have the body as String | Slice(UInt8), etc.

To compare with other statically typed languages, Go's strings are also just byte chunks that can hold arbitrary bytes, but can also be treated as UTF-8 strings when needed: https://blog.golang.org/strings

Java's String class is supposed to be UTF-16, but can hold arbitrary bytes as well.

jhass commented 8 years ago

I very much like that String is supposed to handle valid UTF-8 data and operations on that. And nothing else. I would hate to lose that property and would rather have convenience APIs added to other interfaces for handling Slice(UInt8). In the HTTP example, binary data would need some form of content encoding to valid ASCII values anyway. Deciding to do that encoding automatically upon receiving a Slice(UInt8) rather than a String actually seems easier than always second-guessing whether a String needs it or not.

bcardiff commented 8 years ago

I am with @jhass here. I would keep String as valid UTF-8.

I would rather add overloads in the HTTP client to send/receive blobs. And I definitely want to be able to embed binary resources (Slice or StaticArray, and then a convenient API to wrap it).
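For reference, a sketch of what such an overload looks like from the caller's side. It assumes a recent standard library in which `HTTP::Request.new` accepts a `Bytes` body; the `/upload` path and the payload are made up for the example, and nothing is sent over the network.

```crystal
require "http/request"

# Arbitrary binary payload that is not valid UTF-8.
payload = Bytes[0xDE, 0xAD, 0xBE, 0xEF]

# Assumption: the body parameter accepts Bytes as well as String.
request = HTTP::Request.new("POST", "/upload", body: payload)

# Read the body back to confirm that the bytes survive untouched.
copy = IO::Memory.new
IO.copy(request.body.not_nil!, copy)

puts request.headers["Content-Length"]
puts copy.to_slice == payload
```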

mperham commented 8 years ago

Would it be helpful or more performant to have Base64.decode(str, io) : Nil so I can decode the asset and stream it out with the response?

asterite commented 8 years ago

@mperham Good idea, an overload that writes directly to an IO is missing. Should be easy to add.
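A sketch of the streaming variant discussed here, assuming a standard library version that provides the `Base64.decode(data, io)` overload. An `IO::Memory` stands in for the HTTP response, and the sample bytes are arbitrary.

```crystal
require "base64"

# Pretend this is an embedded, Base64-encoded asset.
original = Bytes[0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A]
encoded = Base64.strict_encode(original)

# Decode straight into an IO instead of allocating an intermediate Slice.
io = IO::Memory.new
Base64.decode(encoded, io)

puts io.to_slice == original
```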

mperham commented 8 years ago

Just a side note, I'm trying to write the Slice(UInt8) out to the Kemal response:

    def self.serve(filename, resp)
      resp.status_code = 200
      resp.write Base64.decode(WEB_ASSETS[filename])
    end

I verified that the Slice size is exactly the same size as the file on disk but the response only has about half the expected bytes. Anyone know why the server response is not writing the entire Slice to the client?

asterite commented 8 years ago

@mperham we'd probably need concrete code that reproduces the problem to check whether something is wrong. I tried creating slices of 5000 to 50000 bytes and it works well.

mperham commented 8 years ago

Looks like the problem is related to me not setting the content-type header. The browser prints out the PNG contents as text/html but serves it correctly when I set it to "application/octet-stream".

ozra commented 8 years ago

Just throwing thoughts in to the mixture here: How about a literal that generates a View(UInt8) which would be a read only type derived from Slice(UInt8)? If it's known at compile time an area is unwritable, we should be helped at compile time, avoiding a crash where possible.

david50407 commented 8 years ago

How about letting users create their own literal types (maybe in a %data{ ... } format, where data can be any word for each type, and {} can be [] or ()), like C++11 does?

Then we could create custom literals for Slice(UInt8), StaticArray(UInt8), or any other types we want (maybe using macros to define them at compile time?).

maxpowa commented 7 years ago

Any progress on this? The use case in my scenario is writing bytes to an IO, as one might do when using low-level packets on the wire. io.write_bytes(0x00000000, IO::ByteFormat::BigEndian) doesn't provide 4 empty bytes as one might expect, but rather outputs a single empty byte.

JacobUb commented 7 years ago

@maxpowa It works for me 😕

    io = IO::Memory.new
    io.write UInt8.slice(1, 1, 1, 1, 1, 1, 1, 1)
    io.rewind
    io.write_bytes(0x00000000, IO::ByteFormat::BigEndian)
    io.to_slice # Bytes[0, 0, 0, 0, 1, 1, 1, 1]

maxpowa commented 7 years ago

Yep, never mind, it is indeed working... I must have done something wrong when I was testing. Thanks @Exilor
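One possible source of this kind of confusion is that the number of bytes written by `write_bytes` follows the integer type of the argument, not the number of digits in the literal. A short sketch with arbitrary values:

```crystal
io = IO::Memory.new

# An unsuffixed literal is an Int32, so four bytes are written.
io.write_bytes(0x00000000, IO::ByteFormat::BigEndian)

# A UInt8 writes one byte, however many zeros the literal is padded with.
io.write_bytes(0x00000000_u8, IO::ByteFormat::BigEndian)

# A UInt16 writes two bytes, most significant byte first in big-endian order.
io.write_bytes(0x0102_u16, IO::ByteFormat::BigEndian)

puts io.to_slice.size
puts io.to_slice == Bytes[0, 0, 0, 0, 0, 1, 2]
```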

oprypin commented 7 years ago

> We can add back the \x... escape to string literals

This has been implemented in https://github.com/crystal-lang/crystal/commit/cd8296b88d7859b8f914a0d4bf55f7c5534c5b15, by the way.

I think it's a really bad idea to allow broken string literals in the language's core syntax. I noticed that some people are already doing hideous things with it, without really understanding the situation...
This should only be possible through an unsafe operation.

The alternative solution is the way to go. Bytes literals should definitely be a thing.

And "\xff" syntax should give an explicit error like "strings are for UTF-8 encoded text, not for arbitrary bytes".

Side note: in Python "\x**" means "\u{**}", but they do have bytes literals where it means what you'd expect: b'\xff'

bararchy commented 7 years ago

@oprypin but sometimes people need to do hideous things for hideous causes :) this feature is important, it's heavily relied on in fuzzers and exploit development (yes FFS using Crystal ! :) ) https://www.offensive-security.com/metasploit-unleashed/shell/

    def exploit
      connected = connect_login
      nopes = "\x90"*(payload_space-payload.encoded.length) # to be fixed with make_nops()
      sjump = "\xEB\xF9\x90\x90"     # Jmp Back
      njump = "\xE9\xDD\xD7\xFF\xFF" # And Back Again Baby  ;)
      evil = nopes + payload.encoded + njump + sjump + [target.ret].pack("A3")
      print_status("Sending payload")
      sploit = '0002 LIST () "/' + evil + '" "PWNED"' + "\r\n"
      sock.put(sploit)
      handler
      disconnect
    end

etc....

RX14 commented 7 years ago

It's still easy enough to construct a string with invalid data, I just don't think it should be part of the syntax.
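A minimal sketch of that point: without any literal syntax, `String.new(bytes)` already produces a string with invalid data. The byte values are arbitrary.

```crystal
# 0x90 is a stray continuation byte and 0xEB 0xF9 is a broken multibyte sequence.
raw = Bytes[0x90, 0xEB, 0xF9]

str = String.new(raw)
puts str.bytesize        # the bytes are kept as-is
puts str.valid_encoding? # false

# String#scrub replaces invalid sequences, giving valid UTF-8 again.
puts str.scrub.valid_encoding?
```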

oprypin commented 7 years ago

@bararchy, thanks for a good demonstration of the point I was making... All of these should have been Bytes
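To make the comparison concrete, here is a sketch of the same kind of byte-sequence assembly done with Bytes instead of strings. The variable names and the filler length of 8 are assumptions for illustration only, and the fragments are concatenated through an `IO::Memory`.

```crystal
# Filler: eight copies of the same byte value.
filler = Bytes.new(8, 0x90_u8)

# Fixed byte sequences written with the Bytes[] macro instead of "\x.." strings.
tail_a = Bytes[0xEB, 0xF9, 0x90, 0x90]
tail_b = Bytes[0xE9, 0xDD, 0xD7, 0xFF, 0xFF]

# Concatenate the pieces by writing them to an in-memory IO.
io = IO::Memory.new
io.write(filler)
io.write(tail_b)
io.write(tail_a)

data = io.to_slice
puts data.size
puts data.hexstring
```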

oprypin commented 7 years ago

I forgot that this issue existed and just started writing a new one. Anyway... I'm just still appalled that there's a literal for invalid strings.

So, ping

Putting bytes literals in read-only data is a must-have, and so if the literal produces a writable Slice(UInt8), that's a problem. Or it used to be, not anymore! Now we even have read-only slices. So there are really no blockers now.
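A short sketch of how read-only slices behave today, using the slice of a string literal as the read-only source:

```crystal
# String#to_slice returns a read-only view of the string's bytes.
ro = "abc".to_slice
puts ro.read_only?

# Writing through a read-only slice raises instead of touching the memory.
write_failed = false
begin
  ro[0] = 0x41_u8
rescue
  write_failed = true
end
puts write_failed

# To modify the data, copy it into a writable slice first.
writable = Bytes.new(ro.size)
writable.copy_from(ro)
writable[0] = 0x41_u8
puts String.new(writable)
```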

asterite commented 7 years ago

Right now this is solved because one can use a String for this, because a String can now have arbitrary bytes.

I know it's not the most elegant solution, but for now it works. We can postpone a real solution for this for later.

oprypin commented 7 years ago

pls

asterite commented 7 years ago

What if we add:

b"some content"

For now that would be equivalent to:

"some content".to_slice

and of course you can use \xAB for specific byte values.

We could also have:

b'x'

to be the same as 'x'.ord.to_u8 and not have it compile if it doesn't fit in a UInt8, so that would be a byte literal.

I think Rust uses the same notation.
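Until such literals exist, the equivalent expressions can be written out by hand. The sketch below uses a hypothetical helper, `byte_for`, to mimic at runtime the range check that the proposed b'x' would perform at compile time.

```crystal
# Today's spelling of the proposed b"some content\xAB".
content = "some content\xAB".to_slice
puts content.size

# Hypothetical helper: returns the byte for a char, or nil if it does not fit in a UInt8.
def byte_for(char : Char) : UInt8?
  code = char.ord
  code <= 0xFF ? code.to_u8 : nil
end

puts byte_for('x').inspect # 'x' is code point 120
puts byte_for('σ').inspect # U+03C3 does not fit in one byte
```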

oprypin commented 7 years ago

My suggestion that I started to write:


It would be a literal that does not allow \u escapes, and allows only ASCII characters, supplemented by the \xff syntax for arbitrary bytes. The literal would produce Slice(UInt8).

I propose the syntax b"foo\x12fsdfg", like in Python and Rust.

Side note, Bytes[] macro probably should be rewritten to produce a literal.

I would also suggest removing the hexadecimal notation from strings. Obviously, to replace the use case, the bytes literal would need to store the data in the read-only data section. I don't know whether that means that the size of the slice would need to be moved there as well, like it is with strings.


"some content".to_slice is impossible to do if hexadecimal escapes are removed from strings, which is the main problem I have

asterite commented 7 years ago

Oh, with "some content".to_slice I meant it would be equivalent to that. We could probably type b"hello" as a read-only Slice(UInt8) and put that in the read-only data section of the program.

For that we'll probably need Slice to be part of the known types for the compiler, and have @pointer, @size and @read_only laid out accordingly in memory.

But for now I'd leave the ability to have \x.. escape sequences in a String. Later we can remove them, but we'll have to make sure that there's no way to create strings that are not valid in UTF-8. Maybe that will slow down everything a bit, but, well, correct code is better than fast code.

oprypin commented 7 years ago

@pointer, @size and @read_only could be directly followed by the data itself, with @pointer being equal to its own address + offset.

I don't think it's that important to prevent strings that are not valid UTF-8. The only way to create them is String.new(bytes or pointer), just raise the awareness. The problem is that people see nothing wrong with '\x' string literals and then intentionally seek out a way to recreate such strings "programmatically".

RX14 commented 7 years ago

@asterite We wouldn't need to know about Slice's internal layout, unlike String, because we can simply define that Slice needs to have a constructor taking a pointer and a size (which we already have). We only need to know String's layout because we put the data contiguously. With Slice we don't need to do that. And it's probably not worth doing it since it's a struct and LLVM will optimize since both the constructor arguments are literals.

oprypin commented 7 years ago

@RX14, are you sure you understand the part about putting this in read-only data section?

RX14 commented 7 years ago

@oprypin yes.... you pass a pointer to the data in the RO section to the slice constructor. The slice instance itself has to live on the stack anyway, so can't be in ROdata.
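The constructor in question can be sketched as follows. A string literal's buffer stands in for the data the compiler would emit; this is an illustration of the pointer-plus-size constructor, not of how a future literal would actually be lowered.

```crystal
literal = "hello"

# Build a read-only slice from a raw pointer and a size.
# This constructor is unsafe: the caller must guarantee the pointer and size are valid.
view = Slice.new(literal.to_unsafe, literal.bytesize, read_only: true)

puts view.read_only?
puts view == Bytes[104, 101, 108, 108, 111]
```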

oprypin commented 6 years ago

@RX14 Please reopen this

straight-shoota commented 6 years ago

Why was this even closed and all those other issues which are most definitely not fixed?

HertzDevil commented 2 years ago

I would go one step further and use a completely new syntax similar to Elixir's bitstrings, rather than simply borrowing the one for string literals:

<<0x12>>             # => <<0x12>>
<<0x21>>             # => "!"
<<0xCF, 0x83>>       # => "σ"
"\xCF\x83"           # => "σ"
<<0x12, 0xCF, 0x83>> # => <<0x12, 0xCF, 0x83>>
"\x12\xCF\x83"       # => <<0x12, 0xCF, 0x83>>
<<0x12, "σ">>        # => <<0x12, 0xCF, 0x83>>

(Every double-quoted string literal in Elixir denotes a bitstring. Single-quoted ones produce charlists.)

An attractive feature about them is they can handle multibyte sequences:

<<0x12345678::32>>        # => <<18, 52, 86, 120>>
<<0x12345678::32-little>> # => <<120, 86, 52, 18>>

<<1.0::little>>    # => <<0, 0, 0, 0, 0, 0, 240, 63>>
<<1.0::32-little>> # => <<0, 0, 128, 63>>

<<0xCF83::16>>        # => "σ"
<<0x83CF::16-little>> # => "σ"
<<"σ"::utf8>>         # => "σ"
<<"σ"::utf16-little>> # => <<195, 3>>
<<0x03C3::utf8>>      # => "σ"
<<0x03C3::utf16-big>> # => <<3, 195>>

It emphasizes the fact that byte arrays are a more general concept than string-like byte sequences.
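For comparison, the same byte sequences can be produced in Crystal today at runtime with `IO::ByteFormat`. This is only a sketch; `encode_bytes` is a hypothetical helper, not part of the standard library.

```crystal
# Hypothetical helper: serialize one number with the given byte order.
def encode_bytes(value, format : IO::ByteFormat) : Bytes
  io = IO::Memory.new
  io.write_bytes(value, format)
  io.to_slice
end

big = encode_bytes(0x12345678, IO::ByteFormat::BigEndian)
little = encode_bytes(0x12345678, IO::ByteFormat::LittleEndian)
f64 = encode_bytes(1.0, IO::ByteFormat::LittleEndian)
f32 = encode_bytes(1.0_f32, IO::ByteFormat::LittleEndian)

puts big    # same bytes as <<0x12345678::32>>
puts little # same bytes as <<0x12345678::32-little>>
puts f64    # same bytes as <<1.0::little>>
puts f32    # same bytes as <<1.0::32-little>>

# Text encodings: UTF-8 bytes and UTF-16 code units of "σ".
utf8 = "σ".to_slice
utf16 = "σ".to_utf16
puts utf8
puts utf16
```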

It is important that both the Bytes itself and the data it refers to are stored in read-only memory; the Slice constructor that accepts a pointer is unsafe, so the data must be encapsulated behind a read-only Bytes, with no other way to access it.

If we have an extremely fast String#valid_encoding?, say even faster than #each_char(&), then the performance penalties should be very minimal. So as a starter I think we should incorporate one of the algorithms in #11873. (In fact, the standard library has never used that method since its introduction.)

philipp-kempgen commented 7 months ago

Just to throw yet another idea in: Ruby has the .b method for strings. https://docs.ruby-lang.org/en/3.2/String.html#method-i-b Maybe "bytestring\x00\x01".b could be treated as a byteslice literal in Crystal? (I prefer to say "a ByteSlice" rather than "a Bytes".)

Ruby also has ?… for character literals (or rather single-character strings), even supporting control characters. https://docs.ruby-lang.org/en/3.2/syntax/literals_rdoc.html#label-Strings

?\C-g == ?\a  # => true

Then again, b"…" and b'…' or Elixir bitstrings are probably better, if they could maybe use b(…) or b[…] or %b(…) instead of <<…>>, provided they let you write things like:

b( "filemagic", 0x01, 0x02, '\a', '\C-g' )
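A runtime approximation of that last form can be sketched as a plain method. This `b` helper is hypothetical and allocates at runtime, so it is not a literal and gives none of the read-only-data benefits discussed above. Crystal also has no '\C-g' escape, so only '\a' is shown.

```crystal
# Hypothetical helper: concatenate strings, integers and chars into one Bytes.
def b(*parts : String | Int32 | Char) : Bytes
  io = IO::Memory.new
  parts.each do |part|
    case part
    in String then io.write(part.to_slice)
    in Int32  then io.write_byte(part.to_u8)     # raises if the value does not fit in a byte
    in Char   then io.write_byte(part.ord.to_u8) # raises for chars above 0xFF
    end
  end
  io.to_slice
end

header = b("filemagic", 0x01, 0x02, '\a')
puts header.size
puts header.hexstring
```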