crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0

CharLiteral#ord / binary char syntax #9830

Closed kimburgess closed 1 year ago

kimburgess commented 4 years ago

Currently (0.35.1) there are some inconsistencies and areas of the language where it is not possible to access Char codepoints ~at compile time~ without external / manual conversion.

# Works
A = 'A'.ord

# Does not
enum Example
  A = 'A'.ord
end

# Also does not
B = {{'B'.ord}}

Some further info on the use case for this is explained on the forums.

While there may be some value in addressing the enum inconsistency, introducing CharLiteral#ord is only one approach.

From some issue digging, it looks like there has been some brief, previous discussion on a b'a' style syntax to allow expressing UInt8s as their char equivalent. This seems like the elegant approach for general use.

Is this something of interest, and if so, is there a preferred approach between these two options? Happy to look at implementation, but would be good to discuss design (including if this shouldn't be done at all) before doing so.

asterite commented 4 years ago

Just a note: A = 'A'.ord is not doing that at compile-time.

Constants are runtime values.
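
To make the distinction concrete, here is a minimal sketch that compiles today. The hex literal in the enum and the runtime guard are illustrative assumptions, not something proposed in this thread: the constant is initialized by ordinary runtime code, while the enum member has to be written as a number literal.

```crystal
# Works today: the constant is initialized by ordinary runtime code.
A = 'A'.ord

# Enum member values are evaluated by the compiler itself, so the
# codepoint is written out by hand as a number literal.
enum Example
  A = 0x41 # 'A'
end

# Hypothetical guard (an assumption for illustration): a runtime check that
# keeps the hand-written number in sync with the character it stands for.
raise "Example::A is out of sync" unless Example::A.value == 'A'.ord

puts A                # 65
puts Example::A.value # 65
```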

kimburgess commented 4 years ago

Looking at the current form of the compiler, the cause of the enum inconsistency is that expressions there are evaluated by Crystal::MathInterpreter. Adding support for evaluating a Char#ord call upstream of that would likely be extremely hacky, error-prone, or both.

Some good points were also raised on the forum about the expansion of macros increasingly leading to two separate languages that need to be maintained, so it seems like adding a CharLiteral#ord is worth avoiding.

With the above in mind, expanding the lexer to support a 'char as codepoint' syntax looks to be a good option if this is something of interest.

Q's...

  1. Is this something of interest?
  2. If so, is there a preferred syntax?

On syntax, some options:

b'a' as previously suggested. This mirrors some other languages, namely Rust's byte literals. Worth noting that there is also the concept of a byte string literal, which would also map neatly to Crystal's Bytes.

0ca which extends the existing syntax for expressing integers as binary, octal and hexadecimal. This would also be a quick implementation thanks to the existing scan_zero_number.
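
For comparison, this sketch shows the number-literal prefixes that already exist, each spelling the same value as 'a'.ord. Neither b'a' nor 0ca is valid Crystal; they appear in the comments only as the proposed spellings.

```crystal
# Existing literal prefixes, all denoting the codepoint of 'a' (97).
puts 0x61 == 'a'.ord        # hexadecimal => true
puts 0o141 == 'a'.ord       # octal       => true
puts 0b0110_0001 == 'a'.ord # binary      => true

# A type suffix selects the integer type, as a byte-style literal would.
puts 0x61_u8 == 'a'.ord     # => true

# Proposed spellings from this thread (NOT valid Crystal): b'a', 0ca
```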

sol-vin commented 4 years ago

@KimBurgess I love the syntax ideas; I think both b'a' and 0ca could be of great value. I can also see b'a' working well with Unicode since we can put it into a Bytes instead of just a single integer. I'm not sure if 0ca could be supported with Unicode though, could be an issue with that specific form. Maybe instead 0c'a' to keep the single quote syntax for chars?

kimburgess commented 4 years ago

I'm not sure if supporting Unicode chars -> Bytes would be the best fit for the b'a' syntax as this would introduce ambiguity for the output kind. It could however be used to provide the codepoint as an appropriate unsigned integer type, mirroring the behaviour of the existing binary, hex and octal number literals.

b'a' == 0x61

b'◆' == 0x25c6

I do however really like the 0c'a' syntax as this is a much neater match for the existing ways of expressing number literals.
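
The codepoint-versus-bytes ambiguity can be seen with methods that exist today. This is only a sketch of the two possible readings of a non-ASCII char, not of any proposed literal syntax.

```crystal
diamond = '◆'

# Reading 1: the codepoint as a single integer.
puts diamond.ord == 0x25c6 # => true
puts diamond.ord.to_s(16)  # => 25c6

# Reading 2: the UTF-8 encoding, which takes three bytes for this char.
puts diamond.bytesize      # => 3
p diamond.bytes            # => [226, 151, 134]

# For ASCII chars the two readings coincide.
p 'a'.bytes                # => [97]
```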

straight-shoota commented 3 years ago

I wouldn't add an additional literal syntax for this. It's too much of a niche use case to justify that. Most developers would likely be unfamiliar with the syntax because it's so rare.

CharLiteral#ord seems like a practical solution, though. It's a little addition to the macro language, but it's well defined and reasonably useful, even for other problems.

Andriamanitra commented 1 year ago

CharLiteral#ord would be useful to me in an interpreter for a brainfuck-style language where every instruction is a single byte. Currently I have to define operations like this, which is IMO rather unclear (and there's no guarantee that the numbers and comments match, which can lead to hard-to-find bugs):

Inc   = 43_u8 # '+'
Dec   = 45_u8 # '-'
Print = 46_u8 # '.'

I would like to be able to define the same operations like this instead:

Inc   = '+'.ord.to_u8
Dec   = '-'.ord.to_u8
Print = '.'.ord.to_u8

...and have the Char to UInt8 conversion happen at compile time so compilation fails if I accidentally use a multi-byte character.
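
As a runtime stand-in for that check, one possible sketch is a small helper that rejects non-ASCII characters. The name `byte` is hypothetical, and the failure happens when the constant is initialized at runtime rather than at compile time, so it only approximates what is being asked for here.

```crystal
# Hypothetical helper (not part of Crystal's standard library): converts a
# single-byte character to UInt8 and raises for anything outside ASCII.
def byte(c : Char) : UInt8
  raise ArgumentError.new("#{c.inspect} is not a single-byte character") unless c.ascii?
  c.ord.to_u8
end

Inc   = byte('+')
Dec   = byte('-')
Print = byte('.')

puts Inc   # 43
puts Dec   # 45
puts Print # 46
```

Calling `byte('◆')` would raise an ArgumentError instead of silently truncating the codepoint.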

soya-daizu commented 1 year ago

Another use case I ran into recently is generating enum values for V4L2 bindings (postmodern/v4l2.cr), where each enum value is essentially an integer combining the character codes of four chars:

enum Linux::V4L2PixFmt : Linux::U32
  RGB332  = v4l2_fourcc('R', 'G', 'B', '1') #  8  RGB-3-3-2
  RGB444  = v4l2_fourcc('R', '4', '4', '4') # 16  xxxxrrrr ggggbbbb
  ARGB444 = v4l2_fourcc('A', 'R', '1', '2') # 16  aaaarrrr ggggbbbb
  XRGB444 = v4l2_fourcc('X', 'R', '1', '2') # 16  xxxxrrrr ggggbbbb
  RGBA444 = v4l2_fourcc('R', 'A', '1', '2') # 16  rrrrgggg bbbbaaaa
  RGBX444 = v4l2_fourcc('R', 'X', '1', '2') # 16  rrrrgggg bbbbxxxx
  ABGR444 = v4l2_fourcc('A', 'B', '1', '2') # 16  aaaabbbb ggggrrrr
  XBGR444 = v4l2_fourcc('X', 'B', '1', '2') # 16  xxxxbbbb ggggrrrr
  # ...
end

macro v4l2_fourcc(a, b, c, d)
  {% 
    # HACK: because CharLiteral#ord doesn't exist yet
    ascii_table = {
      ' ' => 32_u32,
      '0' => 48_u32,
      '1' => 49_u32,
      '2' => 50_u32,
      # ...
      'A' => 65_u32,
      'B' => 66_u32,
      'C' => 67_u32,
      # ...
      'X' => 88_u32,
      'Y' => 89_u32,
      'Z' => 90_u32,
      # ...
    } 
  %}

  {% begin %}
  {{ (ascii_table[a] | (ascii_table[b] << 8) | (ascii_table[c] << 16) | (ascii_table[d] << 24)) }}
  {% end %}
end
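
The packing arithmetic the macro performs can be sketched as an ordinary runtime method using Char#ord. The name `fourcc` is hypothetical, and a method like this cannot be called from an enum body, which is why the macro above needs its lookup table. It can still be handy for checking a generated value by hand.

```crystal
# Hypothetical runtime counterpart of the v4l2_fourcc macro: packs four
# characters into a UInt32, first character in the lowest byte.
def fourcc(a : Char, b : Char, c : Char, d : Char) : UInt32
  a.ord.to_u32 | (b.ord.to_u32 << 8) | (c.ord.to_u32 << 16) | (d.ord.to_u32 << 24)
end

# 'R' = 0x52, 'G' = 0x47, 'B' = 0x42, '1' = 0x31
puts fourcc('R', 'G', 'B', '1').to_s(16) # => 31424752
```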