JuliaHubOSS / llvm-cbe

resurrected LLVM "C Backend", with improvements

Non-null-terminated strings end up pretty ugly #124

Closed hikari-no-yume closed 3 years ago

hikari-no-yume commented 3 years ago

I noticed when trying to compile some Rust-generated LLVM IR to C that non-null-terminated strings, which at the LLVM IR level are just arrays of i8, end up pretty gnarly compared to their null-terminated counterparts. For example:

@alloc73 = private unnamed_addr constant <{ [37 x i8] }> <{ [37 x i8] c"Unrecognised argument, expected o/x/_" }>, align 1
@alloc78 = private unnamed_addr constant <{ [12 x i8] }> <{ [12 x i8] c"Cross wins!\00" }>, align 1

Type definitions:

```c
struct l_array_37_uint8_t {
  uint8_t array[37];
};
#ifdef _MSC_VER
#pragma pack(push, 1)
#endif
struct l_unnamed_7 {
  struct l_array_37_uint8_t field0;
} __attribute__ ((packed));
#ifdef _MSC_VER
#pragma pack(pop)
#endif
struct l_array_12_uint8_t {
  uint8_t array[12];
};
#ifdef _MSC_VER
#pragma pack(push, 1)
#endif
struct l_unnamed_8 {
  struct l_array_12_uint8_t field0;
} __attribute__ ((packed));
#ifdef _MSC_VER
#pragma pack(pop)
#endif
```
static struct l_unnamed_7 alloc73 = { { { 85u, 110u, 114u, 101u, 99u, 111u, 103u, 110u, 105u, 115u, 101u, 100u, 32, 97u, 114u, 103u, 117u, 109u, 101u, 110u, 116u, 44, 32, 101u, 120u, 112u, 101u, 99u, 116u, 101u, 100u, 32, 111u, 47, 120u, 47, 95u } } };
static struct l_unnamed_8 alloc78 = { { "Cross wins!" } };

I might try using character literals for i8 initialisers in the printable ASCII range. `{ 'U', 'n', 'r', 'e', 'c', 'o', 'g', 'n', 'i', 's', 'e', 'd', …` wouldn't be ideal of course, but it would take up no more space than the current integers, and it would be more readable.

I'm not sure what's best in general for the control-character and non-ASCII ranges.
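As a rough sketch of the character-literal idea, the helper below formats a single byte as a C initialiser element. `format_byte` is a hypothetical name for illustration and is not part of llvm-cbe. Falling back to the existing `85u`-style integers for control characters, non-ASCII bytes, the single quote, and the backslash is only one possible answer to the open question above, chosen here because it avoids any escaping logic.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not llvm-cbe's actual code: write the C source text
   for one i8 initialiser element into out (at most n bytes including the
   terminating null). */
static void format_byte(unsigned char b, char *out, size_t n) {
    /* Printable ASCII is 0x20..0x7e. The quote and backslash would need
       escaping inside a character literal, so this sketch assumes it is
       simpler to emit them as integers too. */
    if (b >= 0x20 && b <= 0x7e && b != '\'' && b != '\\') {
        snprintf(out, n, "'%c'", b);
    } else {
        snprintf(out, n, "%uu", (unsigned)b);
    }
}
```

Called with `'U'` this yields `'U'`, and called with a newline byte (10) it yields `10u`, matching the style of the current output.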

hikari-no-yume commented 3 years ago

Actually, if I'm interpreting the C spec correctly, it should be possible to use a string literal here! The C99 spec says (emphasis mine):

> An array of character type may be initialized by a character string literal, optionally enclosed in braces. Successive characters of the character string literal (including the terminating null character if there is room or if the array is of unknown size) initialize the elements of the array.

I need to check whether this holds for C89 too, but it looks promising.

Edit: Seems like this applies to C89 too.
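To illustrate the quoted rule, here is a minimal sketch of what the two constants could look like as plain arrays. The names `alloc73_sketch`, `alloc78_sketch`, and `last_is_nul` are made up for this example, and the arrays are declared directly as `unsigned char` rather than through the generated wrapper structs. Note that this relies on C semantics: C++ rejects a string literal that is one character too long for the array.

```c
#include <stddef.h>

/* 37 characters in a 37-element array: there is no room for the
   terminating null, so C simply omits it. */
static const unsigned char alloc73_sketch[37] =
    "Unrecognised argument, expected o/x/_";

/* 11 characters in a 12-element array: there is room, so the
   terminating null is stored as the last element. */
static const unsigned char alloc78_sketch[12] = "Cross wins!";

/* Illustrative check: returns 1 if the last element is a null byte. */
static int last_is_nul(const unsigned char *a, size_t n) {
    return n > 0 && a[n - 1] == 0;
}
```

With these definitions, `last_is_nul(alloc73_sketch, sizeof alloc73_sketch)` is 0 and `last_is_nul(alloc78_sketch, sizeof alloc78_sketch)` is 1, so the non-null-terminated array keeps its exact length. Some compilers may warn about the unterminated initialiser.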