gbdev / rgbds

Rednex Game Boy Development System - An assembly toolchain for the Nintendo Game Boy and Game Boy Color
https://rgbds.gbdev.io
MIT License

[Feature request] Strings as byte arrays #938

Open Rangi42 opened 3 years ago

Rangi42 commented 3 years ago

In most contexts, strings are already just byte sequences. String literals can contain any bytes (except for \0, which currently terminates the string, but using C++ std::string could avoid this). String functions like STRCAT and STRRPL operate on the bytes and do not care about encoding. Even print and println just send the bytes to stdout; things print as UTF-8 iff that is set as the console's locale.

The only functions that warn about strings which aren't valid UTF-8 are STRLEN and STRSUB. I think this is actually a mistake, and we should have STRLENUTF8 and STRSUBUTF8 if that behavior is desired.

You would expect db "{s}" to declare STRLEN("{s}") many bytes, but actually STRLEN undercounts since there are multi-byte UTF-8 characters. STRLEN("héllo") == 5, but db "héllo" declares 6 bytes, 68 c3 a9 6c 6c 6f.
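The mismatch can be reproduced outside rgbasm. This is a minimal Python sketch, assuming the source text is UTF-8 and only the default charmap is active. Python's `len` on a `str` counts code points (the count the example above attributes to STRLEN), while `len` on the encoded bytes counts what db would emit. The variable names are illustrative only.

```python
# "héllo", with é written as an escape so the code point is unambiguous.
s = "h\u00e9llo"
utf8 = s.encode("utf-8")

code_points = len(s)      # number of Unicode code points
byte_count = len(utf8)    # number of bytes in the UTF-8 encoding
hex_dump = utf8.hex(" ")  # space-separated hex dump of those bytes

print(code_points, byte_count, hex_dump)
```

The two counts differ by one because é (U+00E9) takes two bytes, c3 a9, in UTF-8.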

If strings acted as byte arrays, and #885 allowed \0 bytes in strings, then #933 could implement a single READFILE function for both text and binary files. We would not need to implement numeric arrays (#67) just for that one use case (and given all the open questions about how arrays should behave, and the lack of string arrays anyway, I'd rather not have them.)

Changing the behavior of STRLEN and STRSUB would be a potentially breaking change, but I think it would be better than adding "STRBYTELEN" and "STRBYTESUB" functions, since UTF-8 encoding is the unusual special case. (Note that rgbds-struct's uses of STRLEN and STRSUB would all be valid even if the definitions were changed; and hypothetical cases that would break should probably be using CHARLEN and CHARSUB anyway.)

(One other useful function would be STRBYTE(str, idx), to get the raw byte value at an index, without going through the charmap. That is, STRSUB("ABCD", 2, 1) and CHARSUB("ABCD", 2) return the string "B" which coerces to the number $42 if you haven't charmapped it; but STRBYTE("ABCD", 2) would return $42 directly.)
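The difference between the two lookups can be sketched in Python, with strings modeled as `bytes`. The 1-based indexing follows the STRSUB("ABCD", 2, 1) example above; the function names `strsub` and `strbyte` and the out-of-range behavior are assumptions for illustration, not rgbasm's actual implementation.

```python
def strsub(s: bytes, pos: int, length: int) -> bytes:
    """Byte-oriented substring with a 1-based start position."""
    start = pos - 1
    return s[start:start + length]


def strbyte(s: bytes, idx: int) -> int:
    """Hypothetical STRBYTE: the raw byte value at a 1-based index.

    No charmap is consulted. Raising on an out-of-range index is an
    assumption; the proposal does not say what should happen.
    """
    if not 1 <= idx <= len(s):
        raise IndexError(f"index {idx} is outside 1..{len(s)}")
    return s[idx - 1]


sub = strsub(b"ABCD", 2, 1)  # still a string: b"B"
raw = strbyte(b"ABCD", 2)    # a number: 0x42
print(sub, hex(raw))
```

The substring result would still have to pass through the charmap to become a number, whereas the byte lookup yields the number directly.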

(Another nice addition along with this would be to allow \0 as a way to put $00 bytes in strings. It can be inconvenient to have literal null bytes in a file, but all the others are fine.)

We would probably also want to get rid of the "Input string is not valid UTF-8!" warning in charmap.c, which I think is the only other place where UTF-8 encoding matters.

ISSOtm commented 3 years ago

> You would expect db "{s}" to declare STRLEN("{s}") many bytes, but actually STRLEN undercounts since there are multi-byte UTF-8 characters. STRLEN("héllo") == 5, but db "héllo" declares 6 bytes, 68 c3 a9 6c 6c 6f.

No, you wouldn't, because charmaps.

Rangi42 commented 3 years ago

> No, you wouldn't, because charmaps.

That's assuming there are no charmaps involved besides the default one, so STRLEN("{s}") == CHARLEN("{s}").

Rangi42 commented 3 years ago

Basically these are what I see as the four ways forward:

  1. The status quo: we have strings for which STRLEN and STRSUB expect UTF-8 encoding. We're going to switch from leaky char *s to ref-counted struct Strings or RAII-with-smart-pointers std::strings to allow unlimited string lengths. We could also add a READFILE function to read the contents of a UTF-8 text file as a string, but can't handle binary files. One of the motivating use cases for even adding file-reading functions was to add an offset to the bytes of a tilemap, which would have to be binary, so I think we should try to allow that.
  2. Add arrays/lists and a READBIN function to return an array for binary files, plus READFILE for strings for UTF-8 text files. Given the uncertainties and tradeoffs we ran into when considering how arrays would be implemented, and how major a feature it would be mostly just for the sake of enabling READBIN, I'd rather not do that.
  3. Let READFILE return a string for binary files too. Define new functions STRBYTELEN, STRBYTESUB, and STRBYTE to get the length, substrings, and individual bytes from a string, without expecting any particular text encoding. This would still require us to allow $00 bytes in strings, but that's not a problem; it's feasible as long as we don't rely on string.h functions for algorithms (which neither the struct Strings nor std::strings need to do).
  4. Let READFILE return a string for binary files too. Change STRLEN and STRSUB to not expect any particular text encoding, and add STRBYTE to get individual bytes from a string. Optionally add STRLENUTF8 and STRSUBUTF8 to allow the current behavior (which I do think is worthwhile, even though in most cases where you care, you should probably be using CHARLEN and CHARSUB).
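The tilemap use case behind options 3 and 4 can be sketched in Python, with `bytes` standing in for a byte-array string. The names `readfile` and `add_offset`, the file name, the sample data, and the 8-bit wrap-around are all assumptions for illustration; none of them come from rgbasm.

```python
import tempfile
from pathlib import Path


def readfile(path: Path) -> bytes:
    """Hypothetical READFILE: return the file's contents as raw bytes, $00 included."""
    return path.read_bytes()


def add_offset(data: bytes, offset: int) -> bytes:
    """Add an offset to every byte; wrapping to 8 bits is an assumption."""
    return bytes((b + offset) & 0xFF for b in data)


# Stand-in tilemap, written to a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    tilemap_path = Path(tmp) / "tilemap.bin"
    tilemap_path.write_bytes(bytes([0x00, 0x01, 0x7F, 0xFF]))
    tilemap = readfile(tilemap_path)

shifted = add_offset(tilemap, 0x80)

# The same data is not valid UTF-8, which is the limitation of option 1.
try:
    tilemap.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(shifted.hex(" "), valid_utf8)
```

A string type that tolerates $00 bytes and arbitrary values is enough for this; no separate array type is involved.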

I could certainly be missing an even better fifth way of allowing users to access binary file contents, so here or #933 is fine for discussing that (or #67 if arrays are the preferred solution).

aaaaaa123456789 commented 3 years ago

I'd say option 3 is the best by far.

Rangi42 commented 3 years ago

Hm, I would somewhat prefer option 4, since I expect that (a) non-UTF-8-specific behavior would be the more common use case, and (b) hopefully few or no users are depending on UTF-8 STRLEN and STRSUB so far; but either would be fine with me.

aaaaaa123456789 commented 3 years ago

STRLEN and STRSUB have to expect some encoding; there's no meaningful concept of "string length" without one. The encoding where every byte encodes itself is an encoding (ISO-8859-1).

Rangi42 commented 3 years ago

Option 4 would make STRLEN behave like C's strlen (except without needing the $00 terminator after we finish PR #885, i.e. STRLEN would return the struct String's size value). STRSUB would likewise act like taking a segment of a char[] array. Neither of those cares about the encoding; the string is just an array of bytes.

True, ISO-8859-1 is an encoding that has single-byte characters, but it's not the only one. And the rgbasm language would not be taking a position on which Unicode code points go with which byte values in strings. So I don't think of option 4 as "switch from UTF-8 to ISO-8859-1", but "switch from UTF-8 to arbitrary unsigned byte values". Even charmaps don't really care about Unicode; the character set only becomes relevant when you print things, and that's up to your console. (Also ISO-8859-1 does not define characters for 00-1F or 7F-9F.)
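Both views can be checked in a short Python sketch. It relies on Python's "latin-1" codec, which maps every byte value 0-255 to the code point with the same number (so, unlike the standard as described above, the codec also assigns the control ranges). The variable names are illustrative only.

```python
all_bytes = bytes(range(256))

# View 1: "every byte encodes itself" (Python's latin-1 codec).
as_latin1 = all_bytes.decode("latin-1")
is_identity = [ord(c) for c in as_latin1] == list(range(256))
round_trips = as_latin1.encode("latin-1") == all_bytes

# View 2: no encoding at all; length and slicing work on the raw bytes.
byte_length = len(all_bytes)
segment = all_bytes[0x41:0x45]  # the byte values $41-$44, whatever they "mean"

print(is_identity, round_trips, byte_length, segment.hex(" "))
```

The byte length and the slice are the same under either view; the choice of encoding only matters once the bytes are displayed as characters.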

ISSOtm commented 2 years ago

Given https://hsivonen.fi/string-length, I'm for option 4 as well.