Rangi42 opened 3 years ago
You would expect `db "{s}"` to declare `STRLEN("{s}")` many bytes, but actually `STRLEN` undercounts whenever the string contains multi-byte UTF-8 characters: `STRLEN("héllo") == 5`, but `db "héllo"` declares 6 bytes, `68 c3 a9 6c 6c 6f`.
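For illustration, a minimal sketch of the mismatch (assuming no custom charmap is active):

```asm
SECTION "Example", ROM0

PRINTLN STRLEN("héllo") ; prints $5 (STRLEN counts UTF-8 code points)
Hello:
    db "héllo"          ; but this emits 6 bytes: $68,$C3,$A9,$6C,$6C,$6F
```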
No, you wouldn't, because charmaps.
> No, you wouldn't, because charmaps.

That's assuming there are no charmaps involved besides the default one, so `STRLEN("{s}") == CHARLEN("{s}")`.
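For example (the charmap values here are made up, just to illustrate): once a custom charmap is in effect, `db` emits one byte per charmap entry, so the emitted length follows `CHARLEN` rather than `STRLEN`:

```asm
SECTION "Text", ROM0

newcharmap game
charmap "h",  $00
charmap "é",  $01
charmap "ll", $02 ; one entry can cover several characters
charmap "o",  $03

Hello:
    db "héllo" ; emits 4 bytes ($00,$01,$02,$03); CHARLEN("héllo") == 4, while STRLEN("héllo") == 5
```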
Basically these are what I see as the four ways forward:

1. `STRLEN` and `STRSUB` keep expecting UTF-8 encoding. We're going to switch from leaky `char *`s to ref-counted `struct String`s or RAII-with-smart-pointers `std::string`s to allow unlimited string lengths. We could also add a `READFILE` function to read the contents of a UTF-8 text file as a string, but it couldn't handle binary files. One of the motivating use cases for even adding file-reading functions was to add an offset to the bytes of a tilemap, which would have to be binary, so I think we should try to allow that.
2. Add a `READBIN` function that returns an array for binary files, plus `READFILE` that returns a string for UTF-8 text files. Given the uncertainties and tradeoffs we ran into when considering how arrays would be implemented, and how major a feature it would be mostly just for the sake of enabling `READBIN`, I'd rather not do that.
3. Have `READFILE` return a string for binary files too. Define new functions `STRBYTELEN`, `STRBYTESUB`, and `STRBYTE` to get the length, substrings, and individual bytes of a string, without expecting any particular text encoding (see the sketch after this list). This would still require us to allow $00 bytes in strings, but that's not a problem; it's feasible as long as we don't rely on string.h functions for algorithms (which neither `struct String`s nor `std::string`s need to do).
4. Have `READFILE` return a string for binary files too. Change `STRLEN` and `STRSUB` to not expect any particular text encoding, and add `STRBYTE` to get individual bytes from a string. Optionally add `STRLENUTF8` and `STRSUBUTF8` to keep the current behavior available (which I do think is worthwhile, even though in most cases where you care, you should probably be using `CHARLEN` and `CHARSUB`).

I could certainly be missing an even better fifth way of allowing users to access binary file contents, so here or #933 is fine for discussing that (or #67 if arrays are the preferred solution).
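To make the user-visible difference between options 3 and 4 concrete, here is a hypothetical sketch of what each would print for the "héllo" example (none of the `STRBYTE*` or `*UTF8` functions exist yet, and the names are only proposals):

```asm
; Option 3: STRLEN/STRSUB keep their UTF-8 behavior; byte-oriented functions are added
PRINTLN STRLEN("héllo")     ; $5 (unchanged: counts code points)
PRINTLN STRBYTELEN("héllo") ; $6 (proposed: raw byte count)
PRINTLN STRBYTE("héllo", 2) ; $C3 (proposed: raw byte at 1-based position 2)

; Option 4: STRLEN/STRSUB become byte-oriented; UTF-8 variants are optional additions
PRINTLN STRLEN("héllo")     ; would print $6 instead
PRINTLN STRLENUTF8("héllo") ; $5 (proposed: the current UTF-8-aware behavior)
```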
I'd say #3 is the best option by far.
Hm, I would somewhat prefer 4, since I expect (a) the non-UTF-8-specific behavior would be the more common use case, and (b) hopefully few or no users are depending on UTF-8 `STRLEN` and `STRSUB` so far; but either would be fine with me.
`STRLEN` and `STRSUB` have to expect some encoding; there's no meaningful concept of "string length" without one. The encoding where every byte encodes itself is an encoding (ISO-8859-1).
Option 3 would make `STRLEN` behave like C's `strlen` (except without needing the $00 terminator after we finish PR #885, i.e. `STRLEN` would return the `struct String`'s `size` value). `STRSUB` would likewise act like taking a segment of a `char[]` array. Neither of those cares about the encoding; the string is just an array of bytes.
True, ISO-8859-1 is an encoding that has single-byte characters, but it's not the only one. And the rgbasm language would not be taking a position on which Unicode code points go with which byte values in strings. So I don't think of option 3 as "switch from UTF-8 to ISO-8859-1", but "switch from UTF-8 to arbitrary unsigned byte values". Even charmaps don't really care about Unicode; the character set only becomes relevant when you print things, and that's up to your console. (Also ISO-8859-1 does not define characters for 00-1F or 7F-9F.)
Given https://hsivonen.fi/string-length, I'm for option 4 as well.
In most contexts, strings are already just byte sequences. String literals can contain any bytes (except for `\0`, which currently terminates the string, but using C++ `std::string` could avoid this). String functions like `STRCAT` and `STRRPL` operate on the bytes and do not care about encoding. Even `print` and `println` just send the bytes to stdout; things print as UTF-8 iff that is set as the console's locale.

The only functions that warn about strings which aren't valid UTF-8 are `STRLEN` and `STRSUB`. I think this is actually a mistake, and we should have `STRLENUTF8` and `STRSUBUTF8` if that behavior is desired. You would expect `db "{s}"` to declare `STRLEN("{s}")` many bytes, but actually `STRLEN` undercounts whenever the string contains multi-byte UTF-8 characters: `STRLEN("héllo") == 5`, but `db "héllo"` declares 6 bytes, `68 c3 a9 6c 6c 6f`.

If strings acted as byte arrays, and #885 allowed `\0` bytes in strings, then #933 could implement a single `READFILE` function for both text and binary files. We would not need to implement numeric arrays (#67) just for that one use case (and given all the open questions about how arrays should behave, and the lack of string arrays anyway, I'd rather not have them).

Changing the behavior of `STRLEN` and `STRSUB` would be a potentially breaking change, but I think it would be better than adding `STRBYTELEN` and `STRBYTESUB` functions, since UTF-8 encoding is the unusual special case. (Note that rgbds-struct's uses of `STRLEN` and `STRSUB` would all remain valid even if the definitions were changed; and hypothetical cases that would break should probably be using `CHARLEN` and `CHARSUB` anyway.)

(One other useful function would be `STRBYTE(str, idx)`, to get the raw byte value at an index, without going through the charmap. That is, `STRSUB("ABCD", 2, 1)` and `CHARSUB("ABCD", 2)` return the string `"B"`, which coerces to the number $42 if you haven't charmapped it; but `STRBYTE("ABCD", 2)` would return $42 directly.)

(Another nice addition along with this would be to allow `\0` as a way to put $00 bytes in strings. It can be inconvenient to have literal null bytes in a source file, but all the other byte values are fine.)

We would probably also want to get rid of the "Input string is not valid UTF-8!" warning in charmap.c, which I think is the only other place where UTF-8 encoding matters.
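As a rough side-by-side of the proposed `STRBYTE` against today's functions (only the `STRBYTE` line is hypothetical; the rest assumes the default charmap):

```asm
PRINTLN STRSUB("ABCD", 2, 1) ; prints B, a one-character string
DEF n = CHARSUB("ABCD", 2)   ; used as a number, "B" goes through the charmap: $42
PRINTLN n                    ; prints $42
PRINTLN STRBYTE("ABCD", 2)   ; proposed: would yield $42 directly, bypassing the charmap
```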