kripken closed this 2 years ago
Yeah, interesting idea. I think this would mean about 11 or 12 more instructions (6 policy-using instructions), but essentially the same implementation complexity.
Incidentally, I have been thinking for a while that both the name "policy" / "wtf8_policy" and its different values are misnamed. It's not just about what to do when you see a surrogate codepoint or "see WTF-8": since #21, the "replace" policy also specifies how to interpret any invalid byte sequence, lossily discarding it and substituting U+FFFD.
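To make the three decoding behaviours concrete, here is a rough Python sketch. The helper names are hypothetical and only mirror the instruction names floated in this thread; they are not the proposal's semantics. In particular, Python's "surrogatepass" handler is only an approximation of WTF-8, since it also accepts an encoded surrogate pair, which real WTF-8 forbids.

```python
def new_utf8(data: bytes) -> str:
    # Strict: any invalid sequence, including an encoded surrogate, raises.
    return data.decode("utf-8", errors="strict")


def new_wtf8(data: bytes) -> str:
    # Approximation of WTF-8 (assumption): "surrogatepass" lets encoded lone
    # surrogates through, but still rejects other invalid byte sequences.
    return data.decode("utf-8", errors="surrogatepass")


def new_lossy_utf8(data: bytes) -> str:
    # Lossy: surrogates and all other decoding errors become U+FFFD.
    return data.decode("utf-8", errors="replace")


lone_surrogate = b"a\xed\xa0\x80b"  # U+D800 encoded the WTF-8 way
garbage = b"a\xffb"                 # 0xFF is never valid in UTF-8

print(repr(new_wtf8(lone_surrogate)))
print(repr(new_lossy_utf8(lone_surrogate)))
print(repr(new_lossy_utf8(garbage)))
```

The point is that the lossy variant is the only one that accepts `garbage` at all, which is why "replace" describes more than a surrogate-handling choice.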
Anyway, some names, maybe:
string.new_wtf8
string.new_utf8
string.new_lossy_utf8 ;; replaces surrogates and decoding errors with U+FFFD
string.new_wtf8_array
string.new_utf8_array
string.new_lossy_utf8_array
string.measure_utf8
string.measure_wtf8 ;; same as what string.measure_lossy_utf8 would be;
;; encoded length of U+FFFD same as surrogate
string.encode_utf8
string.encode_wtf8
string.encode_lossy_utf8
string.encode_utf8_array
string.encode_wtf8_array
string.encode_lossy_utf8_array
;; I find these names very displeasing, aesthetically speaking, but I don't know of better
stringview_wtf8.encode_utf8
stringview_wtf8.encode_wtf8
stringview_wtf8.encode_lossy_utf8
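As a sanity check on the note that string.measure_wtf8 would give the same answer as a hypothetical string.measure_lossy_utf8, here is a small Python sketch. The function names are illustrative only, and returning None from the strict variant is an assumption made for this example. Python strings are sequences of code points, so the sketch only models lone surrogates, not WTF-16 surrogate pairs.

```python
from typing import Optional


def measure_wtf8(s: str) -> int:
    # A lone surrogate is written as a 3-byte sequence.
    return len(s.encode("utf-8", errors="surrogatepass"))


def measure_lossy_utf8(s: str) -> int:
    # Swap each surrogate code point for U+FFFD, which is also 3 bytes.
    replaced = "".join(
        "\ufffd" if 0xD800 <= ord(c) <= 0xDFFF else c for c in s
    )
    return len(replaced.encode("utf-8"))


def measure_utf8(s: str) -> Optional[int]:
    # Strict: no valid UTF-8 encoding exists if a surrogate is present.
    # None is this sketch's way of signalling failure (assumption).
    try:
        return len(s.encode("utf-8"))
    except UnicodeEncodeError:
        return None


with_surrogate = "a\ud800b"
print(measure_wtf8(with_surrogate), measure_lossy_utf8(with_surrogate))
print(measure_utf8(with_surrogate))
```

Both non-strict measurements agree because U+FFFD and any surrogate each take three bytes, so only the strict variant needs a distinct result.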
Maybe a matter of taste, but I think instead of policy immediates it would be more consistent to have separate opcodes. That is, instead of a single instruction that takes a $policy immediate, where $policy is one of utf8, wtf8, replace, we could have one instruction per policy.
For comparison, in GC, ref.as_func / ref.as_data / etc. each have a different opcode, instead of having one opcode + an immediate.