WebAssembly / stringref

Policy immediates #35

Closed kripken closed 2 years ago

kripken commented 2 years ago

Maybe a matter of taste, but I think instead of policy immediates it would be more consistent to have separate opcodes. That is, instead of

string.new_wtf8 $policy
string.new_wtf16

where $policy is one of utf8, wtf8, or replace, we could have

string.new_utf8
string.new_wtf8
string.new_wtf8_replace ;; or utf8 here? I don't know unicode...
string.new_wtf16

For comparison, in the GC proposal, ref.as_func, ref.as_data, etc. each have their own opcode, instead of sharing one opcode plus an immediate.

wingo commented 2 years ago

Yeah, interesting idea. I think this would mean about 11 or 12 more instructions (6 policy-using instructions), but essentially the same implementation complexity.
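One way to arrive at the "11 or 12" figure, as a rough sketch: it assumes three policies, that each of the 6 policy-using instructions is currently a single opcode, and that one combination may be redundant (the lossy measure variant discussed later in this comment). The function name and parameters are illustrative, not part of the proposal.

```python
def extra_opcodes(policy_instructions: int, policies: int, redundant: int = 0) -> int:
    """Opcodes added when every policy immediate is split into its own opcode."""
    # Each instruction already occupies one opcode, so only the
    # additional per-policy variants count as new.
    split_total = policy_instructions * policies
    return split_total - policy_instructions - redundant


# 6 instructions x 3 policies, minus the 6 opcodes that already exist.
upper = extra_opcodes(policy_instructions=6, policies=3)
# Same count if one variant turns out to be redundant.
lower = extra_opcodes(policy_instructions=6, policies=3, redundant=1)
print(lower, upper)
```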

Incidentally, I have been thinking for a while that both the name "policy" / "wtf8_policy" and the different values here are misnamed. It's not just about what to do when you see a surrogate codepoint, or when you "see WTF-8": since #21, the "replace" policy also specifies how to interpret any invalid byte sequence, lossily discarding it and replacing it with U+FFFD.
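To make the three decoding behaviours concrete, here is a minimal sketch using Python's built-in codecs as a stand-in. It is an approximation, not the proposal's semantics: `new_string` and the policy names are hypothetical, `surrogatepass` is somewhat more permissive than WTF-8 proper (it also accepts encoded surrogate pairs), and the number of U+FFFD characters Python emits per invalid sequence may differ from what the proposal specifies.

```python
def new_string(data: bytes, policy: str) -> str:
    """Decode bytes under a hypothetical 'utf8', 'wtf8', or 'lossy_utf8' policy."""
    if policy == "utf8":
        # Strict: surrogates and malformed bytes are both rejected.
        return data.decode("utf-8", "strict")
    if policy == "wtf8":
        # Encoded surrogates are accepted; other malformed bytes are still rejected.
        return data.decode("utf-8", "surrogatepass")
    if policy == "lossy_utf8":
        # Surrogates and malformed bytes alike become U+FFFD.
        return data.decode("utf-8", "replace")
    raise ValueError(f"unknown policy: {policy}")


surrogate_bytes = b"\xed\xa0\x80"  # WTF-8 encoding of the lone surrogate U+D800
invalid_bytes = b"a\xffb"          # 0xFF can never appear in UTF-8

print(repr(new_string(surrogate_bytes, "wtf8")))
print(repr(new_string(invalid_bytes, "lossy_utf8")))
```

The lossy case is what makes "policy about surrogates" a misleading name: it changes the outcome for `invalid_bytes`, which contains no surrogate at all.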

Anyway, some names, maybe:

string.new_wtf8
string.new_utf8
string.new_lossy_utf8  ;; replaces surrogates and decoding errors with U+FFFD
string.new_wtf8_array
string.new_utf8_array
string.new_lossy_utf8_array

string.measure_utf8
string.measure_wtf8 ;; same as what string.measure_lossy_utf8 would be;
                    ;; encoded length of U+FFFD same as surrogate

string.encode_utf8
string.encode_wtf8
string.encode_lossy_utf8
string.encode_utf8_array
string.encode_wtf8_array
string.encode_lossy_utf8_array

;; I find these names very displeasing, aesthetically speaking, but I don't know of better
stringview_wtf8.encode_utf8
stringview_wtf8.encode_wtf8
stringview_wtf8.encode_lossy_utf8
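The note on string.measure_wtf8 above can be checked with a small sketch: a lone surrogate encoded as WTF-8 and U+FFFD encoded as UTF-8 both take 3 bytes, so the two measurements agree. The `encode` helper and its policy names are hypothetical, and the sketch assumes the string holds only isolated surrogates (Python's `surrogatepass` would also encode a surrogate pair as two 3-byte sequences, which WTF-8 does not allow).

```python
def encode(s: str, policy: str) -> bytes:
    """Encode a string under a hypothetical 'utf8', 'wtf8', or 'lossy_utf8' policy."""
    if policy == "utf8":
        # Strict: raises UnicodeEncodeError on any surrogate.
        return s.encode("utf-8")
    if policy == "wtf8":
        # Isolated surrogates are written out as 3-byte sequences.
        return s.encode("utf-8", "surrogatepass")
    if policy == "lossy_utf8":
        # Replace each surrogate with U+FFFD before encoding.
        cleaned = "".join("\ufffd" if 0xD800 <= ord(c) <= 0xDFFF else c for c in s)
        return cleaned.encode("utf-8")
    raise ValueError(f"unknown policy: {policy}")


s = "a\ud800b"  # contains one isolated surrogate
wtf8 = encode(s, "wtf8")
lossy = encode(s, "lossy_utf8")
print(len(wtf8), len(lossy))
```

The byte contents differ, but the lengths match, which is why a separate lossy measure instruction would be redundant under these assumptions.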