Writing a string with NUL characters to a socket results in 0xc0 0x80 utf-8 sequences

tve commented 11 months ago

Moddable SDK version: 4.3 Target device: esp32

Description In an HTTP prepareResponse handler I'm returning a string body that happens to contain NUL characters ('\0'). On the client side these come across as 0xc0 0x80 byte pairs and a byte is missing at the end.

Steps to Reproduce

Run the following modification of the httpserverputfile example:


import { Server } from "http"
import { File } from "file"
import Net from "net"

new Server({}).callback = function (message, value) { switch (message) { case Server.headersComplete: // prepare for request body return String

case Server.prepareResponse: // request body received
  return { status: 200, body: "Here is a [\0] byte" }

} }

trace(Available on Wi-Fi "${Net.get("SSID")}"\n) trace( curl --data-binary "@/users/[your directory path here]/test.txt" http://${Net.get( "IP" )}/test.txt -v\n )

2. Perform a curl request and observe the weird response:

curl http://192.168.0.216/ | od -c % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 18 100 18 0 0 198 0 --:--:-- --:--:-- --:--:-- 200 0000000 H e r e i s a [ 300 200 ] b 0000020 y t 0000022
Notice the 0xc0 (0300) and 0x80 (0200) instead of the NUL and the missing 'e' at the end.

Run a tcpdump to verify that this is not a curl misbehavior and also that the final 'e' is actually transmitted but omitted by curl due to the content-length:


    0x0020:  5019 1fb4 1aed 0000 4854 5450 2f31 2e31  P.......HTTP/1.1
    0x0030:  2032 3030 204f 4b0d 0a63 6f6e 6e65 6374  .200.OK..connect
    0x0040:  696f 6e3a 2063 6c6f 7365 0d0a 636f 6e74  ion:.close..cont
    0x0050:  656e 742d 6c65 6e67 7468 3a20 3138 0d0a  ent-length:.18..
    0x0060:  0d0a 4865 7265 2069 7320 6120 5bc0 805d  ..Here.is.a.[..]
    0x0070:  2062 7974 65                             .byte


Update: any response string with a non-ascii unicode character will trigger the incorrect content-length issue. For example, `return { status: 200, body: "Here is a [©] byte" }` results in

(main) /h/s/m/j/host> curl http://192.168.0.216/ | od -c % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 18 100 18 0 0 34 0 --:--:-- --:--:-- --:--:-- 34 0000000 H e r e i s a [ 302 251 ] b 0000020 y t 0000022


Note the missing 'e' at the end.

phoddie commented 11 months ago

Yes, this is known. No plans to fix it in this server. The caller can use TextEncoder to convert the string to binary UTF-8 and write that instead.

FWIW – the ECMA-419 HTTP client and server only operate on binary data (not strings). That's deliberate so that these text conversion issues remain external.

tve commented 11 months ago

Yes, this is known. No plans to fix it in this server.

Noting this in the docs would have saved me a bunch of time and aggravation...

tve commented 11 months ago

FWIW – the ECMA-419 HTTP client and server

Thanks for pointing this out again, ~I'm not finding the HTTPServer implementation in the Moddable SDK, am I missing something?~ found it

phoddie commented 11 months ago

(Added a note to the docs)

tve commented 11 months ago

FYI: the issue with NUL characters seems to also affect Array.fromString. Some uses, like in the ecma-419 mqtt client look vulnerable. I assume this is known and won't-fix.

phoddie commented 11 months ago

The default XS string encoding is almost UTF-8. The exception is NULLs which use the CESU-8 encoding. Here's a modified version of ArrayBuffer.fromString that handles the NULL encoding.

fx_ArrayBuffer_fromString

```c void fx_ArrayBuffer_fromString(txMachine* the) { txSize length; if (mxArgc < 1) mxTypeError("no argument"); length = mxStringLength(fxToString(the, mxArgv(0))); txString c = mxArgv(0)->value.string, end = c + length - 1; txInteger nulls = 0; while (c < end) { if ((0xc0 == (uint8_t)c[0]) && (0x80 == (uint8_t)c[1])) nulls += 1; c++; } fxConstructArrayBufferResult(the, mxThis, length - nulls); if (!nulls) c_memcpy(mxResult->value.reference->next->value.arrayBuffer.address, mxArgv(0)->value.string, length); else { txString c = mxArgv(0)->value.string, end = c + length; txByte *out = mxResult->value.reference->next->value.arrayBuffer.address; while (c < end) { if ((0xc0 == (uint8_t)c[0]) && (0x80 == (uint8_t)c[1])) { *out++ = 0; c += 2; } else *out++ = *c++; } } } ```

Here's the corresponding change to String.fromArrayBuffer:

fx_String_fromArrayBuffer

```c void fx_String_fromArrayBuffer(txMachine* the) { txSlot* slot; txSlot* arrayBuffer = C_NULL, *sharedArrayBuffer = C_NULL; txSlot* bufferInfo; txInteger limit, offset; txInteger inLength, outLength = 0, nulls = 0; unsigned char *in; txString string; if (mxArgc < 1) mxTypeError("no argument"); slot = mxArgv(0); if (slot->kind == XS_REFERENCE_KIND) { slot = slot->value.reference->next; if (slot) { bufferInfo = slot->next; if (slot->kind == XS_ARRAY_BUFFER_KIND) arrayBuffer = slot; else if (slot->kind == XS_HOST_KIND) { if (!(slot->flag & XS_HOST_CHUNK_FLAG) && bufferInfo && (bufferInfo->kind == XS_BUFFER_INFO_KIND)) sharedArrayBuffer = slot; } } } if (!arrayBuffer && !sharedArrayBuffer) mxTypeError("argument is no ArrayBuffer instance"); limit = bufferInfo->value.bufferInfo.length; offset = fxArgToByteLength(the, 1, 0); if (limit < offset) mxRangeError("out of range byteOffset %ld", offset); inLength = fxArgToByteLength(the, 2, limit - offset); if ((limit < (offset + inLength)) || ((offset + inLength) < offset)) mxRangeError("out of range byteLength %ld", inLength); in = offset + (unsigned char *)(arrayBuffer ? arrayBuffer->value.arrayBuffer.address : sharedArrayBuffer->value.host.data); while (inLength > 0) { unsigned char first = c_read8(in++), clen; if (first < 0x80){ if (0 == first) nulls += 1; inLength -= 1; outLength += 1; continue; } if (0xC0 == (first & 0xE0)) clen = 2; else if (0xE0 == (first & 0xF0)) clen = 3; else if (0xF0 == (first & 0xF0)) clen = 4; else goto badUTF8; inLength -= clen; if (inLength < 0) goto badUTF8; outLength += clen; clen -= 1; do { if (0x80 != (0xc0 & c_read8(in++))) goto badUTF8; } while (--clen > 0); } string = fxNewChunk(the, outLength + nulls + 1); if (!nulls) c_memcpy(string, offset + (txString)(arrayBuffer ? arrayBuffer->value.arrayBuffer.address : sharedArrayBuffer->value.host.data), outLength); else { txString c = string, end = c + outLength + nulls; txString buf = offset + (txString)(arrayBuffer ? arrayBuffer->value.arrayBuffer.address : sharedArrayBuffer->value.host.data); while (c < end) { txByte b = *buf++; if (b) *c++ = b; else { *c++ = 0xC0; *c++ = 0x80; } } } string[outLength + nulls] = 0; mxResult->value.string = string; mxResult->kind = XS_STRING_KIND; return; badUTF8: mxTypeError("invalid UTF-8"); } ```

Let me know if this works for you. If it does, I'll merge it.

tve commented 11 months ago

CESU-8: learn something new every day... I'll plug it in and report back, thanks!

phoddie commented 11 months ago

I'll plug it in and report back, thanks!

Great. How'd it go?

tve commented 11 months ago

My test for Array.fromString looks good, ~the impact on larger buffers (1KB) is ~2x but on smaller strings I would call it "in the noise"~ (using esp32):

import Time from "time"
import TextEncoder from "text/encoder"

const ten = "0123456789"
const hundred = ten + ten + ten + ten + ten + ten + ten + ten + ten + ten
const thousand =
  hundred + hundred + hundred + hundred + hundred + hundred + hundred + hundred + hundred + hundred

const ten0 = "01234\x006789"
const hundred0 = ten0 + ten0 + ten0 + ten0 + ten0 + ten0 + ten0 + ten0 + ten0 + ten0
const thousand0 =
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0

function timeit(desc: string, f: () => void): void {
  const t0 = Time.ticks
  f()
  const t1 = Time.ticks
  trace(`${desc} took ${t1 - t0}ms\n`)
}

const tests = [
  ["ten", ten],
  ["hundred", hundred],
  ["thousand", thousand],
  ["ten0", ten0],
  ["hundred0", hundred0],
  ["thousand0", thousand0],
]

trace("\n===== String NUL test =====\n")

for (const test of tests) {
  const ab = ArrayBuffer.fromString(test[1])
  const u8 = new Uint8Array(ab)
  const te = new TextEncoder().encode(test[1])
  const same = u8.byteLength == te.byteLength && u8.every((v, i) => v == te[i])
  trace(
    `${test[0]} ${test[1].length} ${ab.byteLength} ` +
      `${u8.slice(0, 8)}...${u8[u8.byteLength - 1]} same=${same}\n`
  )
}

for (const test of tests) {
  timeit(`${test[0]} fromString`, () => {
    const s = test[1]
    for (let i = 0; i < 100; i++) {
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
    }
  })
}

// Before fix:
// ===== String NUL test =====
// ten 10 10 48,49,50,51,52,53,54,55...57 same=true
// hundred 100 100 48,49,50,51,52,53,54,55...57 same=true
// thousand 1000 1000 48,49,50,51,52,53,54,55...57 same=true
// ten0 10 11 48,49,50,51,52,192,128,54...57 same=false
// hundred0 100 110 48,49,50,51,52,192,128,54...57 same=false
// thousand0 1000 1100 48,49,50,51,52,192,128,54...57 same=false
// ten fromString took 37ms
// hundred fromString took 40ms
// thousand fromString took 67ms
// ten0 fromString took 37ms
// hundred0 fromString took 41ms
// thousand0 fromString took 69ms
//
// After fix:
// ===== String NUL test =====
// ten 10 10 48,49,50,51,52,53,54,55...57 same=true
// hundred 100 100 48,49,50,51,52,53,54,55...57 same=true
// thousand 1000 1000 48,49,50,51,52,53,54,55...57 same=true
// ten0 10 10 48,49,50,51,52,0,54,55...57 same=true
// hundred0 100 100 48,49,50,51,52,0,54,55...57 same=true
// thousand0 1000 1000 48,49,50,51,52,0,54,55...57 same=true
// ten fromString took 26ms
// hundred fromString took 33ms
// thousand fromString took 107ms
// ten0 fromString took 22ms
// hundred0 fromString took 39ms
// thousand0 fromString took 174ms

~similar test for String.fromArrayBuffer will follow~ String.fromArrayBuffer looks good too.

Edit:

For ArrayBuffer.fromString 1000 bytes: post-fix-no-NUL to pre-fix is ~1.6x, and post-fix-with-NUL to post-fix-no-NUL is also ~1.6x; for 100 bytes or less the difference is noise.
For String.fromArrayBuffer 1000 bytes: post-fix-no-NUL to pre-fix is ~1.1x, and post-fix-with-NUL to post-fix-no-NUL is also ~1.5x; for 100 bytes or less the difference is noise.

phoddie commented 11 months ago

Thanks for the tests and benchmarks. Given the need for an extra pass over the data, it is going to take more time. Since the actual work is trivial, memory bandwidth is the limiting factor. Since this isn't generally performance critical, I think it is an OK tradeoff and keeps the code small.

I just noticed that fx_ArrayBuffer_fromString can be optimized. The first thing it does is get the length of the string. That can be combined with the pass that searches for nulls. That should make the performance much closer to the original. Here's the updated version.

fx_ArrayBuffer_fromString 2

```c void fx_ArrayBuffer_fromString(txMachine* the) { txSize length = 0; if (mxArgc < 1) mxTypeError("no argument"); txString c = mxArgv(0)->value.string; txInteger nulls = 0; while (true) { uint8_t b = (uint8_t)c_read8(c++); if (!b) break; length += 1; if ((0xc0 == b) && (0x80 == (uint8_t)c_read8(c))) nulls += 1; } fxConstructArrayBufferResult(the, mxThis, length - nulls); if (!nulls) c_memcpy(mxResult->value.reference->next->value.arrayBuffer.address, mxArgv(0)->value.string, length); else { txString c = mxArgv(0)->value.string, end = c + length; txByte *out = mxResult->value.reference->next->value.arrayBuffer.address; while (c < end) { uint8_t b = (uint8_t)c_read8(c++); if ((0xc0 == (uint8_t)b) && (0x80 == (uint8_t)c_read8(c))) { *out++ = 0; c += 1; } else *out++ = b; } } } ```

How's that look?

Separately, I would expect String.fromArrayBuffer() performance to be more-or-less unchanged. There's no additional pass over the data. If there are nulls, it will be a little slower because it can't use memcpy to transfer the bytes, but in the common case of no nulls it should pretty much match.

tve commented 11 months ago

For ArrayBuffer.fromString

1000 bytes: post-fix-no-NUL to pre-fix is ~1.1x, and post-fix-with-NUL to post-fix-no-NUL is ~1.8x
for 100 bytes or less the difference is noise.

For String.fromArrayBuffer

1000 bytes: post-fix-no-NUL to pre-fix is ~1.1x, and post-fix-with-NUL to post-fix-no-NUL is ~1.5x
for 100 bytes or less the difference is noise.

phoddie commented 11 months ago

Thanks for rechecking. The fromString optimization seems to be working nicely. I'll integrate the changes.

Moddable-OpenSource / moddable

Writing a string with NUL characters to a socket results in 0xc0 0x80 utf-8 sequences #1271