Encoding issues when embedding stringified data in html

eshaz / simple-yenc

Minimalist JavaScript binary string encoder / decoder with 1-2% overhead, compared to 33%-40% overhead for 6-bit encoding methods like Base64.

MIT License


Encoding issues when embedding stringified data in html #1

Closed eyaler closed 2 years ago

eyaler commented 2 years ago

My use case is to have the binary data embedded in a JS script within a single HTML file, and your solution would give an amazing benefit, if I can get it to work... I have been struggling with encoding issues. If I save the HTML as UTF-8, I lose the efficiency, because bytes 128-255 get prepended with 194 or 195; however, this renders OK. But if I save the HTML as raw binary bytes to retain the efficiency, I cannot find the correct way to specify the HTML charset, and I cannot get the HTML to decode correctly. It would be amazing if you could provide a working HTML example, and any other help would be highly appreciated.
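To make the overhead described above concrete, here is a minimal sketch of what UTF-8 does to character codes 128-255. The helper name `utf8Bytes` is hypothetical and not part of simple-yenc; `TextEncoder` is the standard API and always produces UTF-8.

```javascript
// TextEncoder always encodes to UTF-8.
const encoder = new TextEncoder();

// Hypothetical helper: the UTF-8 bytes for a single character code.
function utf8Bytes(code) {
  return Array.from(encoder.encode(String.fromCharCode(code)));
}

console.log(utf8Bytes(65));  // [65]        ASCII stays one byte
console.log(utf8Bytes(128)); // [194, 128]  codes 128-191 get a 194 prefix
console.log(utf8Bytes(191)); // [194, 191]
console.log(utf8Bytes(192)); // [195, 128]  codes 192-255 get a 195 prefix
console.log(utf8Bytes(255)); // [195, 191]

// Storing each of the 256 possible values once costs 384 bytes in UTF-8.
let total = 0;
for (let code = 0; code < 256; code++) total += utf8Bytes(code).length;
console.log(total); // 384
```

For data whose byte values are roughly evenly distributed, this comes to about 50% extra size, which is more than the overhead of Base64.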

eyaler commented 2 years ago

solved it!!!

1. Save the HTML as binary bytes.
2. In the HTML meta, specify charset=cp1252 (or ascii or latin1 or iso-8859-1).
3. In the decode function, add at simple-yenc.js#L38: if(byte>255) byte=128+[8364,,8218,402,8222,8230,8224,8225,710,8240,352,8249,338,,381,,,8216,8217,8220,8221,8226,8211,8212,732,8482,353,8250,339,,382,376].indexOf(byte) (this is based on https://stackoverflow.com/a/10081375/664456, where I have changed the characters to ints to avoid encoding issues...)
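The fix above can be sketched as a standalone function. This is only an illustration of the idea, not the code that ended up in simple-yenc: the names `CP1252_HIGH`, `cp1252CodeToByte`, and `stringToBytes` are hypothetical. It assumes the browser decoded the page as windows-1252, so file bytes 0x80-0x9F arrive as characters with codes above 255 and have to be mapped back. Unlike the one-liner, this version throws on a code that is not in the table instead of silently producing 127.

```javascript
// Character codes that windows-1252 assigns to bytes 0x80-0x9F (index 0 is byte 0x80).
// The holes are the five bytes without a printable character (0x81, 0x8D, 0x8F,
// 0x90, 0x9D); this sketch assumes they arrive as codes <= 255 and need no lookup.
const CP1252_HIGH = [8364,,8218,402,8222,8230,8224,8225,710,8240,352,8249,338,,381,,,8216,8217,8220,8221,8226,8211,8212,732,8482,353,8250,339,,382,376];

// Hypothetical helper: recover the original file byte from a decoded character code.
function cp1252CodeToByte(code) {
  if (code <= 255) return code;
  const index = CP1252_HIGH.indexOf(code); // indexOf skips the holes
  if (index === -1) throw new RangeError("Not a windows-1252 character: " + code);
  return 128 + index;
}

// Hypothetical helper: convert a whole decoded string back to bytes.
function stringToBytes(text) {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    bytes[i] = cp1252CodeToByte(text.charCodeAt(i));
  }
  return bytes;
}

// "A" (byte 0x41), "€" (byte 0x80), "é" (byte 0xE9), "Ÿ" (byte 0x9F)
console.log(Array.from(stringToBytes("A\u20AC\u00E9\u0178"))); // [65, 128, 233, 159]
```

Only the 27 characters in the table ever need the lookup; every other byte value already equals its character code.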

If you are interested, I can make a PR. I also have a working encoder + stringifier in Python that I can contribute. You are awesome.

eshaz commented 2 years ago

I'm glad you figured it out! I would definitely welcome a PR.

In addition to the fix, would you also be able to put in a test case here in your PR?

eyaler commented 2 years ago

PR pending, and we are discussing the testing here: https://github.com/eshaz/simple-yenc/pull/2

Also, not sure if this is of any interest, but in the context of stringified data I found that:

In the encoder: there is no need to escape \n with =. You still need to escape \r, because it is automatically normalized to \n, but \n itself works fine.
In the decoder: there is no need to continue on \n or \r.
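The \r normalization mentioned above can be reproduced in plain JavaScript. The sketch below assumes the stringified data sits inside a template literal, which the thread does not confirm; the HTML parser applies a similar newline normalization to the page source. The helper `fromSource` is hypothetical and only builds a template literal from source text that contains raw control characters.

```javascript
// Hypothetical helper: evaluate a template literal whose source text contains
// the given raw characters between the backticks.
const fromSource = (body) => new Function("return `" + body + "`")();

const codes = (text) => Array.from(text, (ch) => ch.charCodeAt(0));

const lf = fromSource("a\nb");     // raw LF in the source
const cr = fromSource("a\rb");     // raw CR in the source
const crlf = fromSource("a\r\nb"); // raw CR LF in the source

console.log(codes(lf));   // [97, 10, 98]  LF survives unchanged
console.log(codes(cr));   // [97, 10, 98]  CR was turned into LF
console.log(codes(crlf)); // [97, 10, 98]  CR LF collapsed into a single LF
```

Because a raw \r never reaches the decoder as \r, an encoder has to escape it, while a raw \n arrives intact.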