apache / openwhisk

Apache OpenWhisk is an open source serverless cloud platform
https://openwhisk.apache.org/
Apache License 2.0
6.49k stars 1.16k forks source link

properly handle unicode characters #252

Closed rabbah closed 7 years ago

rabbah commented 8 years ago

The decoding is localized to reading records from the database (on the client side) which is normalizing to the HTTP default charset ISO-8859-1 instead of UTF-8.

As a sanity check, I confirmed that for a sample code:

function main() {
  var frenchMessage = "Le périphérique est connecté";
  var chineseMesage = "装置被连接";
  return ({"fr": frenchMessage, "zh-Hans": chineseMesage});
}

The record is stored correctly in the database. If I were to encode the string using base64 encoding and logging the result, the activation result and logs are correctly stored in the database:

function main() {
  var msg = 'eyJmciI6IkxlIHDDqXJpcGjDqXJpcXVlIGVzdCBjb25uZWN0w6kiLCJ6aC1IYW5zIjoi6KOF572u6KKr6L+e5o6lIn0=';
  var dec = new Buffer(msg, 'base64').toString('utf-8');
  console.log(dec);
  var res = JSON.parse(dec);
  return res;
}

results in this activation record:

"response": {
    "statusCode": 0,
    "result": {
      "fr": "Le périphérique est connecté",
      "zh-Hans": "装置被连接"
    }
  },
  "logs": [
    "2016-04-23T13:36:16.030199912Z stdout: {\"fr\":\"Le périphérique est connecté\",\"zh-Hans\":\"装置被连接\"}"
  ],

@psuter FYI this is true both using the Cloudant SDK as well as the Spray client.

stevenatkin commented 8 years ago

Can we just normalize to UTF-8 instead of ISO8859-1. Until this is fixed it will be difficult for anyone to use OpenWhisk in anything other than English.

midheartty commented 7 years ago

Using Unicode escape sequence in action code also could produce UTF-8 output:

var chineseMesage = "\u88c5\u7f6e\u88ab\u8fde\u63a5";

rabbah commented 7 years ago

@psuter can you summarize your findings for future investigators?

midheartty commented 7 years ago

I found a S/O discussion regarding the encoding declaration for HttpEntity in scala/spray

http://stackoverflow.com/questions/36456796/spray-test-case-doesnt-property-unmarshal-utf-8-encoded-json-using-httpentity

Hope this solution would help.

psuter commented 7 years ago

So this is slightly fuzzy in my head.. I believe I had tracked it down to being an issue only happening when the JVMs are running inside a Docker container. (The DB seems to store everything properly.) I had diff'ed the JVM system properties when running outside of a container and when running with the image we use for all Scala containers, and there was only one difference. I believe it was a property related to the byte order mark... not 100% sure. I also believe this was a fixed property, not one that the user can configure.

psuter commented 7 years ago

Also, my branch https://github.com/psuter/openwhisk/tree/utf-8-db had some tests and (unsuccessful) attempts.

midheartty commented 7 years ago

I confirmed the DB (CouchDB) stores action codes with non-English data correctly by reviewing the CouchDB web interface for my vagrant install.

http://172.17.0.1:5984/_utils/database.html?vagrant_vagrant-ubuntu-trusty-64_whisks (Overview > vagrant_vagrant-ubuntu-trusty-64_whisks > guest/myaction)

This may mean wrong charset conversion happens when loading action codes or producing a response in the Invoker.

steveatkin commented 7 years ago

I suspect that OpenWhisk is using the wrong codepage when actions are being executed. My guess is that they are using ISO8859-1 (Latin 1) when they should be using UTF-8 (Unicode). This should be a simple configuration change and would enable them to avoid having to do a conversion as the actions and data are most likely in UTF-8 already.

csantanapr commented 7 years ago

Getting more reports from users being hit with this problem, like using a parameter with value Aurélien the é not being handle correctly.

stevenatkin commented 7 years ago

I think that this really needs to get fixed and I think it is tied to #362 as well.

rabbah commented 7 years ago

There are many instances of this bug - fixing this will close the most issues in one commit :1st_place_medal:

rabbah commented 7 years ago

Here's a hint: actions via zip files work. This rules out (as both @psuter and I confirmed with separate unit tests) that it's not the action container but likely the interface between the container and the invoker for exchanging arguments.

stevenatkin commented 7 years ago

Ok so this means that it is probably the code that is reading the files. It may be reading it in the wrong codepage. Check to see if they have hardcoded that. If the code that reads the files is written in Java then the default codepage is probably ISO8859-1. I don't believe that UTF-8 is the default you need to explicit state the encoding.

rabbah commented 7 years ago

found it!

csantanapr commented 7 years ago

Don't keep it a secret !!!

csantanapr commented 7 years ago

Let me get some 🍿 , brb

stevenatkin commented 7 years ago

Awesome :)

rabbah commented 7 years ago

Proposed patch https://github.com/openwhisk/openwhisk/commit/b0e35e66b4f3c6e2914a9c30be2c3b356d7d107b.

rabbah commented 7 years ago

I confirmed a couple of issues were duplicates and closed them (and tests cover them). The Java action proxy, and the python proxy for the other runtimes will need to be patched. These are confirmed to be in the containers themselves, and not the Invoker (as was the case with node.js). #1757 fixes the node.js actions as a first step.

EDIT: Java actions also fixed in same PR.

midheartty commented 7 years ago

Great!

kaya-mikio commented 7 years ago

Thanks!

rabbah commented 7 years ago

Fix merged in #1757.