Open ScottChapman opened 6 years ago
@ScottChapman
we run
value = value.encode('utf-8')
merely to ensure that the data is valid unicode. the motivation behind this is that in the past we have received corrupted data from Message Hub and that message is passed as part of the payload to the request which itself attempts to encode the incoming data as part of the json
module. In fact, encode
runs an implicit decode
prior to attempting to actually encode. likewise decode
runs an implicit encode
prior to attempting to actually decode.
>>> '\xb6'.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb6 in position 0: ordinal not in range(128)
>>> u'\xb6'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb6' in position 0: ordinal not in range(128)
As you can see the call to '\xb6'.encode('utf-8')
results in a UnicodeDecodeError and the call to u'\xb6'.decode('utf-8')
results in a UnicodeEncodeError
But the ultimate point of running value.encode('utf-8')
is merely to ensure that we are working with valid unicode before passing it down to other modules such as json
and requests
as those modules will surface errors if we don't verify beforehand
The data therefore is arriving corrupt and is not being corrupted by the use of this function call
Geez, that is super confusing ( ;-) ). All I could tell from the docs (and my understanding of Kafka) is that the messages are always bytes which need to be properly converted to text (assuming it is text of course). I think that looks like a destructive test though since it assigns value
to the result.
There is some discussion in the #whisk-users channel about some data corruption. It was confirmed that the data is fine through Kafka (consumer shows right data) and obviously the action can be passed proper data and it looks fine. So this looked suspicious since it is the "middleman" here.
The discussion is here: https://ibm-ics.slack.com/archives/C0BUS3JE8/p1534022249000007
I agree with Scott that there is no issue with the IBM Message Hub instance (Kafka) WRT corruption of messages. To confirm that was indeed the case, I created:
The producer posts a message with some English and some non-English characters to a topic called Greeter. The consumer polls the same topic and prints out the message.
Here is the output I am receiving from each:
Producer: diamond:target mmehra$ java -cp KafkaClient-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.ibm.kafka.KafkaProducerClient SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. Producer: Posting message to topic: Greeter ==> {"greeting":"Howdy","user":"Malalažbeta"} Producer: Message posted to Partition: 0, Offset: 21 diamond:target mmehra$
Consumer: diamond:target mmehra$ java -cp KafkaClient-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.ibm.kafka.KafkaConsumerClient SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. Consumer: Polling For New Messages on Greeter topic... Consumer: Received from topic => Partition: 0, Offset: 21, Message: {"greeting":"Howdy","user":"Malalažbeta"}
As you can see, the message is being received with the correct encoding from the IBM Message Hub topic. However, if I post the same message using the producer class and consume it via the IBM Cloud Function, the message is received as corrupted as described in the Slack conversation that Scott pointed to in his earlier comment.
Also note that in none of my implementations, I am specifying any encoding. I am relying on the default of UTF-8 as the encoding on both the producer as well as the consumer side and assuming that the underlying transport is bytes.
And we know OpenWhisk Actions can receive parameters don't corrupt the text. Should be pretty simple to validate.
Yes, I have already verified that and eliminated that when the OpenWhisk function/trigger is called directly (using curl) with the same payload, there is no corruption of data. Its only when the data is posted to the message hub topic and received by the whisk function (via this handler), we see corruption of data.
Confirming that there is no issue with neither the function nor the user provided IBM Message Hub trigger (calling the function) when invoked directly.
Here is the payload being sent each time:
diamond:kafka mmehra$ cat data.json { "greeting": "Howdy", "user": "Malalažbeta" }
diamond:kafka mmehra$ curl -u [XXXX:YYYYY] -d @data.json --header "Content-Type: application/json" -X POST https://openwhisk.ng.bluemix.net/api/v1/namespaces/XXXXX_dev/actions/greeting_package/say_hello?blocking=true {"duration":238,"name":"say_hello","subject":"XXXXX","activationId":"34c60d32186040c4860d321860a0c4e1","publish":false,"annotations":[{"key":"path","value":"XXXXX_dev/greeting_package/say_hello"},{"key":"waitTime","value":592},{"key":"kind","value":"nodejs:8"},{"key":"limits","value":{"timeout":60000,"memory":256,"logs":10}},{"key":"initTime","value":232}],"version":"0.0.1","response":{"result":{"Message":"Howdy, Malalažbeta","Status":"Success","Code":200},"success":true,"status":"success"},"end":1534427678991,"logs":[],"start":1534427678753,"namespace":"XXXXX_dev"} diamond:kafka mmehra$
Attached below is a screenshot of the curl call made to the user-provided trigger:
Attached below is a screenshot of the console log generated by the IBM Cloud function, showing what parameters it received in both cases:
@ScottChapman yes i agree. it is very confusing. especially the way python 2.7 handles strings and bytes.
@maneeshmehra thanks for providing that info. i'll do some testing on the provider end with your input using encode vs decode. my understanding of these builtin python functions is quite limited. i was going off of my understanding of the behavior, but as i said my understanding is quite limited and the way python handles things is rather confusing. a little testing should be able to clear things up though
thank you both for digging into this and providing feedback
You are most welcome. Please keep us posted via this issue on your findings.
Any reason not to run as python 3? Could normalize utf8 string handling.
The fix for this encoding issue has now been deployed into our production environment on Sep 4th through a support ticket.
Discovered that messages containing unicode text appear to be getting corrupted.
I suspect it is this: https://github.com/apache/incubator-openwhisk-package-kafka/blob/449bbae13e813ba4dcd11dc33f47ab29d5e3541a/provider/consumer.py#L455
From the kafka-python docs