apache / openwhisk-package-kafka

Apache OpenWhisk package for communicating with Kafka or Message Hub
https://openwhisk.apache.org/
Apache License 2.0
33 stars 43 forks source link

Binary Encoding causes Corruption #269

Open maximann opened 6 years ago

maximann commented 6 years ago

It looks to me like https://github.com/apache/incubator-openwhisk-package-kafka/blob/449bbae13e813ba4dcd11dc33f47ab29d5e3541a/provider/consumer.py#L455 encodes all values as UTF-8 before base64. My understanding is that some byte values are not valid UTF-8, thus corruption can occur at the line above. ISO-8859-1 seems like the better choice. I'd be happy to file a bug if this seems like a reasonable hypothesis and if someone could point me to where to do that. I was also wondering if someone has encountered a similar issue before, especially since binary serializers such as Avro are popular for Kafka.

Slack Transcript: id you try the with the parameter “isBinaryValue” set to true ?

Max [18 hours ago] Yes

csantanapr [18 hours ago] I believe the intention is to take the binary data encode base64 to be able to be sent to the controller http rest api

csantanapr [18 hours ago] The controller will take the base64 in the json, passed it down thru kafka, invoker sends the base64 data in a http post request to the function runtime, the function receives the application/json with the this field base64 data

Max [18 hours ago] Right, that makes sense. But my understanding is all of this happens after binary data was consumed in the provider code linked above, so the corrupted data gets passed along just fine and makes it to the invocation correctly. But when binary messages are consumed from the external kafka cluster (the cluster which the trigger is setup for), and encoded as a UTF-8 strings before being passed along to the controller, the corruption has already happened. Or at least that's what I think is happening.

csantanapr [17 hours ago] I think we saw a problem in practice we found messages had corrupted data and the line 456 is what tries to attempt to clean

Max [17 hours ago] Definitely seeing that in my logs: [2018-06-06T23:24:13.040Z] [WARNING] [??] [kafkatriggers] [/core/qmpTrigger] Value contains non-unicode bytes. Replacing invalid bytes.

Max [17 hours ago] If you're ok with it I'll try to fix this for binary formats such as Avro tomorrow and send a diff your way.

csantanapr [17 hours ago] You mean don’t do encode utf-8 when isBinaryValue=true ?

csantanapr [17 hours ago] My concern is current users how are they impacted in production with the change

csantanapr [17 hours ago] When they proceess the input value in the function and they decode base64 they get UTF-8 data and work on that data. (edited)

csantanapr [17 hours ago] With this change then their value data will not be UTF-8 and they functions would start to fail unexpectedly?

csantanapr [17 hours ago] I’m all in for improving the code but no throwing the current customers under the bus, oops I meant users :smiley_cat:

csantanapr [17 hours ago] Open and issue on the kafka repo and submit a PR so we can discuss more in depth the details with others

Max [17 hours ago] Sounds good, will do

Max [17 hours ago] Thanks

Max [17 hours ago] (I agree, valid concerns) (edited)

csantanapr [17 hours ago] Maybe we need an extra parameter, or maybe is compatible or we try one and then other one think about it

csantanapr [17 hours ago] Or deprecate and use new method, or handle triggers already created leave them alone etc...

csantanapr [17 hours ago] Those are the type of things I also want to include as part of the fix and companion tests how roll it out to production

csantanapr [17 hours ago] This is Awesome that people are coming forward to contribute to event feeds +1 cc @dubee

Max [17 hours ago] FWIW, I doubt that true binary works at all. The issue is explained here as well: https://haacked.com/archive/2012/01/30/hazards-of-converting-binary-data-to-a-string.aspx/ You’ve Been Haacked Hazards of Converting Binary Data To A String Back in November, someone asked a question on StackOverflow about converting arbitrary binary data (in the form of a byte array) to a string. I know this because I make it a habit to read randomly selected questions in StackOverflow written in November 2011. Questions about text encodings in particular really turn me on. Jan 29th, 2012 at 4:00 PM (edited)

csantanapr [17 hours ago] You might be correct

csantanapr [17 hours ago] We just need something that can be safely pass thru http application/json

csantanapr [17 hours ago] That’s why my explanation on all the touch points that data gets touch and transform

csantanapr [17 hours ago] In the issue we can ping @jberstler he was the original author that dealt with binary use case.

maximann commented 6 years ago

I will submit a PR shortly.

rabbah commented 6 years ago

@maximann fork the repo, then you can create branches in your own fork, and submit a pull request.