Automattic / knox

S3 Lib
MIT License

Getting encoding artifacts and truncated objects in S3 #260

Closed bioball closed 9 years ago

bioball commented 9 years ago

I'm currently using the knox client to put JSON blobs into S3, but for some reason they're getting encoded incorrectly and truncated. For example, the text in S3 might end up looking like this:

{"message": "\"hello\""

Here's my code. This shouldn't matter, but I'm extending the knox client with a putObject method and using bluebird to create and return a promise.

What am I doing wrong here?

knox.prototype.putObject = (obj, filename) ->
  new Promise (resolve, reject) =>
    if typeof obj isnt "object"
      throw new Error("did not receive type object for putObject request: #{ obj }")
    string = JSON.stringify(obj)
    req = @put(filename, "Content-Length": string.length, "Content-Type": "application/json")
    req.end(string)

    req.on 'response', (res) ->
      if res.statusCode is 200
        resolve(res)
      else
        reject(res)
domenic commented 9 years ago

That code is not valid JavaScript, so I'm not sure how you expect it to work or me to debug it.

bioball commented 9 years ago

Sorry, that's CoffeeScript. Here is the same code again in JavaScript.

knox.prototype.putObject = function(obj, filename){
  var self = this;
  return new Promise(function(resolve, reject){
    if (typeof obj !== "object") {
      throw new Error("did not receive type object for putObject request: " + obj);
    }
    var string = JSON.stringify(obj);
    var req = self.put(filename, {
      "Content-Length": string.length,
      "Content-Type": "application/json"
    });
    req.end(string);

    req.on('response', function(res){
      if (res.statusCode === 200) {
        resolve(res);
      } else {
        reject(res);
      }
    });
  });
};
domenic commented 9 years ago

You are not passing an encoding to req.end. Try req.end(string, "utf8").

Also if you expect to view it in the browser you probably also want the header Content-Encoding: utf-8.

bioball commented 9 years ago

Ah, awesome. Can you add this to your docs? I'm thinking other people will run into the same issue that I did. I can PR if you'd like.

domenic commented 9 years ago

Yeah, PR welcome. I think the "utf8" to .end() might be the default and it's only the content-encoding that matters? Worth testing.

bioball commented 9 years ago

Ok, so setting a "Content-Encoding" header doesn't change anything, and neither does passing an encoding into req.end; I'm guessing that's because both already default to UTF-8. I'm testing this by using knox to put payloads with Chinese characters, and each time, 你好 gets turned into ä½ å¥½ when viewed directly from S3.

Ultimately, I fixed my problem by computing the Content-Length from the byte length of my string's buffer, instead of the string's character length. When I read it back out of S3, I use response.setEncoding('utf8'), which turns these artifacts back into the original UTF-8 string.

Here's what my code looks like now:

knox.prototype.putObject = function(obj, filename){
  var self = this;
  return new Promise(function(resolve, reject){
    if (typeof obj !== "object") {
      throw new Error("did not receive type object for putObject request: " + obj);
    }
    var string = JSON.stringify(obj);
    var req = self.put(filename, {
      "Content-Length": new Buffer(string).length,
      "Content-Type": "application/json"
    });
    req.end(string);

    req.on('response', function(res){
      if (res.statusCode === 200) {
        resolve(res);
      } else {
        reject(res);
      }
    });
  });
};

I'll submit a PR for this.

domenic commented 9 years ago

Ultimately, I fixed my problem by calculating the length of the buffer of my string, instead of the string itself.

Ah, good call! Yes, it's easy to make this mistake. You can use Buffer.byteLength for this to avoid creating a garbage buffer. Or you can create the buffer, and write it directly, instead of writing the string.

bioball commented 9 years ago

Cool, I'll do that instead.