aws / aws-sdk-js

AWS SDK for JavaScript in the browser and Node.js
https://aws.amazon.com/developer/language/javascript/
Apache License 2.0

S3 sometimes does not complete downloads #312

Closed andrewrk closed 10 years ago

andrewrk commented 10 years ago

I put together a test case for you:

var AWS = require('aws-sdk');
var fs = require('fs');
var crypto = require('crypto');

var client = new AWS.S3({
    accessKeyId: process.env.S3_KEY,
    secretAccessKey: process.env.S3_SECRET,
});

var s3Path = "1tmp1/node_modules/chem-cli/node_modules/watchify2/node_modules/browserify/node_modules/browser-builtins/node_modules/http-browserify/example/json-stream/node_modules/JSONStream/test/fixtures/all_npm.json";
var localFile = "test_out.json";
var s3Params = {
  Bucket: process.env.S3_BUCKET,
  Key: s3Path,
};
var count = 1;
downloadOnce();
function downloadOnce() {
  doTheDownload(function(err) {
    if (err) throw err;
    console.log("downloaded", count++);
    downloadOnce();
  });
}

function doTheDownload(cb) {
  var request = client.getObject(s3Params);
  var response = request.createReadStream();
  var outStream = fs.createWriteStream(localFile);
  var hash = crypto.createHash('md5');
  var errorOccurred = false;
  var eTag = "";

  response.on('error', handleError);
  outStream.on('error', handleError);

  request.on('httpHeaders', function(statusCode, headers, resp) {
    if (statusCode < 300) {
      var contentLength = parseInt(headers['content-length'], 10);
      eTag = headers.etag || "";
    } else {
      handleError(new Error("http status code " + statusCode));
    }
  });

  hash.on('data', function(digest) {
    if (!compareETag(eTag, digest)) {
      console.log("eTag", eTag, "digest", digest.toString('hex'), "path:", localFile);
      handleError(new Error("ETag does not match MD5 checksum"));
    }
  });

  outStream.on('close', function() {
    if (errorOccurred) return;
    cb();
  });

  response.pipe(outStream);
  response.pipe(hash);

  function handleError(err) {
    if (errorOccurred) return;
    errorOccurred = true;
    cb(err);
  }
}

function compareETag(eTag, md5Buffer) {
  eTag = eTag.replace(/^\s*'?\s*"?\s*(.*?)\s*"?\s*'?\s*$/, "$1");
  var hex = md5Buffer.toString('hex');
  return eTag === hex;
}

Edit the s3Path variable to point to some file in your S3 bucket. My example file is 362KB.

Here's an example run:

$ S3_BUCKET='examplebucket' S3_KEY='examplekey' S3_SECRET='examplesecret' node test.js
downloaded 1
downloaded 2
downloaded 3
downloaded 4
downloaded 5
downloaded 6
downloaded 7
downloaded 8
downloaded 9
downloaded 10
downloaded 11
downloaded 12
downloaded 13
downloaded 14
downloaded 15
downloaded 16
downloaded 17
downloaded 18
downloaded 19
downloaded 20
downloaded 21
downloaded 22
downloaded 23
downloaded 24
downloaded 25
downloaded 26
downloaded 27
downloaded 28
downloaded 29
downloaded 30
downloaded 31
downloaded 32
downloaded 33
downloaded 34
downloaded 35
downloaded 36
downloaded 37
eTag "abac71913645791ed7a98e4461ee1a71" digest 70e838acf6649dc50bbac3e40272a674 path: test_out.json

/home/andy/tmp/npmtest/test.js:21
    if (err) throw err;
                   ^
Error: ETag does not match MD5 checksum
    at Hash.<anonymous> (/home/andy/tmp/npmtest/test.js:50:19)
    at Hash.emit (events.js:95:17)
    at Hash.<anonymous> (_stream_readable.js:748:14)
    at Hash.emit (events.js:92:17)
    at emitReadable_ (_stream_readable.js:410:10)
    at emitReadable (_stream_readable.js:406:5)
    at readableAddChunk (_stream_readable.js:168:9)
    at Hash.Readable.push (_stream_readable.js:130:10)
    at Hash.Transform.push (_stream_transform.js:140:32)
    at Hash._flush (crypto.js:201:8)

So the first 37 downloads completed fine, but on the 38th download the AWS SDK reported the file as complete when it really wasn't. If I go look at test_out.json on my computer, the first 356KB are downloaded correctly and then the file is truncated.

I am seeing this problem frequently.

lsegal commented 10 years ago

This is not an issue with the SDK but rather with general usage of S3. If you send many parallel requests, eventually S3 will throttle your access and kill your connection.

One way you can mitigate the failures is by using HTTP's Expect 100-continue header, which S3 supports:

request.on('build', function() {
  request.httpRequest.headers['Expect'] = '100-continue';
});

This sends the headers and waits for a server response before attempting to upload or download any data, allowing S3 to return any errors before accepting your connection. If this header is not sent, S3's only choice is to cut your connection off in the middle of the transfer, causing the behavior you are seeing.

That said, you should generally expect that you may encounter abrupt disconnections for a variety of reasons (bad internet connection, routing issues, etc.), so you should design your application to recover from any download failures. The SDK has retry logic for most operations out of the box, which is why you generally don't see these failures, but retry logic is disabled for the raw createReadStream() operation, since it is just that: a raw low-level socket operation.
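
As a sketch of that kind of recovery (this wraps the doTheDownload(cb) function from the test case above; the retry count and delay are arbitrary, not SDK behavior):

// Minimal retry wrapper: try the streaming download a few times before giving up.
function downloadWithRetry(retriesLeft, cb) {
  doTheDownload(function(err) {
    if (!err) return cb();
    if (retriesLeft <= 0) return cb(err);
    console.log("download failed, retrying:", err.message);
    setTimeout(function() {
      downloadWithRetry(retriesLeft - 1, cb);
    }, 1000); // fixed 1-second delay; exponential backoff is another option
  });
}

downloadWithRetry(3, function(err) {
  if (err) throw err;
  console.log("downloaded (possibly after retries)");
});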

Hope that helps.

andrewrk commented 10 years ago

What's the best number of parallel requests to maximize bandwidth?

lsegal commented 10 years ago

The number will depend heavily on many different factors-- bandwidth, latency, I/O, and possibly even CPU. There is no exact value. An arbitrary value to start with might be 5, but you can do some tests to see what works best for you.
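
If you do go the parallel route, a minimal sketch of capping concurrency might look like this (downloadOne(key, cb) is a placeholder helper along the lines of the doTheDownload() function above, not an SDK API):

var keys = [ /* ... S3 keys to fetch ... */ ];
var concurrency = 5; // arbitrary starting point; tune it for your environment
var inFlight = 0;
var nextIndex = 0;

function pump() {
  // keep at most `concurrency` downloads in flight at any time
  while (inFlight < concurrency && nextIndex < keys.length) {
    inFlight += 1;
    downloadOne(keys[nextIndex++], function(err) {
      inFlight -= 1;
      if (err) console.error("download failed:", err);
      pump(); // start another download as soon as a slot frees up
    });
  }
}
pump();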

andrewrk commented 10 years ago

Wait wait wait. In this example code I am not sending any parallel requests. It is sending one request at a time. And I'm seeing this aborted request happening quite frequently.

Parallel requests is not an explanation for the behavior I'm seeing.

andrewrk commented 10 years ago

Also regarding retry logic - this example code doesn't have it but the real code does, and I was seeing a request sometimes fail three times in a row, each with a second delay before retrying. It should not be this spotty.

lsegal commented 10 years ago

Parallel requests are not the only way to hit S3's throttling limit, they are just an easier way. Downloading many files at once can also hit that limit.

If you add the Expect 100-continue header to your requests, do you see any error returned by S3?

andrewrk commented 10 years ago

Downloading many files at once can also hit that limit.

How is this different than "parallel requests"? Again, this test code downloads a single file at once, then repeats.

If you add the Expect 100-continue header to your requests, do you see any error returned by S3?

I observe no difference:

$ node test.js
downloaded 1
eTag "abac71913645791ed7a98e4461ee1a71" digest 76b63bd096db02bcc83a092c1d479307 path: test_out.json

/home/andy/tmp/npmtest/test.js:21
    if (err) throw err;
                   ^
Error: ETag does not match MD5 checksum

S3 still gives a 200 status code and then aborts the stream early.

I haven't done anything with S3 in many hours. I download a file once, and then once that is done I try to download it again and the download is aborted midway. There's no way I hit a real limit here.

lsegal commented 10 years ago

What version of Node.js are you using? I'm trying to reproduce this with no luck:

...
downloaded 3411
downloaded 3412
downloaded 3413
downloaded 3414
...

One thing I will note is that if S3 is actually killing the connection mid-way, this is likely not an SDK issue. If you are getting a 'close' event, that typically means Node.js itself is seeing the connection closing, not the SDK, which means the connection is being severed outside of the SDK's control.

It's possible the Node.js version is related. It's also possible that the way the SDK is setting up the connection is incorrect, but without being able to reproduce, it's hard to say for sure.

andrewrk commented 10 years ago

What version of Node.js are you using?

v0.10.29

andrewrk commented 10 years ago

I'm confused as to why you're unable to duplicate the issue. It happens regularly and frequently for me. The most "downloaded nnn" number I've ever reached was 40.

For what it's worth, here's a slightly simplified test case that only checks file size and removes the etag/md5 calculation from the equation:

var AWS = require('aws-sdk');
var fs = require('fs');

var client = new AWS.S3({
    accessKeyId: process.env.S3_KEY,
    secretAccessKey: process.env.S3_SECRET,
});

var s3Path = "1tmp1/node_modules/chem-cli/node_modules/watchify2/node_modules/browserify/node_modules/browser-builtins/node_modules/http-browserify/example/json-stream/node_modules/JSONStream/test/fixtures/all_npm.json";
var localFile = "test_out.json";
var s3Params = {
  Bucket: process.env.S3_BUCKET,
  Key: s3Path,
};
var count = 1;
downloadOnce();
function downloadOnce() {
  doTheDownload(function(err) {
    if (err) throw err;
    console.log("downloaded", count++);
    downloadOnce();
  });
}

function doTheDownload(cb) {
  var request = client.getObject(s3Params);
  var response = request.createReadStream();
  var outStream = fs.createWriteStream(localFile);
  var errorOccurred = false;
  var contentLength;

  response.on('error', handleError);
  outStream.on('error', handleError);

  request.on('build', function() {
    request.httpRequest.headers.Expect = '100-continue';
  });

  request.on('httpHeaders', function(statusCode, headers, resp) {
    if (statusCode < 300) {
      contentLength = parseInt(headers['content-length'], 10);
    } else {
      handleError(new Error("http status code " + statusCode));
    }
  });

  outStream.on('close', function() {
    if (errorOccurred) return;
    fs.stat(localFile, function(err, stat) {
      if (err) return handleError(err);
      if (stat.size !== contentLength) return handleError(new Error("size mismatch"));
      cb();
    });
  });

  response.pipe(outStream);

  function handleError(err) {
    if (errorOccurred) return;
    errorOccurred = true;
    cb(err);
  }
}
...
downloaded 24
downloaded 25
downloaded 26
downloaded 27

/home/andy/tmp/npmtest/test.js:20
    if (err) throw err;
                   ^
Error: size mismatch
    at /home/andy/tmp/npmtest/test.js:54:59
    at Object.oncomplete (fs.js:107:15)

andrewrk commented 10 years ago

I'm also trying to do the same file download with s3cmd and it's working fine. I haven't seen a single one fail in 200+ downloads.

while true;
do rm -f test_out2.json;
  s3cmd get s3://mybucket/1tmp1/node_modules/chem-cli/node_modules/watchify2/node_modules/browserify/node_modules/browser-builtins/node_modules/http-browserify/example/json-stream/node_modules/JSONStream/test/fixtures/all_npm.json test_out2.json;
  printf "%d" $?;
  md5sum test_out2.json;
done

I ran the node code simultaneously to this and I've reproduced several failures in the node code all the while s3cmd is humming away, no problem, no retries.

lsegal commented 10 years ago

I'm running your simplified case with the same results. I'm currently on 875 and it's going smoothly. I grabbed the same all_npm.json file you are using just to ensure that the data isn't the culprit, too.

Based on the code, I'm still not seeing anything pointing to the SDK. The 'close' event is coming from a plain ReadableStream object initiated by Node.js, not the SDK. If I could reproduce this on my end I could do some debugging to see if there's any discrepancy in how we forward certain events, but I'm not sure that would cause the socket to disconnect-- we are simply reading the events that come down off of the stream.

lsegal commented 10 years ago

Sorry, I just noticed that pipe() is being used here, so I edited out my point about data/readable events. Node's pipe() should do the right thing.

I'm taking a look at using the HTTP stream directly. I'll post code for you to try in a bit.

andrewrk commented 10 years ago

What is different about our environments? Bucket? Node.js version? Network distance between local computer and S3 region? Operating system (Ubuntu 14.04 LTS 64-bit)?

This issue is perfectly reproducible for me. The culprit must be in the delta between our environments.

lsegal commented 10 years ago

I'm running on OS X. My bucket is in the classic region, and I'm on the west coast. Using Node.js 0.10.28. I'll try 0.10.29 in a sec here; I do recall some streaming-related changes in that patch-level release.

andrewrk commented 10 years ago

Related? https://github.com/joyent/node/commit/2efe4ab761666

(Note the comments at the bottom and https://github.com/mscdex/busboy/commit/8aea552aa478e351329b2b04fd5bbca22dae3375)

lsegal commented 10 years ago

Oh cool. I was able to reproduce this in 0.10.29. Have you tried 0.10.28 by any chance?

andrewrk commented 10 years ago

Nope, I upgraded to 0.10.29 as soon as it came out and never looked back :P

I'll try it.

lsegal commented 10 years ago

Related? joyent/node@2efe4ab

I don't think this would be related-- I don't think we ever call .read() with .on('data') events-- unless Node is doing this internally. But also, this would only hide an 'error' event, which would mean the error already occurred (not what we want to be seeing anyway).

lsegal commented 10 years ago

I can't seem to reproduce it reliably in 0.10.29 (it got up to 200+ on the second attempt), but I've gotten it to fail more than a handful of times now. I can't seem to get it to fail at all in 0.10.28. I will spend some time looking into what changed between these versions, but I'm leaning toward some unexpected regression in Node.js, unless we are somehow using the API incorrectly.

andrewrk commented 10 years ago

I'm at 400+ with v0.10.28.

lsegal commented 10 years ago

There may be something related to https://github.com/joyent/node/commit/31150df92a5c5e726a4aa12b78a2589e7b20f1ec

lsegal commented 10 years ago

Although we are using 'end' instead of 'finish', which seems to be the right thing. I'm looking through https://github.com/joyent/node/compare/v0.10.28...v0.10.29 to see if anything stands out.

andrewrk commented 10 years ago

Cool, glad we're on the right track. I'm going back to v0.10.29 now. It is the latest "stable" version after all.

lsegal commented 10 years ago

I'm going back to v0.10.29 now.

Won't that cause the issue to reappear until this is fixed? It's still unclear if the issue is an SDK bug or Node.js regression. If this is a Node.js regression, 0.10.29 is probably not the right version to be on.

andrewrk commented 10 years ago

I've git bisected it to this commit: https://github.com/joyent/node/commit/c862c0348530824bfacef8eadfaf448120d91664

andrewrk commented 10 years ago

I tested v0.10.29 with that commit reverted and the problem goes away. Setting sslEnabled to false in the test case also fixes it.
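
(For reference, that workaround is just a constructor option on the client; sslEnabled is a standard aws-sdk v2 config option. A sketch, and a workaround only, since it drops to plain HTTP:)

var AWS = require('aws-sdk');

var client = new AWS.S3({
  accessKeyId: process.env.S3_KEY,
  secretAccessKey: process.env.S3_SECRET,
  sslEnabled: false // plain HTTP, presumably avoiding the code path from the bisected commit
});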

lsegal commented 10 years ago

Thanks for the deep dive @andrewrk! I will do some checking to see why that bit us and if we were incorrectly depending on API behavior. It looks like there is something related to timing (async vs. sync), but I'm not entirely sure what just yet.

andrewrk commented 10 years ago

I tested node.js directly and it's fine. aws-sdk-js is using the streams API incorrectly. Here's an example of using the streams API correctly and it working: https://github.com/andrewrk/aws-sdk-js/commit/522fca46b708d820d11af55357526ce0ddf69fd6 (I have run the above test again with this code and it works fine).
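
Roughly, the idea is to hand the caller a PassThrough that the underlying HTTP response is piped into, rather than re-emitting events from a hand-rolled Readable subclass. This is a sketch of the approach, not the actual patch; httpResponseStream is a placeholder for whatever readable stream the HTTP layer hands back:

var PassThrough = require('stream').PassThrough;

function wrapHttpResponse(httpResponseStream) {
  var out = new PassThrough();
  httpResponseStream.on('error', function(err) {
    out.emit('error', err); // pipe() does not forward errors
  });
  httpResponseStream.pipe(out);
  return out;
}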

andrewrk commented 10 years ago

Any update on this? Also is this related to #135 ?

jamessharp commented 10 years ago

I've just come up against this problem. I was trying to download a 12MB zip file and was getting incomplete files probably 90% of the time.

Downgrading to 0.10.28 has fixed it for now, but it's not ideal.

lsegal commented 10 years ago

@andrewrk I've been looking into this. Your patch could work for 0.10.x, but unfortunately PassThrough is not available in 0.8.x, so it's not possible to take that in wholesale. We could switch over to use PassThrough in 0.10.x to fix this, but I'm not convinced that the SDK is using the streams API incorrectly. I created the following gist that makes a basic ReadableStream subclass using the public documentation on the _read() and read() APIs:

https://gist.github.com/lsegal/e6966a03bcbf077724d1

The first half contains a pared-down stream implementation, and the second half contains your original test code reworked to use this new implementation. I get the same failures with this pared-down implementation. Note: I renamed the long object name to just test_out.json for simplicity, and if you run this, also make sure to make the object public-read since it's just a regular unsigned GET (aws s3 cp test_out.json s3://YOURBUCKET/test_out.json --acl public-read with the CLI).

Nothing in that code pops out at me as incorrectly used-- more importantly, all of the main pass-through implementation (see lines 34-50) is taken almost verbatim from Node's own documentation. If you can explain how this implementation is using the existing API incorrectly, we can adjust it, but otherwise I think there may be some kind of regression in the SDK.

I should also point out that this is not the only place we subclass Readable streams. We also use them in node.js for Buffer object uploads where PassThrough isn't as easy to drop-in, so if there is a regression here, it needs to be resolved in Node. I noticed that we also get periodic integration test failures on progress event tests, which is consistent with the issue.

FYI #135 is not related. That was the issue I thought this was initially related to when I thought this had to do with parallel downloads. That issue is specifically to do with the throttling issue that we already excluded from this.

lsegal commented 10 years ago

FWIW, the actual issue here seems to be that readable events fire off right before the end event fires. You will see this behavior every time the download fails (multiple readable events quickly followed by end). It seems like the end event kills the reads before the stream is fully flushed-- that would be my best-guess explanation.

I'm not entirely sure how the commit you found affects this behavior, or why it happens, since those readable events should fire synchronously before end, but I definitely think that's the root cause of this issue.

andrewrk commented 10 years ago

On Jul 8, 2014 5:50 PM, "Loren Segal" notifications@github.com wrote:

@andrewrk I've been looking into this. Your patch could work for 0.10.x, but unfortunately PassThrough is not available in 0.8.x, so it's not possible to take that in wholesale.

This is what the readable-stream module is for. Code your app against 0.10.x and use readable-stream for PassThrough and friends on 0.8.x.
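
For instance (a sketch; the readable-stream npm module mirrors the 0.10 streams API on older runtimes):

// Use the core PassThrough on 0.10+, and fall back to the readable-stream
// module's implementation on 0.8.x where core has none.
var stream = require('stream');
var PassThrough = stream.PassThrough || require('readable-stream').PassThrough;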

I'm not convinced that the SDK is using the streams API incorrectly.

This statement concerns me.

Specifically, you should not iterate over the events and blithely re-emit them.

If the tests are fixed now I can provide a patch that works for both 0.8 and 0.10 using readable-stream.

lsegal commented 10 years ago

Specifically you should not iterate over the events and blithely re-emit them.

The events aren't exactly being blindly emitted. You'll notice that in the sample, all of the events are actually listened to. In all usages of pipe(), both readable and end would need forwarding-- forwarding error should not have an effect, as it's only emitted if something actually goes wrong, which is exactly what we are trying to avoid. Also, changing this behavior has no effect on the outcome of the sample-- here's an updated gist that only tracks monitored events:

https://gist.github.com/lsegal/e6966a03bcbf077724d1 (diff)

If the tests are fixed now I can provide a patch that works for both 0.8 and 0.10 using readable-stream.

The tests have been working since Friday with the workaround provided in #313, and were fixed today. I'd be happy to take a patch, but note that working around this is not the only concern-- the other is understanding what the actual contract for Readable streams is. As I mentioned, it would be easy enough to switch to PassThrough and use something like readable-stream (which only became stable long after we implemented createReadStream).

andrewrk commented 10 years ago

I just looked at your gist. This code is no good:

      Object.keys(stream._events).forEach(function(event) {
        resp.on(event, function(arg) {
          console.log(event);
          stream.emit(event, arg);
        });
      });

Nowhere in the API does it say you can read the _events property. That's a private field.

  stream.on('newListener', function(event) {
    if (!sent && event === 'readable') {
      sent = true;
      process.nextTick(req.send);
    }
  });

This is scary, error prone, and unnecessary.

        // Taken from:
        // http://nodejs.org/api/stream.html#stream_example_simpleprotocol_v1_sub_optimal
        // if the source doesn't have data, we don't have data yet.
        if (chunk === null) stream.push('');

This hack is unnecessary if you use the API correctly.

Just pipe everything and let Node.js handle all the tricky details.

In all usages of pipe(), both readable and end would need forwarding

This is trivially proven wrong:

var fs = require('fs');
var inStream = fs.createReadStream("blah.txt");

var PassThrough = require('stream').PassThrough;

var outStream = new PassThrough();
outStream.on('readable', function() {
  console.log("got readable");
});
outStream.on('end', function() {
  console.log("got end");
});
inStream.pipe(outStream);
var writeStream = fs.createWriteStream("blah-out.txt");
outStream.pipe(writeStream);

Output:

got readable
got end

The code example I gave you earlier is simple and effective. Delete all the weird hacks, just use pipe, and everything is fixed.

There's no bug in Node.js. The mystery has been solved. The bug is in aws-sdk-js. Please fix your code.

lsegal commented 10 years ago

Thanks for your patience on this @andrewrk and @jamessharp. The above commit is the shortest path fix to get things working. I plan on doing a refactor to use PassThrough with readable-stream for backward compatibility support for 0.8.x, but a first pass broke a few tests due to some of our mocks being a little too brittle for the change.

andrewrk commented 10 years ago

Thanks for solving the issue :)

Looking forward to the next release.

lsegal commented 10 years ago

v2.0.6 is now out with these changes.

andrewrk commented 10 years ago

:thumbsup:

ybogdanov commented 9 years ago

aws-sdk@2.0.17 also breaks somewhere between node 0.10.26 and 0.10.31. I can't provide a lot of details; I didn't have time to reproduce it carefully and just rolled back to 0.10.26 to keep everything working. I'm doing HEAD requests on S3 (a few hundred in a row), capped at 50 in parallel. Everything was smooth, without any errors or timeouts, until we updated Node.js to 0.10.31. Some of the requests (about 5%) started failing. Well, actually they were not failing - the callback simply never fired. However, as the S3 access logs showed, those requests actually reached S3 and the response code was 200.


var AWS = require('aws-sdk');
var s3 = new AWS.S3();

var params = {
  Bucket: 'my_bucket',
  Key: 'my_key'
};

s3.headObject(params, function(err, resp) {
  // on 0.10.31 this callback sometimes never fires, even though the
  // S3 access logs show a 200 response
  console.log('you will never see me!');
});

Maybe the problem is somehow related to the current issue, so I'll just leave it here.

lsegal commented 9 years ago

@ybogdanov this issue was specifically related to the use of streaming APIs in the SDK (createReadStream in particular). The headObject operation doesn't use any streams, just the regular HTTP interface. If callbacks aren't firing, that is likely related to something else, though off the top of my head I'm not sure what. Your report should perhaps be moved into its own issue so we can track it.

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.