Closed: andrewrk closed this issue 10 years ago.
This is not an issue with the SDK but rather with general usage of S3. If you send many parallel requests, eventually S3 will throttle your access and kill your connection.
One way you can mitigate the failures is by using HTTP's Expect 100-continue header, which S3 supports:
request.on('build', function() {
  request.httpRequest.headers['Expect'] = '100-continue';
});
This will send headers and wait for a server response prior to attempting to upload/download any data, thereby allowing S3 to return with any errors prior to accepting your connection. If this is not passed, S3's only choice is to cut your connection off in the middle of transfer, causing the behavior you are seeing.
That said, you should generally expect that you may encounter abrupt disconnections for a variety of reasons (bad internet connection, routing issues, etc.), so you should design your application to recover from any download failures. The SDK has retry logic for most operations out of the box, which is why you generally don't see these failures, but retry logic is disabled for the raw createReadStream() operation, since it is just that: a raw low-level socket operation.
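If you do need retries around createReadStream, one option is a small wrapper at the call site. A minimal sketch; the withRetries helper, its parameters, and the demo operation are illustrative, not part of the SDK:

```javascript
// Hypothetical caller-side retry helper: re-runs an async operation
// until it succeeds or the attempt budget is exhausted.
function withRetries(doDownload, maxAttempts, cb) {
  var attempt = 0;
  function run() {
    attempt++;
    doDownload(function(err) {
      if (!err) return cb(null);
      if (attempt >= maxAttempts) return cb(err);
      // Back off one second between attempts, as described above.
      setTimeout(run, 1000);
    });
  }
  run();
}

// Example: a stand-in operation that fails twice, then succeeds.
var failures = 2;
withRetries(function(done) {
  if (failures-- > 0) return done(new Error("truncated download"));
  done(null);
}, 5, function(err) {
  console.log(err ? "gave up" : "succeeded"); // prints: succeeded
});
```

In real code, doDownload would be the function that issues getObject, pipes createReadStream to disk, and verifies the result before calling back.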
Hope that helps.
What's the best number of parallel requests to maximize bandwidth?
The number will depend heavily on many different factors: bandwidth, latency, I/O, and possibly even CPU. There is no exact value. An arbitrary value to start with might be 5, but you can run some tests to see what works best for you.
Wait wait wait. In this example code I am not sending any parallel requests. It is sending one request at a time. And I'm seeing this aborted request happening quite frequently.
Parallel requests is not an explanation for the behavior I'm seeing.
Also regarding retry logic - this example code doesn't have it but the real code does, and I was seeing a request sometimes fail three times in a row, each with a second delay before retrying. It should not be this spotty.
Parallel requests are not the only way to hit S3's throttling limit, they are just an easier way. Downloading many files at once can also hit that limit.
If you add the Expect 100-continue header to your requests, do you see any error returned by S3?
Downloading many files at once can also hit that limit.
How is this different than "parallel requests"? Again, this test code downloads a single file at once, then repeats.
If you add the Expect 100-continue header to your requests, do you see any error returned by S3?
I observe no difference:
$ node test.js
downloaded 1
eTag "abac71913645791ed7a98e4461ee1a71" digest 76b63bd096db02bcc83a092c1d479307 path: test_out.json
/home/andy/tmp/npmtest/test.js:21
if (err) throw err;
^
Error: ETag does not match MD5 checksum
S3 still gives a 200 status code and then aborts the stream early.
I haven't done anything with S3 in many hours. I download a file once, and then once that is done I try to download it again and the download is aborted midway. There's no way I hit a real limit here.
What version of Node.js are you using? I'm trying to reproduce this with no luck:
...
downloaded 3411
downloaded 3412
downloaded 3413
downloaded 3414
...
One thing I will note is that if S3 is actually killing the connection mid-way, this is likely not an SDK issue. If you are getting a 'close' event, that typically means Node.js itself is seeing the connection closing, not the SDK, which means the connection is being severed outside of the SDK's control.
It's possible the Node.js version is related. It's also possible that the way the SDK is setting up the connection is incorrect, but without being able to reproduce the failure, it's hard to say for sure.
What version of Node.js are you using?
v0.10.29
I'm confused as to why you're unable to duplicate the issue. It happens regularly and frequently for me. The most "downloaded nnn" number I've ever reached was 40.
For what it's worth, here's a slightly simplified test case that only checks file size and removes the etag/md5 calculation from the equation:
var AWS = require('aws-sdk');
var fs = require('fs');

var client = new AWS.S3({
  accessKeyId: process.env.S3_KEY,
  secretAccessKey: process.env.S3_SECRET,
});
var s3Path = "1tmp1/node_modules/chem-cli/node_modules/watchify2/node_modules/browserify/node_modules/browser-builtins/node_modules/http-browserify/example/json-stream/node_modules/JSONStream/test/fixtures/all_npm.json";
var localFile = "test_out.json";
var s3Params = {
  Bucket: process.env.S3_BUCKET,
  Key: s3Path,
};
var count = 1;

downloadOnce();

function downloadOnce() {
  doTheDownload(function(err) {
    if (err) throw err;
    console.log("downloaded", count++);
    downloadOnce();
  });
}

function doTheDownload(cb) {
  var request = client.getObject(s3Params);
  var response = request.createReadStream();
  var outStream = fs.createWriteStream(localFile);
  var errorOccurred = false;
  var contentLength;

  response.on('error', handleError);
  outStream.on('error', handleError);

  request.on('build', function() {
    request.httpRequest.headers.Expect = '100-continue';
  });
  request.on('httpHeaders', function(statusCode, headers, resp) {
    if (statusCode < 300) {
      contentLength = parseInt(headers['content-length'], 10);
    } else {
      handleError(new Error("http status code " + statusCode));
    }
  });

  outStream.on('close', function() {
    if (errorOccurred) return;
    fs.stat(localFile, function(err, stat) {
      if (err) return handleError(err);
      if (stat.size !== contentLength) return handleError(new Error("size mismatch"));
      cb();
    });
  });

  response.pipe(outStream);

  function handleError(err) {
    if (errorOccurred) return;
    errorOccurred = true;
    cb(err);
  }
}
...
downloaded 24
downloaded 25
downloaded 26
downloaded 27
/home/andy/tmp/npmtest/test.js:20
if (err) throw err;
^
Error: size mismatch
at /home/andy/tmp/npmtest/test.js:54:59
at Object.oncomplete (fs.js:107:15)
I'm also trying to do the same file download with s3cmd and it's working fine. I haven't seen a single one fail in 200+ downloads.
while true; do
  rm -f test_out2.json
  s3cmd get s3://mybucket/1tmp1/node_modules/chem-cli/node_modules/watchify2/node_modules/browserify/node_modules/browser-builtins/node_modules/http-browserify/example/json-stream/node_modules/JSONStream/test/fixtures/all_npm.json test_out2.json
  printf "%d" $?
  md5sum test_out2.json
done
I ran the node code simultaneously to this and I've reproduced several failures in the node code all the while s3cmd is humming away, no problem, no retries.
I'm running your simplified case with the same results. I'm currently on 875 and it's going smoothly. I grabbed the same all_npm.json file you are using just to ensure that the data isn't the culprit, too.
Based on the code, I'm still not seeing anything pointing to the SDK. The 'close' event is coming from a plain ReadableStream object initiated by Node.js, not the SDK. If I could reproduce this on my end I could do some debugging to see if there's any discrepancy in how we forward certain events, but I'm not sure that would cause the socket to disconnect; we are simply reading the events that come down off of the stream.
Sorry, I just noticed that pipe() is being used here, so I edited out my point about data/readable events. Node's pipe() should do the right thing.
I'm taking a look at using the HTTP stream directly. I'll post code for you to try in a bit.
What is different about our environments? Bucket? Node.js version? Network distance between local computer and S3 region? Operating system (Ubuntu 14.04 LTS 64-bit)?
This issue is perfectly reproducible for me. The culprit must be in the delta between our environments.
I'm running on OS X. My bucket is in the classic region, and I'm on the west coast, using Node.js 0.10.28. I'll try 0.10.29 in a sec here; I do recall some streaming-related changes in that patchlevel release.
Related? https://github.com/joyent/node/commit/2efe4ab761666
(Note the comments at the bottom and https://github.com/mscdex/busboy/commit/8aea552aa478e351329b2b04fd5bbca22dae3375)
Oh cool. I was able to reproduce this in 0.10.29. Have you tried 0.10.28 by any chance?
Nope, I upgraded to 0.10.29 as soon as it came out and never looked back :P
I'll try it.
Related? joyent/node@2efe4ab
I don't think this would be related. I don't think we ever call .read() with .on('data') events, unless Node is doing this internally. But also, this would only hide an 'error' event, which would mean the error already occurred (not what we want to be seeing anyway).
I can't seem to reproduce reliably in 0.10.29 (it got up to 200+ on the second attempt), but I've gotten it to fail more than a handful of times now. I can't seem to get it to fail at all in 0.10.28. I will spend some time looking into what changed between these versions, but I'm leaning toward some unexpected regression in Node.js, unless we are somehow using the API incorrectly.
I'm at 400+ with v0.10.28.
There may be something related to https://github.com/joyent/node/commit/31150df92a5c5e726a4aa12b78a2589e7b20f1ec
Although we are using 'end' instead of 'finish', which seems to be the right thing. I'm looking through https://github.com/joyent/node/compare/v0.10.28...v0.10.29 to see if anything stands out.
Cool, glad we're on the right track. I'm going back to v0.10.29 now. It is the latest "stable" version after all.
I'm going back to v0.10.29 now.
Won't that cause the issue to reappear until this is fixed? It's still unclear if the issue is an SDK bug or Node.js regression. If this is a Node.js regression, 0.10.29 is probably not the right version to be on.
I've git bisected it to this commit: https://github.com/joyent/node/commit/c862c0348530824bfacef8eadfaf448120d91664
I tested v0.10.29 with that commit reverted and it fixes it. I've also tried setting sslEnabled to false in the test case, and that fixes it too.
Thanks for the deep dive @andrewrk! I will do some checking to see why that bit us and whether we were incorrectly depending on API behavior. It looks like there is something related to timing (async vs. sync), but I'm not entirely sure what just yet.
I tested node.js directly and it's fine. aws-sdk-js is using the streams API incorrectly. Here's an example of using the streams API correctly and it working: https://github.com/andrewrk/aws-sdk-js/commit/522fca46b708d820d11af55357526ce0ddf69fd6 (I have run the above test again with this code and it works fine).
Any update on this? Also is this related to #135 ?
I've just come up against this problem. I was trying to download a 12MB zip file and was getting incomplete files probably 90% of the time.
Downgrading to 0.10.28 has fixed it for now, but it's not ideal
@andrewrk I've been looking into this. Your patch could work for 0.10.x, but unfortunately PassThrough is not available in 0.8.x, so it's not possible to take that in wholesale. We could switch over to use PassThrough in 0.10.x to fix this, but I'm not convinced that the SDK is using the streams API incorrectly. I created the following gist that makes a basic ReadableStream subclass using the public documentation on the _read() and read() APIs:
https://gist.github.com/lsegal/e6966a03bcbf077724d1
The first half includes a pared-down stream implementation; the second half includes your original test code reworked to use this new impl. I get the same failures with this pared-down implementation. Note: I renamed the long object name to just test_out.json for simplicity, and if you run this, also make sure to make the object public-read, since it's just a regular unsigned GET (aws s3 cp test_out.json s3://YOURBUCKET/test_out.json --acl public-read with the CLI).
Nothing in that code pops out at me as incorrectly used; more importantly, all of the main pass-through implementation (see lines 34-50) is taken almost verbatim from Node's own documentation. If you can explain how this implementation is using the existing API incorrectly, we can adjust it, but otherwise I think there may be some kind of regression in the SDK.
I should also point out that this is not the only place we subclass Readable streams. We also use them in node.js for Buffer object uploads where PassThrough isn't as easy to drop-in, so if there is a regression here, it needs to be resolved in Node. I noticed that we also get periodic integration test failures on progress event tests, which is consistent with the issue.
FYI #135 is not related. That was the issue I thought this was initially related to when I thought this had to do with parallel downloads. That issue is specifically to do with the throttling issue that we already excluded from this.
FWIW, the actual issue here seems to be that readable events fire off right before the end event fires. You will see this behavior every time the download fails (multiple readable events quickly followed by end). It seems like the end event kills the reads before the stream is fully flushed; that would be my best guess at an explanation.
I'm not entirely sure how the commit you found affects this behavior, or why it happens at all, since those readable events should fire synchronously before end, but I definitely think that's the root cause of this issue.
On Jul 8, 2014 5:50 PM, "Loren Segal" notifications@github.com wrote:
@andrewrk I've been looking into this. Your patch could work for 0.10.x, but unfortunately PassThrough is not available in 0.8.x, so it's not possible to take that in wholesale.
This is what the readable-stream module is for. Code your app for 0.10.x and use the readable-stream for PassThrough and friends in 0.8.x
I'm not convinced that the SDK is using the streams API incorrectly.
This statement concerns me.
Specifically you should not iterate over the events and blithely re-emit them.
If the tests are fixed now I can provide a patch that works for both 0.8 and 0.10 using readable-stream.
Specifically you should not iterate over the events and blithely re-emit them.
The events aren't exactly being blindly emitted. You'll notice that in the sample, all of the events are actually listened to. In all usages of pipe(), both readable and end would need forwarding; forwarding error should not have an effect, as it's only emitted if something actually goes wrong, which is exactly what we are trying to avoid. Also, changing this behavior has no effect on the outcome of the sample. Here's an updated gist that only tracks monitored events:
https://gist.github.com/lsegal/e6966a03bcbf077724d1 (diff)
If the tests are fixed now I can provide a patch that works for both 0.8 and 0.10 using readable-stream.
The tests have been working since Friday with the workaround provided in #313, and were fixed today. I'd be happy to take a patch, but note that working around this is not the only concern; understanding what the actual contract for Readable streams is matters too. As I mentioned, it would be easy enough to switch to PassThrough and use something like readable-stream (which only became stable long after we implemented createReadStream).
I just looked at your gist. This code is no good:
Object.keys(stream._events).forEach(function(event) {
  resp.on(event, function(arg) {
    console.log(event);
    stream.emit(event, arg);
  });
});
Nowhere in the API does it say you can read the _events
property. That's a private field.
stream.on('newListener', function(event) {
  if (!sent && event === 'readable') {
    sent = true;
    process.nextTick(req.send);
  }
});
This is scary, error prone, and unnecessary.
// Taken from:
// http://nodejs.org/api/stream.html#stream_example_simpleprotocol_v1_sub_optimal
// if the source doesn't have data, we don't have data yet.
if (chunk === null) stream.push('');
This hack is unnecessary if you use the API correctly.
Just pipe everything and let Node.js handle all the tricky details.
In all usages of pipe(), both readable and end would need forwarding
This is trivially proven wrong:
var fs = require('fs');
var inStream = fs.createReadStream("blah.txt");
var PassThrough = require('stream').PassThrough;
var outStream = new PassThrough();
outStream.on('readable', function() {
  console.log("got readable");
});
outStream.on('end', function() {
  console.log("got end");
});
inStream.pipe(outStream);
var writeStream = fs.createWriteStream("blah-out.txt");
outStream.pipe(writeStream);
Output:
got readable
got end
The code example I gave you earlier is simple and effective. Delete all the weird hacks, just use pipe, and everything is fixed.
There's no bug in Node.js. The mystery has been solved. The bug is in aws-sdk-js. Please fix your code.
Thanks for your patience on this @andrewrk and @jamessharp. The above commit is the shortest path fix to get things working. I plan on doing a refactor to use PassThrough with readable-stream for backward compatibility support for 0.8.x, but a first pass broke a few tests due to some of our mocks being a little too brittle for the change.
Thanks for solving the issue :)
Looking forward to the next release.
v2.0.6 is now out with these changes.
:thumbsup:
aws-sdk@2.0.17 also breaks somewhere between node 0.10.26 and 0.10.31. I can't provide a lot of details on that; I didn't have time to reproduce it carefully. I just rolled back to 0.10.26 to keep everything working.
I'm doing HEAD requests on S3 (about a few hundred in a row), limited to a maximum of 50 in parallel. Everything was smooth, without any errors or timeouts, until we updated Node.js to 0.10.31. Then some of the requests (about 5%) started failing. Well, actually they were not failing; the callback simply didn't fire. However, as the S3 access logs showed, those requests actually went through and the response code was 200.
var params = {
  Bucket: 'my_bucket',
  Key: 'my_key'
};

s3.headObject(params, function(err, resp) {
  console.log('you will never see me!');
});
Maybe the problem is somehow related to the current issue, so I'll just leave it here.
@ybogdanov this issue was specifically related to the use of streaming APIs in the SDK (createReadStream in particular). The headObject operation doesn't use any streams, just the regular HTTP interface. If callbacks aren't firing, this is likely related to something else, though off the top of my head I'm not sure what. Your report should probably be moved into its own issue so we can track it.
I put together a test case for you. Edit the s3Path variable to point at some file on S3; my example file is 362KB. Here's an example run:
So the first 37 times it downloaded fine, but on the 38th download the AWS SDK said the file was complete when it really wasn't. If I look at test_out.json on my computer, the first 356KB are downloaded correctly and then the file is truncated.
I am seeing this problem frequently.