aws / aws-sdk-js

AWS SDK for JavaScript in the browser and Node.js
https://aws.amazon.com/developer/language/javascript/
Apache License 2.0

s3.getObject(params).createReadStream() Timeouts #2087

Closed bebu259 closed 4 months ago

bebu259 commented 6 years ago

I have a node-express application with a GET route that serves the download of a zip archive. The zip file is created on-the-fly when the request starts. To zip all requested files into one archive I use node-archiver. The single files are streamed from an AWS S3 bucket and the zipped archive is piped into the response.

I use: aws-sdk version: 2.256.1 express version: 4.16.3 node version: 8.3.0

The code:

export async function downloadArchive(req, res, next) {
  const { archiveContent } = await getArchiveContent(req.params.id);

  // --- create archive ---

  const archive = archiver('zip');

  archive
    .on('error', (err) => {
      console.log('archive error', err);
    })
    .on('entry', entry => {
      console.log('archive event: entry ', entry.name);
    })
    .on('progress', data => {
      console.log('archive event: progress', data);
    })
    .on('data', data => {
      console.log(`archive event: data | processed total ${(archive.pointer() / (1024 ** 2)).toFixed(3)} MB | data length: ${data.length}`);
    });

  res.setHeader('Content-disposition', `attachment; filename*=UTF-8''content-archive.zip; filename=content-archive.zip`);
  res.setHeader('Content-type', 'application/zip');

  archive.pipe(res);

  // --- create all S3 read streams and append them to the archive ---

  for (const entry of archiveContent) {
    console.log(`create stream for ${entry.nameInArchive}`);

    const stream = s3
      .getObject({ Bucket: S3_BUCKET_NAME, Key: entry.path })
      .createReadStream()
      .on('error', err => {
        console.log('stream error', err);
      })
      .on('finish', () => {
        console.log('stream finish');
      })
      .on('close', () => {
        console.log('stream close');
      });

    // append read stream to archive:
    archive.append(stream, { name: entry.nameInArchive });
  }

  // --- response events ---

  res
    .on('error', err => {
      console.log('res error', err);
    })
    .on('finish', () => {
      console.log('res finish');
    })
    .on('close', () => {
      console.log('res close');
      archive.abort();
    });

  archive.pipe(res);
  archive.finalize();
}

The code works but there are some things I do not quite understand:

1. Receiving Timeouts after read streams have finished

When the download finishes successfully, a finish event is triggered on all read streams. But 120 seconds later, an error event is triggered on each read stream: TimeoutError: Connection timed out after 120000ms. Is this expected behavior, or do I have to disconnect all read streams manually?

2. A new request gets stuck when the previous request was canceled

The scenario is like this:

  • the user starts a request for zipping a lot of files (let's say 200)
  • after the download has started, the user cancels the download in his browser
  • a close event is triggered on the response and the archive stream is closed
  • then the user starts the same request a second time
  • all 200 read streams are created again in this second request
  • but the archive gets stuck and no more bytes are read from the read streams

I tried to figure out what happens here... Maybe the first 200 open connections of the canceled read streams are the cause...? When I do this scenario with a small number of files (<10), each new request after a canceled request works well.

My questions:

  • is there a limit of open connections to S3 objects?
  • do I have to close connections that are not needed anymore, and if so, how?
  • in general (and maybe more a Node.js performance question): is it good practice to open hundreds of streams at once?

yousharizvi commented 6 years ago

Any progress on this issue? I'm having exactly the same scenario and facing the same issue: on some streams I receive a timeout, and eventually the download breaks.

geekguy commented 5 years ago

Facing a similar problem. Is there any update on this?

JarrodAMcEvers commented 5 years ago

+1

tangleMesh commented 5 years ago

+1

kononenko-a-m commented 5 years ago

@AllanFly120 is someone actually working on this?

yousharizvi commented 5 years ago

I found a workaround by instantiating a new S3 instance for every read request, i.e., avoiding a singleton S3 instance. Basically it was choking the S3 instance on my end. Try doing aws.s3() before every getObject() call, i.e., aws.s3().getObject(params).createReadStream(). Hope that'll help.
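
For reference, a minimal sketch of that per-request workaround using the SDK's documented constructor (new AWS.S3()); the bucket and key below are just placeholders:

const AWS = require('aws-sdk');

// Create a fresh S3 client for each download instead of reusing one shared instance.
function createObjectStream(bucket, key) {
  const s3 = new AWS.S3();
  return s3.getObject({ Bucket: bucket, Key: key }).createReadStream();
}

// usage: createObjectStream('my-bucket', 'path/to/file').pipe(res);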

tangleMesh commented 5 years ago

I found a workaround by instantiating a new S3 instance for every read request, i.e., avoiding a singleton S3 instance. Basically it was choking the S3 instance on my end. Try doing aws.s3() before every getObject() call, i.e., aws.s3().getObject(params).createReadStream(). Hope that'll help.

Didn't work for me. The same timeout error appears :/

ptcc commented 5 years ago

I have the same issue: timeout after 120s even if the stream finished successfully

jeffbski commented 5 years ago

I believe this is the httpOptions.timeout, which defaults to 120000ms.

Increase the timeout by passing options to whichever AWS client you are using; for example, with S3:

const awsOptions = {
  region: 'us-east-1',
  httpOptions: {
    timeout: 900000 // 15 minutes
  }
};
const s3 = new AWS.S3(awsOptions);

Use whatever timeout makes sense for your situation.

jeffbski commented 5 years ago

Also, I found that if it is timing out for you when invoking it via the CLI locally, you should set the environment variable AWS_CLIENT_TIMEOUT to override the default 120000ms timeout it uses.

export AWS_CLIENT_TIMEOUT=900000
sls invoke -f hello

ptcc commented 5 years ago

@jeffbski this is not about the timeout being insufficient. I get this error even when the request is handled quickly.

jeffbski commented 5 years ago

Well different issue than what I was experiencing then. Mine went away with these changes.

0xtmphey commented 5 years ago

+1 The same for me. Does anybody know if this also happens with other language SDKs, say Java?

alexandruluca commented 5 years ago

Any follow-up on this? I'm in the exact same situation as the OP.

alexandruluca commented 5 years ago

@bebu259 Did you manage to find a solution to your problem?

bebu259 commented 5 years ago

@alexandruluca I found a solution by creating the streams lazily, using PassThrough streams.

The idea is to create each underlying stream only when the first listener registers on it. So in practice only a few streams exist at the same time, not hundreds of streams.

The code that solved my initial question looks like this:

import { PassThrough } from 'stream';

// ...

const getStream = (path) => {
  let streamCreated = false;

  const passThroughStream = new PassThrough();

  passThroughStream.on('newListener', event => {
    if (!streamCreated && event === 'data') {
      logEvents && console.log(`\n⭐  create stream for path ${path}`);

      const s3Stream = Storage.getFileStream(path);  // <<== stream is created here

      s3Stream
        .on('error', err => passThroughStream.emit('error', err))
        .on('finish', () => logEvents && console.log(`✅  finish stream for path ${path}`))
        .on('close', () => logEvents && console.log(`❌  stream close\n`))
        .pipe(passThroughStream);

      streamCreated = true;
    }
  });
  return passThroughStream;
};

// ...

for (const entry of content) {
  archive.append(getStream(entry.path), { name: entry.nameInArchive });
}
alexandruluca commented 5 years ago

@bebu259 Thanks, I'll give it a try

ivan-kuzma-scx commented 5 years ago

Facing the same issue with Node.js when reading a 1GB file with a stream.

It works fine for, say, half the file, and after that I receive TimeoutError: Connection timed out after 120000ms

const dumpMeta = await getObjectMeta({ bucket, key: s3FileKey });

downloadDumpStream = downloadStream({
  bucket,
  key: s3FileKey,
  range: range(from, to, dumpMeta.ContentLength),
});

downloadDumpStream.pipe(streamParser);

The stream parser is a slow consumer, but it pauses the stream when the concurrency limit is reached and resumes after its inner queue is flushed.
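
For illustration, a rough sketch of that pause/resume backpressure pattern (parserStream here is a hypothetical writable stream standing in for the parser, not the author's code):

// Pause the S3 read stream while the slow consumer is backed up, resume once it drains.
downloadDumpStream.on('data', (chunk) => {
  if (!parserStream.write(chunk)) {
    downloadDumpStream.pause();
    parserStream.once('drain', () => downloadDumpStream.resume());
  }
});
downloadDumpStream.on('end', () => parserStream.end());

This is essentially what .pipe() already does under the hood; the point is that backpressure leaves the S3 socket idle, which is what eventually trips the 120s timeout.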

alexandruluca commented 5 years ago

@Patrik-scelex Are you downloading just one file or multiple files? If you are downloading too many files at once, you may exhaust your bandwidth, leading to a timeout, or AWS may even terminate your connection. For me the solution was to decrease maxSockets and also throttle the maximum number of files a user can download at one time.

To decrease maxSockets, use the following when initializing the S3 instance:

httpOptions: {
    agent: new https.Agent({maxSockets: 25}),
    timeout: 8 * 60 * 1000 // 8 minutes
}

If you don't tweak this option, the default agent is used, which has maxSockets set to Infinity (if I'm not mistaken); in that case the SDK assumes a value of 50.

https://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/node-configuring-maxsockets.html
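
A minimal initialization sketch along those lines (the agent limit and timeout are just the values suggested above, not recommendations):

const https = require('https');
const AWS = require('aws-sdk');

// Cap concurrent sockets and extend the socket inactivity timeout for long downloads.
const s3 = new AWS.S3({
  httpOptions: {
    agent: new https.Agent({ maxSockets: 25 }),
    timeout: 8 * 60 * 1000 // 8 minutes
  }
});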

Based on your example, I am assuming you are creating multiple streams to download the 1GB file by applying a range; am I correct? If yes, then you can also try limiting the number of streams you create at once.

ivan-kuzma-scx commented 5 years ago

Hey @alexandruluca, thanks for the response. Actually, I'm trying to read one 1GB file with a read stream:

const downloadDumpStream = s3
  .getObject({ Bucket: bucket, Key: key, Range: range })
  .createReadStream();

When the stream receives data, I process it and store it in the database. My consumer is slow, so I pause the stream when there is a lot of pending data and resume it when there is no pending data left to process.

downloadDumpStream.on('data', slowConsumer);

My problem is the error below, which I'm receiving every time at a different place in the file.

TimeoutError: Connection timed out after 120000ms

jdleesmiller commented 5 years ago

I was having a problem similar to the original issue, and I think I can explain what's happening.

TLDR: If you have too many parallel streams from S3, node-archiver can't keep up. It will stop reading from some of the requests that have already started, which means no activity on those sockets, which can trigger a timeout.

Minimal Example

This bug isn't really specific to S3. We can reproduce the problem with a simple test server and request, so we can better see what's going on.

Server

The server generates 10MB of random data and then serves it on each request. I've also added some extra logging so we can see when the socket actually closes for each request (not just when the express response lifecycle finishes); that will be helpful later.

const util = require('util')
const express = require('express')
const cryptoRandomBytes = util.promisify(require('crypto').randomBytes)
const morgan = require('morgan')

const app = express()
const port = 3000

const NUM_BYTES = 10 * 1024 * 1024

async function main () {
  const payloadBuffer = await cryptoRandomBytes(NUM_BYTES)

  app.use(morgan('tiny'))

  app.get('/', async (req, res) => {
    const start = time()
    req.socket.on('close', function () {
      console.log(`socket closed after ${Math.round(time() - start)}ms`)
    })
    res.send(payloadBuffer.toString('hex'))
  })

  app.listen(port, () => console.log(`listening on port ${port}`))
}
main()

function time () {
  var t = process.hrtime()
  return t[0] * 1e3 + t[1] / 1e6
}

Client

The client starts a batch of up to CONCURRENT_DOWNLOADS downloads at once and puts them all into a zip archive. There is a 5s timeout on the requests. The compression level LEVEL is also configurable; 0 means no compression (fast) and 9 means a lot of compression (slow).

Note that we use the undocumented entry event to log when archiver has finished storing the entry.

const util = require('util')
const archiver = require('archiver')
const fs = require('fs')
const request = require('request')

const SERVER_URL = 'http://localhost:3000'
const TIMEOUT = 5000 // ms
const LEVEL = 1
const CONCURRENT_DOWNLOADS = 10

async function main () {
  const output = fs.createWriteStream('/tmp/test.zip')
  const archive = archiver('zip', { level: LEVEL })

  output.on('close', function () {
    console.log(archive.pointer() + ' total bytes')
    console.log(
      'archiver has been finalized and the output file descriptor has closed.'
    )
  })

  output.on('end', function () {
    console.log('data has been drained')
  })

  archive.on('entry', function (entry) {
    console.log(`stored ${entry.name}`)
  })

  archive.on('warning', function (err) {
    if (err.code === 'ENOENT') {
      console.warn(err)
    } else {
      console.error(err)
    }
  })

  archive.on('error', function (err) {
    console.error(err)
  })

  archive.pipe(output)

  try {
    const downloads = []
    for (let i = 0; i < CONCURRENT_DOWNLOADS; ++i) {
      downloads.push(download(i, archive))
    }
    await Promise.all(downloads)

    archive.finalize()
  } catch (err) {
    console.error(err)
  }
}
main()

async function download (i, archive) {
  const start = time()
  try {
    await new Promise((resolve, reject) => {
      const response = request.get(SERVER_URL, { timeout: TIMEOUT })
      response.on('error', reject)
      response.on('end', resolve)
      archive.append(response, { name: `${i}.txt` })
    })
    const delay = Math.round(time() - start)
    console.log(`request ${i} done in ${delay}ms`)
  } catch (err) {
    const delay = Math.round(time() - start)
    throw new Error(`request ${i} failed in ${delay}ms: ${err.stack}`)
  }
}

function time () {
  var t = process.hrtime()
  return t[0] * 1e3 + t[1] / 1e6
}

Results

Note: the exact results will depend on how fast your machine is. These are from a 2016 MacBook Pro with Node 10.13.0 and archiver 3.0.0.

The Fast Case

If we set LEVEL = 0, so archiver is very fast, we get this from the client:

stored 0.txt
request 0 done in 671ms
stored 1.txt
request 1 done in 811ms
stored 2.txt
request 2 done in 954ms
stored 3.txt
request 3 done in 1082ms
stored 4.txt
request 4 done in 1234ms
stored 5.txt
request 5 done in 1366ms
stored 6.txt
request 6 done in 1552ms
stored 7.txt
request 7 done in 1816ms
stored 8.txt
request 8 done in 1966ms
stored 9.txt
request 9 done in 2290ms
209716242 total bytes
archiver has been finalized and the output file descriptor has closed.

In this case we can finish the whole batch of 10 concurrent downloads. However, we can see a pattern in the client logs: each request takes longer than the previous request (671ms, 811ms, 954ms, ...).

In the server logs, we see that express finishes serving the data in constant time (~50ms), but it takes longer and longer to close each socket:

GET / 200 20971520 - 46.049 ms
GET / 200 20971520 - 54.173 ms
GET / 200 20971520 - 50.819 ms
GET / 200 20971520 - 47.403 ms
GET / 200 20971520 - 44.660 ms
GET / 200 20971520 - 54.959 ms
GET / 200 20971520 - 48.732 ms
GET / 200 20971520 - 45.700 ms
GET / 200 20971520 - 55.521 ms
GET / 200 20971520 - 45.581 ms
socket closed after 626ms
socket closed after 683ms
socket closed after 877ms
socket closed after 857ms
socket closed after 1054ms
socket closed after 1037ms
socket closed after 1277ms
socket closed after 1439ms
socket closed after 1537ms
socket closed after 1775ms

This is basically because we started all the requests (opened all the sockets) at the same time, but archiver is processing them one at a time.

More Compression

So, what if we set LEVEL = 9 to make archiver slower? Client output:

request 0 done in 1635ms
stored 0.txt
request 1 done in 2761ms
stored 1.txt
request 2 done in 3812ms
stored 2.txt
request 3 done in 4861ms
stored 3.txt
Error: request 6 failed in 5364ms: Error: ESOCKETTIMEDOUT
    at ClientRequest.<anonymous> (/Users/john/write_latex_ops/analytics/node_archiver_test/node_modules/request/request.js:816:19)
    at Object.onceWrapper (events.js:273:13)
    at ClientRequest.emit (events.js:182:13)
    at Socket.emitRequestTimeout (_http_client.js:661:40)
    at Object.onceWrapper (events.js:273:13)
    at Socket.emit (events.js:182:13)
    at Socket._onTimeout (net.js:449:8)
    at ontimeout (timers.js:436:11)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
    at download (/Users/john/write_latex_ops/analytics/node_archiver_test/client.js:71:11)
    at process._tickCallback (internal/process/next_tick.js:68:7)
request 4 done in 5931ms
stored 4.txt

Now the requests are taking longer, so they back up faster. Requests 0-4 go through OK, but request 6 times out after a bit over 5s, which is the network timeout we set.

On the server, we see the same sort of pattern:

GET / 200 20971520 - 54.526 ms
GET / 200 20971520 - 61.667 ms
GET / 200 20971520 - 53.975 ms
GET / 200 20971520 - 48.429 ms
GET / 200 20971520 - 50.317 ms
GET / 200 20971520 - 66.456 ms
GET / 200 20971520 - 52.799 ms
GET / 200 20971520 - 49.345 ms
GET / 200 20971520 - 71.275 ms
GET / 200 20971520 - 50.014 ms
socket closed after 1537ms
socket closed after 2545ms
socket closed after 3665ms
socket closed after 4599ms
socket closed after 5068ms
socket closed after 5053ms
socket closed after 5050ms
socket closed after 5072ms
socket closed after 5051ms
socket closed after 5613ms

All the requests finish quickly as far as express is concerned, but the sockets take longer to close, leading up to 5s when they all time out.

Too Many Requests

Even without the very heavy compression of the previous case (LEVEL = 9), making enough concurrent requests still produces timeouts. For example, with LEVEL = 1 and CONCURRENT_DOWNLOADS = 20, we get up to request 10 before timing out, rather than request 6, but we still time out:

request 0 done in 1612ms
stored 0.txt
request 1 done in 2086ms
stored 1.txt
request 2 done in 2577ms
stored 2.txt
request 3 done in 3034ms
stored 3.txt
request 4 done in 3534ms
stored 4.txt
request 5 done in 4042ms
stored 5.txt
request 6 done in 4517ms
stored 6.txt
request 7 done in 5045ms
stored 7.txt
Error: request 10 failed in 5538ms: Error: ESOCKETTIMEDOUT
    at ClientRequest.<anonymous> (/Users/john/write_latex_ops/analytics/node_archiver_test/node_modules/request/request.js:816:19)
    at Object.onceWrapper (events.js:273:13)
    at ClientRequest.emit (events.js:182:13)
    at Socket.emitRequestTimeout (_http_client.js:661:40)
    at Object.onceWrapper (events.js:273:13)
    at Socket.emit (events.js:182:13)
    at Socket._onTimeout (net.js:449:8)
    at ontimeout (timers.js:436:11)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
    at download (/Users/john/write_latex_ops/analytics/node_archiver_test/client.js:71:11)
    at process._tickCallback (internal/process/next_tick.js:68:7)
request 8 done in 5543ms
stored 8.txt
request 9 done in 6090ms
stored 9.txt

Analysis

Internally, archiver uses an async queue with concurrency 1:

https://github.com/archiverjs/node-archiver/blob/488f2040b5cba271fd2c5ee50edc31595d55777d/lib/core.js#L52

(I think this is because the zip file format is basically linear; it's not a deficiency of archiver per se. I think other zip libraries, such as yazl would have the same problem.)

I think this means that even if you feed a lot of streams into archiver in parallel, it will only process one at a time. The problem is that in order to get a stream from the AWS SDK, we have to start the request (get the headers and start streaming the body), so the stream is open while it is waiting in archiver's internal queue, even though archiver hasn't got to it yet. We're therefore acting like a 'slow client', which can trigger the socket timeouts.

Solutions

  1. If you download one file at a time (basically run the code above with 1 concurrent download serially for each file), that should be pretty safe. However, if you have a lot of small files and/or light compression, this will probably be quite a bit slower. If you have large files and/or heavy compression, I would guess it would not be much slower.

  2. Extending the timeout helps, because it lets the client keep the sockets open longer, which gives node archiver more time to catch up. However, holding open the connections is not free, and if there is an actual network issue, it will take longer for you to find out about it.

  3. Another approach would be to buffer the data to disk (or in memory, if you're sure you have enough memory, but that mostly defeats the purpose of streaming). You could start a large number of downloads to temp files and then zip the temp files. That would be safe, because there's no problem with leaving the data on disk while archiver catches up, but again it's likely to be slower if there are a lot of small files or light compression.

Which solution is preferred will depend on the use case.

Edit: Also, I think the lazy stream solution in https://github.com/aws/aws-sdk-js/issues/2087#issuecomment-474722151 would avoid the timeouts, as would wrapping the S3 streams with something like the lazystream package. Both would start the request lazily, but AFAICT it would only start one request at a time, as in solution (1).
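
A minimal sketch of that lazy wrapping, assuming the lazystream package (the helper name lazyS3Stream is made up for illustration):

const lazystream = require('lazystream');

// Defer the S3 request until archiver actually starts reading this entry.
const lazyS3Stream = (s3, params) =>
  new lazystream.Readable(() => s3.getObject(params).createReadStream());

// usage: archive.append(lazyS3Stream(s3, { Bucket: bucket, Key: key }), { name: fileName });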

dougmoscrop commented 5 years ago

Does the AWS SDK use archiver under the hood? I'm fetching objects that are compressed, but I'm not telling S3 about that, so I decompress in 'userland', and I still get timeouts. I get timeouts on PUT too sometimes. Could be a completely unrelated problem (and to be fair, so could the archiver one)

xinyangyuan commented 5 years ago

Have the same issue, and was able to get around the timeout error by manually closing the stream:

const stream = s3.getObject(params).createReadStream()

stream.on('end', () => {
  console.log('already done');
  stream.end(); // manually closes the stream
});

stream.pipe(res);

Update: the following code does seem to handle all unclosed s3 streams in my project:

try {
  var stream = s3Service.s3.getObject(params).createReadStream();

  stream.on('error', () => stream.end()); // manually closes the stream on error
  req.on('close', () => stream.end()); // ensure the stream is closed if the request is closed/cancelled

  stream.pipe(res);
} catch {
  if (stream) stream.end();
}
kishor-nrpt commented 5 years ago

Any update on this issue? This is stopping me from implementing a feature in our project.

kishor-nrpt commented 5 years ago

Please suggest an alternative way to stream files from S3.

jpwiddy commented 4 years ago

Any ideas on this? Are we left manually closing the stream?

jpwiddy commented 4 years ago

To those here with the timeout still happening after the stream is finished, you should check the stream's readableFlowing value - https://nodejs.org/api/stream.html#stream_three_states

Calling readable.pause(), readable.unpipe(), or receiving backpressure will cause the readable.readableFlowing to be set as false, temporarily halting the flowing of events but not halting the generation of data. While in this state, attaching a listener for the 'data' event will not switch readable.readableFlowing to true.

Mine was still set to false.
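
A quick way to inspect that state (illustrative only; s3, params, and res are assumed to exist as in the earlier snippets):

const stream = s3.getObject(params).createReadStream();
stream.pipe(res);

res.on('finish', () => {
  // false here means something paused or unpiped the source and it never resumed
  console.log('readableFlowing:', stream.readableFlowing);
});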

gregory commented 4 years ago

you guys should check out: https://github.com/miratronix/s3-readable-stream

amiantos commented 4 years ago

@bebu259 thanks for the suggestion! This actually helped me quite a bit, up to around 300 files. After 300 files my Lambda function dies entirely, with no error output or anything. Not quite sure what the deal is just yet, still debugging. 300 files zip up in about 20 seconds, so I don't think it's a timeout issue. But mostly I wanted to say thanks and 👍

Edit: I think I am running up against some sort of size limit, not quite sure... the failing test that had 400+ files in it also had a 1GB junk file in it. 400+ files under 20 MB each zip up just fine.

alex-jitbit commented 4 years ago

I came here from Google and wanted to leave a note about what the reason was in my case.

I was getting this error (TimeoutError: Connection timed out after 120000ms) with a single file: no bulk download, archiving, or anything. Just downloading a single file from S3 in an Express app.

But I was piping the stream from S3 to two writable streams at once: one was the HTTP response stream, the other was caching to a local file.

s3stream.pipe(cacache.put.stream(_cachePath, filekey)); //pipe to disk cache    
s3stream.pipe(res); //pipe to http response

Don't do that. If you do, for some reason the s3stream's 'close' event never gets raised. It stays open and eventually you get a timeout.

Hope I saved someone's time.

jmmadsen commented 4 years ago

Any updates on this? I am using the snippet below, but for some reason I still get the timeout error despite calling cosStream.end()...

const cosStream = cos.getObject(params, (err) => {
  if (err) {
    console.error(err);
    res.sendStatus(500);
  }
}).createReadStream();

cosStream
  .on('error', (err) => {
    console.log('*** there was an error ***');
    console.log(err);
    res.sendStatus(500);
  })
  .on('end', () => {
    console.log('stream was closed');
    cosStream.end();
  });

vovanluc1 commented 4 years ago

Hi, I think you should change your code like this:

export async function downloadArchive(req, res, next) {
  const { archiveContent } = await getArchiveContent(req.params.id);

  // --- create archive ---

  const archive = archiver('zip');

  archive
    .on('error', (err) => {
      console.log('archive error', err);
    })
    .on('entry', entry => {
      console.log('archive event: entry ', entry.name);
    })
    .on('progress', data => {
      console.log('archive event: progress', data);
    })
    .on('data', data => {
      console.log(`archive event: data | processed total ${(archive.pointer() / (1024 ** 2)).toFixed(3)} MB | data length: ${data.length}`);
    });

  res.setHeader('Content-disposition', `attachment; filename*=UTF-8''content-archive.zip; filename=content-archive.zip`);
  res.setHeader('Content-type', 'application/zip');

  archive.pipe(res);

  // --- create all S3 read streams and append them to the archive ---

  for (const entry of archiveContent) {
    console.log(`create stream for ${entry.nameInArchive}`);

    const stream = s3
      .getObject({ Bucket: S3_BUCKET_NAME, Key: entry.path })
      .createReadStream()
      .on('error', err => {
        console.log('stream error', err);
      })
      .on('finish', () => {
        console.log('stream finish');
      })
      .on('close', () => {
        console.log('stream close');
      });

    // append read stream to archive:
    archive.append(stream, { name: entry.nameInArchive });
  }

  // --- response events ---

  res
    .on('error', err => {
      console.log('res error', err);
    })
    .on('finish', () => {
      console.log('res finish');
    })
    .on('close', () => {
      console.log('res close');
      archive.abort();
    });

  // archive.pipe(res); => remove this line (res is already piped above)
  archive.finalize();
}

amiantos commented 4 years ago

I set up a GitHub gist with my solution for zipping up multiple files from S3 using a Lambda function over here. It uses part of the solution mentioned by @bebu259, and without their suggestion this probably wouldn't have worked.

I've tested it by zipping up over 400 files stored in S3 and it worked quickly and flawlessly. Hope it helps someone!

https://gist.github.com/amiantos/16bacc9ed742c91151fcf1a41012445e

tstecca commented 4 years ago

I've been working on an AWS Lambda function for a few days now to zip the contents of an S3 bucket. My buckets have up to 6000 files in them, and I was having the same issue as several people here where the Lambda function would just exit unexpectedly. Lambda was throwing no errors, and neither were s3.getObject or Archiver.

@jdleesmiller definitely got to the bottom of it. It has to do with Archiver being unable to handle the volume of streams, and you have to slow down the rate of either appending or piping. I tried several ways of slowing down the process, including awaiting the end, close, or finish events of streams before appending the next one to the archive, but none seemed to work.

In the end, I arrived at a working solution based on @bebu259's answer: https://github.com/aws/aws-sdk-js/issues/2087#issuecomment-474722151. @amiantos is also implementing it, as he mentions above. The key seems to be appending a PassThrough stream to the archiver instead of the S3 stream, then waiting for the PassThrough's 'newListener' event to fetch the S3 stream and pipe it.

Many thanks to all on this thread for the assistance.

tteke commented 3 years ago

I am experiencing the same issue in a different use case: streaming different ranges of a large file. The front end gets all the chunks just fine, but after 120000ms all the getObject() calls with range requests get a timeout error. No archiving is happening in my case, so I don't think it's related to node-archiver.

tteke commented 3 years ago

The problem definitely starts with piping the request, because if I use the following snippet instead of piping, the stream timeout doesn't occur. However, the whole file is sent significantly slower (in my case, the video buffers significantly slower).

readStream.on("readable", () => {  
  const d = readStream.read();  
  if (d === null) return;  
  res.write(d);
});  
readStream.on("end", () => {  
  res.end();  
});  
henrituan commented 3 years ago

I've run into this problem and found a solution inspired by this Stack Overflow answer.

What I needed was to abort the S3 request when the stream finishes, using the 'end' event and the abort() method on the S3 request object:

const getReadStreamFromS3 = ( fileName: string) => {
    const awsParams = {
      Bucket: 'my bucket',
      Key: fileName,
    };
    const request = this.s3Bucket.getObject(awsParams);

    return request.createReadStream()
        .on('error', (e) => {
            console.log('s3 streaming error');
            console.log(e);
        })
        .on('end', () => {
            console.log('s3 streaming end');
            request.abort();
        });
};

Hope this helps

amankumar4real commented 2 years ago

I had the same issue, and it turned out to be because, while using a Docker container, I was giving my local machine's download path instead of the container's. Once I changed that, the issue was resolved.

zpetukhov commented 2 years ago

The reason for the described issue is clear enough: there's a simple loop in the code where all the S3 streams are opened and appended to the archive, but the archiver processes them one by one. And when the client downloads slowly, we can easily reach the 120-second timeout. One possible solution is to prepare the S3 streams and append them to the archive one by one, i.e., create the next stream and append it only after the previous stream has been processed. I wrote a small lib that uses this approach; you can take a look at the code: https://github.com/zpetukhov/yas3z/blob/main/yas3z.js Take a look at this line:

archive.on('entry'

This is the place where we feed the archiver the next S3 file, so the S3 stream is opened just before the archiver consumes it, instead of opening all the streams at the beginning and keeping a bunch of them open, which results in timeouts (but only when the client downloads slowly enough; hint: use curl with the --limit-rate parameter for testing).

And I also created an expressJS example, in the same repo: https://github.com/zpetukhov/yas3z/blob/main/examples/expressjs/app.js
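
A minimal sketch of that one-entry-at-a-time pattern (not the library's actual code; archive, s3, bucket, and the entries array are assumed to exist):

// Append the next S3 object only after archiver has fully processed the previous entry.
function appendSequentially(archive, s3, bucket, entries) {
  let i = 0;
  const appendNext = () => {
    if (i >= entries.length) {
      archive.finalize();
      return;
    }
    const { key, name } = entries[i++];
    const stream = s3.getObject({ Bucket: bucket, Key: key }).createReadStream();
    archive.append(stream, { name });
  };
  archive.on('entry', appendNext); // fires once the previous file has been written into the archive
  appendNext();
}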

TheMakerOfWorlds commented 9 months ago

I still seem to be getting the same error when downloading multiple very large files.

RanVaknin commented 4 months ago

Hi Everyone on the thread.

This issue is quite old, and based on the in-depth root cause analysis in this thread, it does not seem to be directly related to the SDK.

Since v2 is on the road to maintenance mode, and GetObject's response in the v3 SDK already returns a stream by default, I don't think this applies there, so rather than transferring it I will close this issue. Even though the issue is closed, it should still be discoverable for legacy customers who are still seeking solutions on the v2 repo.

Thanks, Ran~