dart-lang / pub-dev

The pub.dev website
https://pub.dev
BSD 3-Clause "New" or "Revised" License

Simplify dartdoc storage #1333

Closed isoos closed 2 years ago

isoos commented 6 years ago

Right now we upload all the generated files separately to the bucket and serve / proxy them from there. Uploading them one-by-one takes a long time, way more than generating them.

I'm wondering if it would be less wasteful on resources if we uploaded only the archive, and when the dartdoc service gets a request for something inside it, it would serve it from a disk-based cache (and download the package archive if it is not in the cache yet).

@jakobr-google wdyt?
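A minimal sketch of the idea, with hypothetical names (DartdocArchiveCache, _downloadArchive and _extract are placeholders, not the actual pub-dev code):

```dart
import 'dart:io';

/// Hypothetical sketch: serve dartdoc files from a local disk cache,
/// fetching and extracting the package's archive lazily on first access.
class DartdocArchiveCache {
  final Directory root;
  DartdocArchiveCache(this.root);

  /// Returns the requested documentation file, downloading and extracting
  /// the archive on the first request for this package version.
  Future<File> resolve(String package, String version, String path) async {
    final dir = Directory('${root.path}/$package-$version');
    if (!await dir.exists()) {
      final archive = await _downloadArchive(package, version);
      await _extract(archive, dir);
    }
    return File('${dir.path}/$path');
  }

  // Placeholder: fetch the '<package>-<version>.tar.gz' archive from the bucket.
  Future<File> _downloadArchive(String package, String version) async =>
      throw UnimplementedError();

  // Placeholder: uncompress the archive into [target] on local disk.
  Future<void> _extract(File archive, Directory target) async =>
      throw UnimplementedError();
}
```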

jakobr-google commented 6 years ago

I don't know how much local storage we have available, but it sounds like it'd be worth exploring.

It would mean that the first access to a package's docs would be slower, though, potentially on each of the (max 8) instances, so I'm not sure how well it'll perform.

isoos commented 6 years ago

We have 8 instances only because doc generation (and upload) takes a long time. Once that is done, it will go back to 2 instances.

We could also pre-fetch the latest archives of the most popular packages, making their access fast.

isoos commented 6 years ago

Based on a random sampling, I think the documentation of the latest stable versions should be around 1-1.4 GB compressed, but the uncompressed size would be around 10-15 GB. However, access would not always be to the latest version, as cross-documentation links are versioned to the resolved package version.

The default instance that we are using has ~4 GB free disk space for our use. We can request more resources, but it may be cheaper to request memory and/or use memcache more aggressively for this.

Before going ahead with the actual implementation of the local-disk-based serving, I think I would like to do a bit of tracking first:

/rfc @kevmoo @sortie

kevmoo commented 6 years ago

Leaving it as an engineering decision. I support measuring first.

Have you looked at: https://cloud.google.com/storage/docs/best-practices#uploading

Also: I wonder if "shelling out" to gsutil would be faster – something to investigate?

Or maybe create a gRPC client? https://github.com/googleapis/googleapis/tree/master/google/storagetransfer/v1

isoos commented 6 years ago

Have you looked at: https://cloud.google.com/storage/docs/best-practices#uploading Also: I wonder if "shelling out" to gsutil would be faster – something to investigate?

My (still not in-depth) search shows the following:

  • there is no bulk upload api,
  • gsutil uses the same api endpoints as we do,
  • "storage transfer" is a service for transferring between buckets (e.g. from AWS S3, or between Google Buckets).

There seems to be no quick win here. We could increase the concurrency of the uploads, but I think we have bigger gains if the archive-only approach works.

kevmoo commented 6 years ago

ack

isoos commented 6 years ago

#1342 should provide reasonable data on the requirements, but with a bit more experimenting I fear that the real decision point will be the latency of downloading the package.tar.gz and uncompressing it to local disk.

Maybe this should be decided per-PackageVersion:

sortie commented 6 years ago

Some ideas without having looked much into the problem: Is there a way to upload an archive and ask the remote to extract it? Could we ask the cloud people for a bulk upload option? Can we upload files individually but much more concurrently? Can we have a background process download the uploaded archives and convert them to individual files, without blocking the initial upload?

isoos commented 6 years ago

Is there a way to upload an archive and ask the remote to extract it? Could we ask the cloud people a bulk upload option?

I am not aware of anything like that, but I would be happy if it existed.

Can we upload files individually but much more concurrently?

When we deploy a new version, we have ~16 processes uploading files with concurrency=4 each. We can easily increase that concurrency; I'm not sure what the effect would be on the storage bucket.

Can we have a background process download the uploaded archives and convert them to individual files, without blocking the initial upload?

That wouldn't save us much: with the current structure, we don't make the new version active until all of the files have been uploaded. It would shift when the actual upload happens, but wouldn't unblock anything else.
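For illustration, bumping the per-process upload concurrency could look roughly like this with package:pool (the helper, the object naming, and the default concurrency are assumptions for this sketch, not the actual backend code):

```dart
import 'dart:io';

import 'package:gcloud/storage.dart';
import 'package:pool/pool.dart';

/// Illustrative sketch only: upload the generated files with a bounded
/// number of concurrent requests per process.
Future<void> uploadAll(Bucket bucket, List<File> files,
    {int concurrency = 4}) async {
  final pool = Pool(concurrency);
  await Future.wait(files.map((file) => pool.withResource(() async {
        // Hypothetical object naming; the real code derives it differently.
        final objectName = 'dartdoc/${file.path}';
        await bucket.writeBytes(objectName, await file.readAsBytes());
      })));
  await pool.close();
}
```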

kevmoo commented 6 years ago

You could always use triggers and pub/sub to schedule an extract of the archive after upload.

More plumbing, but at least it could be done in the background...

isoos commented 6 years ago

Wow, GitHub's edit-the-comment feature allowed me to edit @kevmoo's comment, and I did so without realizing it. Anyway, the reply to it:

You could always use triggers and pub/sub to schedule an extract of the archive after upload.

I don't think we win anything by it: the serving of those assets starts only after the files are extracted and uploaded again, and the process just adds more latency with the pub/sub queue handling.

The clear win is when we don't need to upload the individual files at all, only the archive.

isoos commented 6 years ago

Tracking was running for almost a day, and it revealed some interesting numbers already:

If we were to store and extract all archives locally, it would require at least 5 GB of disk space at its (current) peak, and we could be storing ~500 package version archives locally at any time. It is not tracked in an exact way, but based on the fluctuation of the size requirements, there is a large turnover of archives, as if there were a few requests for a certain version, and then that version is not requested anymore.

If we were to store only the obsolete version archives (non-latest, older than two years), it would be far fewer packages at a time (~40) and much less disk space (current samples topped out at ~1 GB), but the turnover is there too.

It is not clear if/how the search engine optimisation in #1368 would affect these numbers.

isoos commented 6 years ago

It looks like the load varies a bit; the numbers topped out this week:

isoos commented 6 years ago

I'm ditching the idea of storing obsolete archives differently. Based on my experiments, the appengine load balancing doesn't always route subsequent requests to the same host, and even if we blocked on the first request to download and extract the archive, the second request may go to a different instance, where we would be doing the very same thing again.

Instead, I would check if we could increase the upload parallelism, and maybe turn off the link validation in the generated docs.

jonasfj commented 5 years ago

We could consider:

Or we could store documentation files in datastore, if we're truly hitting GCS upload limits, which I suspect we're not. If we are at 1k writes / s, we could also distribute packages across 100 buckets :)

isoos commented 5 years ago

@jonasfj are you aware of any (upload) throttling in the gcloud or appengine packages?

jonasfj commented 5 years ago

Well obviously, we could bump concurrency to 100 without much controversy...

https://github.com/dart-lang/pub-dartlang-dart/blob/77687ab7c39f02547315fd8433f5965a096ff36a/app/lib/dartdoc/backend.dart#L34

And since we can stat the files, we can supply the length in: https://github.com/dart-lang/pub-dartlang-dart/blob/77687ab7c39f02547315fd8433f5965a096ff36a/app/lib/dartdoc/backend.dart#L159

This will likely avoid resumable uploads, in many cases.
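A hedged sketch of what supplying the length could look like with the package:gcloud Bucket API (the helper is illustrative, not the actual backend.dart code):

```dart
import 'dart:io';

import 'package:gcloud/storage.dart';

/// Illustrative sketch only: when the file size is known up front, pass it
/// to Bucket.write so that small files can go through a single upload
/// instead of the chunked resumable protocol.
Future<void> uploadWithLength(Bucket bucket, File file, String objectName) async {
  final length = await file.length();
  final sink = bucket.write(objectName, length: length);
  await file.openRead().pipe(sink);
}
```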

Also we should start to ask questions about: https://github.com/dart-lang/gcloud/blob/fbe7ef67f9234e2af00c18db6996c08f26a006ee/lib/src/storage_impl.dart#L487

Maybe it should be 32 MB instead of 1 MB. I'm not entirely sure how these resumable uploads work, but I suspect it creates a request for every 1 MB chunk. I'm also not sure what the recommended chunk size is; maybe @mkustermann knows why it's at 1 MB?


I didn't check appengine or gcloud for http.Client limitations on number of concurrent connections. But with a concurrency limit at 8, I would be surprised if we were close.

mkustermann commented 5 years ago

Maybe it should 32MB instead of 1MB, I'm not entirely sure how this resumable uploads work, but I suspect it creates a request for every 1MB chunk. I'm also not sure what the recommended chunk size is, maybe @mkustermann know why it's at 1MB?

The API is very user-friendly by allowing, for example, await byteStream.pipe(storage.write("<object-name>")), i.e. the user just provides a Stream<List<int>> and the API will try to reliably upload this stream of bytes. Note that the length is usually not known to the API.

So first of all, trying to upload a huge amount of data with a single HTTP POST will have a very high probability of failure. So there is simply no point in trying to upload it in one request and hoping that it works (even retrying may consistently fail if the data is big enough). That implies we might need to upload in chunks, which is implemented with the resumable upload protocol.

It still leaves the questions: a) when to use resumable uploads vs normal uploads, and b) what happens if a chunk or the entire upload fails.

a) In order to decide when to use normal vs resumable uploads, we let the data start flowing in; if we got all the data and it is less than 1 MB, we use a normal upload, otherwise a resumable upload.

b) For small uploads we assume they have a high success probability p (let's say 0.99), so we use a simple single upload. For very large resumable uploads the success probability goes down quickly, namely with p^n where n is the number of chunks. Already with 10 chunks we have 0.99 ^ 10 ≈ 0.90, i.e. a ~10% failure probability. So for resumable uploads we have to retry, which happens here. The chunk size defaults to 1 MB here as well.

Coming back to the user API, we have a Stream<List<int>> interface. Yet for both purposes above we buffer data in memory either for finding out which upload mechanism to use, or for being able to retry.

So the 1 MB is the amount of RAM we use for buffering (depending on how the bytes are stored, it might even be a multiple of that).
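A rough sketch of the decision logic described above (an illustration of the idea, not the actual storage_impl.dart code): buffer the incoming stream up to the threshold; if everything fits, do a single normal upload, otherwise fall back to a chunked resumable upload.

```dart
import 'dart:async';

const int maxNormalUploadLength = 1024 * 1024; // the 1 MB threshold

/// Illustration only: decide between a single normal upload and a chunked
/// resumable upload by buffering at most [maxNormalUploadLength] bytes.
Future<void> upload(
  Stream<List<int>> data,
  Future<void> Function(List<int> bytes) normalUpload,
  Future<void> Function(Stream<List<int>> data) resumableUpload,
) async {
  final buffer = <int>[];
  final it = StreamIterator(data);
  var overflow = false;
  while (await it.moveNext()) {
    buffer.addAll(it.current);
    if (buffer.length > maxNormalUploadLength) {
      overflow = true;
      break;
    }
  }
  if (!overflow) {
    // All data fit into the buffer: a single POST is enough, and the
    // buffered bytes can be re-sent if that one request has to be retried.
    await normalUpload(buffer);
    return;
  }
  // Too large for a single request: replay the buffered prefix and then the
  // rest of the stream through the resumable upload (retries omitted here).
  Stream<List<int>> rest() async* {
    yield buffer;
    while (await it.moveNext()) {
      yield it.current;
    }
  }

  await resumableUpload(rest());
}
```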

jonasfj commented 5 years ago

I guess I'm wondering why it's 1 MB; it means we'll be creating ~1k requests to upload 1 GB. All I see in the docs is that it should be a multiple of 256 KB; there are no recommendations for what size will give good performance. I think Python uses 100 MB per chunk by default.

Note: this is probably not the bottleneck for us here. More likely than not, our uploads would go faster if we just supplied the length, since that would avoid resumable uploads for small files, and most of what we have is small files. (Obviously, we should also increase the concurrency of uploads to something like 100.)

isoos commented 5 years ago

Obviously, we should also increase the concurrency of uploads to something like 100.

A note here: the concurrency setting we have is per-process, and if we have 64+ instances (which we had for one of the releases), it is already hitting 256 concurrent uploads. I'd rather go towards using the bulk upload API.

jonasfj commented 5 years ago

The scaling limits are not request limits as far as I read them, hence I don't think a bulk API would get us around the 1k req/s limit.

But let's just get close to that 1k: currently we have a max of 8 instances for dartdoc, so let's give them an upload concurrency of 64 or so.

mkustermann commented 5 years ago

Keep in mind that it is a DoS surface when a client can easily cause the server to crash with an OOM (out of memory). Dart's way of doing I/O is already not very nice in terms of control of memory (due to unbounded buffers in StreamSink<List<int>>s), so if you increase the in-memory buffers to high levels, e.g. 100 MB, it's easy to bring the server down by just a few requests.

Also huge buffers might cause a lot of stress on the Dart VM's garbage collector.

It always depends on the use-case. If the process is doing only a single thing at a time, then it's fine to increase the limit.

isoos commented 2 years ago

This has been implemented.