dolthub / dolt

Dolt – Git for Data
https://www.dolthub.com
Apache License 2.0

`fetch` way slower than `clone` #6207

Open · timsehn opened this issue 1 year ago

timsehn commented 1 year ago

It looks like we do lots of parallel downloads for clone but for fetch we proceed serially. This makes a large difference for large databases.

For instance:

https://www.dolthub.com/repositories/dolthub/transparency-in-pricing

I can clone it (65GB) in 15 minutes:

$ time dolt clone dolthub/transparency-in-pricing      
cloning https://doltremoteapi.dolthub.com/dolthub/transparency-in-pricing
dolt clone dolthub/transparency-in-pricing  101.36s user 220.56s system 33% cpu 15:54.65 total

But after adding a fork as a remote, the fetch has been running for 2 hours and has downloaded just under 10GB:

$ dolt remote add rl1987 rl1987/transparency-in-pricing
$ dolt fetch rl1987
Downloaded 5,409,350 chunks, 9.5 GB @ 2.6 MB/s.

We should use the same parallel download approach for fetch that we use for clone, if we can.

reltuk commented 1 year ago

To document as a starting point:

Currently fetch is slower than clone because clone downloads all the table files in bulk, while fetch starts from the DAG and downloads the missing chunks, walking the DAG as it goes. fetch does try to use concurrency, and in theory it could be much better, but it's inherently more expensive because it's doing lots of little fetches, making RPCs to compute the locations of the fetches it needs to do, making local queries to see if it needs to make those fetches, etc.
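
For illustration, here is a rough Go sketch of the shape of that walk. All types and names below are hypothetical stand-ins, not Dolt's actual API; the point is that fetch issues many small, dependent requests, while clone can just bulk-download table files.

```go
// Hypothetical sketch of the fetch-side chunk walk; not Dolt's real types.
package fetchsketch

import "context"

// Addr is a content address for a chunk (Dolt's hashes are 20 bytes).
type Addr [20]byte

type ChunkSource interface {
	// What clone does: bulk-copy whole table files.
	DownloadTableFiles(ctx context.Context) error
	// What fetch does: many small requests for individual chunks.
	GetChunks(ctx context.Context, addrs []Addr) ([][]byte, error)
}

// fetchWalk outlines the fetch path: start from the new refs, download the
// chunks we are missing, parse them for child refs, and repeat level by level.
func fetchWalk(ctx context.Context, src ChunkSource, has func(Addr) bool,
	roots []Addr, childRefs func(chunk []byte) []Addr) error {

	frontier := roots
	for len(frontier) > 0 {
		// Keep only the chunks we do not already have locally.
		var missing []Addr
		for _, a := range frontier {
			if !has(a) {
				missing = append(missing, a)
			}
		}
		if len(missing) == 0 {
			return nil
		}
		// One of many comparatively small RPCs on the fetch path.
		chunks, err := src.GetChunks(ctx, missing)
		if err != nil {
			return err
		}
		var next []Addr
		for _, c := range chunks {
			next = append(next, childRefs(c)...)
		}
		frontier = next
	}
	return nil
}
```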

There are lots of things we can do to improve the performance of fetch. In general, we have experimental rewrites which allow us to structure pulling as more of a pipeline, and we could iterate on those to get better parallelism and control over the concurrency of downloads and chunk ref walks, for example.
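
As a very rough sketch of that pipeline idea (all names and types below are hypothetical, not the experimental rewrites themselves): discovery of chunk addresses, resolution of addresses to physical locations, and the downloads themselves can run as separate stages connected by channels, each with its own concurrency.

```go
// Hypothetical sketch of a pipelined pull; not Dolt's actual code.
package pullsketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

type Addr [20]byte

// Location is a resolved physical position of a chunk in a table file.
type Location struct {
	File   string
	Offset uint64
	Length uint32
}

func pullPipeline(
	ctx context.Context,
	discover func(ctx context.Context, out chan<- Addr) error, // walks chunk refs
	resolve func([]Addr) ([]Location, error),                  // address -> physical location
	download func(Location) error,                             // fetches bytes for one location
	downloaders int,
) error {
	addrs := make(chan Addr, 1024)
	locs := make(chan Location, 1024)

	eg, ctx := errgroup.WithContext(ctx)

	// Stage 1: walk the chunk refs and emit missing addresses.
	eg.Go(func() error {
		defer close(addrs)
		return discover(ctx, addrs)
	})

	// Stage 2: resolve addresses to physical locations in batches.
	eg.Go(func() error {
		defer close(locs)
		batch := make([]Addr, 0, 256)
		flush := func() error {
			if len(batch) == 0 {
				return nil
			}
			resolved, err := resolve(batch)
			if err != nil {
				return err
			}
			for _, l := range resolved {
				select {
				case locs <- l:
				case <-ctx.Done():
					return ctx.Err()
				}
			}
			batch = batch[:0]
			return nil
		}
		for a := range addrs {
			batch = append(batch, a)
			if len(batch) == cap(batch) {
				if err := flush(); err != nil {
					return err
				}
			}
		}
		return flush()
	})

	// Stage 3: a pool of concurrent downloaders.
	for i := 0; i < downloaders; i++ {
		eg.Go(func() error {
			for l := range locs {
				if err := download(l); err != nil {
					return err
				}
			}
			return nil
		})
	}

	return eg.Wait()
}
```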

One thing we know we need to do, but have not gotten around to, is to make I/O scheduling better on the pull path. As pulling proceeds, it gets to the leaf layers of the prolly trees, which have very high fan-out, and it ends up with a large number of chunk addresses for the chunks it needs to fetch. Currently the only interface available to it is to call into (ChunkStore).GetManyCompressed with a subset of those addresses and download a batch at a time.

In reality, you want to concurrently resolve all in-flight chunk addresses to physical locations, and you want to batch resolved chunk locations into large fetches by coalescing the fetches of (near-adjacent) chunks with each other (see the sketch after the list below). The current ChunkStore implementation does coalesce, but because we group the chunk addresses into fixed-size batches in a location-agnostic way, it misses out on lots of opportunities to actually batch. Similarly, because the interface is just GetManyCompressed(addrs []hash.Hash), we do not resolve the addresses to physical locations until we go into that call, and we do it in a sequential fashion: we resolve all the addresses in the batch, batch the locations up into downloads as best we can, and then send off the concurrent downloads. And GetManyCompressed only returns after all of the chunks in the batch have been fetched, at which point we can use that concurrency slot for another batch of chunk addresses.

The end result is that fetch performance suffers a lot from:

  1. Failure to coalesce downloads of chunks when the opportunity is actually present.
  2. Stragglers on the location-resolution and download path reducing the throughput of the overall system.
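
To make the coalescing point concrete, here is a small sketch (hypothetical types, not the actual ChunkStore implementation): once chunk addresses have been resolved to physical locations, near-adjacent byte ranges within the same table file can be merged into one larger ranged download instead of many small ones.

```go
// Hypothetical sketch of coalescing resolved chunk locations into ranged downloads.
package coalescesketch

import "sort"

// Location is a resolved physical position of a chunk in a table file.
type Location struct {
	File   string
	Offset uint64
	Length uint64
}

// Range is a single ranged download covering one or more chunks.
type Range struct {
	File   string
	Offset uint64
	Length uint64
}

// coalesce merges locations in the same file whose byte ranges fall within
// maxGap of each other, so many small chunk reads become fewer large fetches.
func coalesce(locs []Location, maxGap uint64) []Range {
	byFile := make(map[string][]Location)
	for _, l := range locs {
		byFile[l.File] = append(byFile[l.File], l)
	}

	var out []Range
	for file, ls := range byFile {
		sort.Slice(ls, func(i, j int) bool { return ls[i].Offset < ls[j].Offset })
		cur := Range{File: file, Offset: ls[0].Offset, Length: ls[0].Length}
		for _, l := range ls[1:] {
			end := cur.Offset + cur.Length
			if l.Offset <= end+maxGap {
				// Near-adjacent: extend the current range to cover this chunk.
				if newEnd := l.Offset + l.Length; newEnd > end {
					cur.Length = newEnd - cur.Offset
				}
			} else {
				out = append(out, cur)
				cur = Range{File: file, Offset: l.Offset, Length: l.Length}
			}
		}
		out = append(out, cur)
	}
	return out
}
```

Batching addresses by resolved location rather than in fixed-size, location-agnostic groups is what gives this kind of merging a real chance to fire.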

If you fix this latter issue, the next thing to tackle is download concurrency and scheduling in general. The remote store implementation currently has some pretty simple hard-coded concurrency parameters: how many concurrent small fetches we make, how many concurrent large fetches we make, what our threshold is for hedging a download, etc. Better adaptive algorithms there would let us utilize the available bandwidth more fully and perform well in a variety of different network conditions.
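
For illustration, a minimal sketch of hedging one download (hypothetical signature, not the remote store's actual code): if the first attempt has not finished within a threshold, launch a second attempt and take whichever completes first. An adaptive policy would tune that threshold and the concurrency limits from observed latencies instead of hard-coding them.

```go
// Hypothetical sketch of a hedged download; not Dolt's actual remotestorage code.
package hedgesketch

import (
	"context"
	"time"
)

type result struct {
	data []byte
	err  error
}

// hedgedDownload runs fetch once, and if it has not completed within
// hedgeAfter, starts a second attempt and returns the first success.
func hedgedDownload(ctx context.Context, hedgeAfter time.Duration,
	fetch func(context.Context) ([]byte, error)) ([]byte, error) {

	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancel whichever attempt is still running on return

	results := make(chan result, 2)
	attempt := func() {
		d, err := fetch(ctx)
		results <- result{d, err}
	}

	go attempt()

	timer := time.NewTimer(hedgeAfter)
	defer timer.Stop()

	select {
	case r := <-results:
		return r.data, r.err
	case <-timer.C:
		go attempt() // hedge: the first attempt is slow, race a second one
	}

	// Two attempts are now in flight; prefer the first one that succeeds.
	r := <-results
	if r.err == nil {
		return r.data, nil
	}
	r2 := <-results
	return r2.data, r2.err
}
```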