To document as a starting point:
Currently `fetch` is slower than `clone` because `clone` downloads all the table files in bulk, while `fetch` starts from the DAG and downloads the missing chunks, walking the DAG as it goes. `fetch` does try to use concurrency, and in theory it could be much better, but it's inherently more expensive because it's doing lots of little fetches, making RPCs to compute the locations of the fetches it needs to do, making local queries to see if it needs to make those fetches, etc.
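
As an illustration of the asymmetry (this is not the actual Dolt code; the types and functions are hypothetical), a DAG-walking fetch looks roughly like the loop below: each level of the tree requires local has-checks and another round of remote requests before the next level's addresses are even known, whereas clone just pulls table files down in bulk.

```go
// Sketch only: hypothetical types standing in for Dolt's chunk machinery.
package sketch

type Hash string

// Chunk is a hypothetical decoded chunk that knows its child references.
type Chunk struct {
	Addr Hash
	Refs []Hash
}

// fetchMissing walks from the root, asking the local store what it already
// has, then fetching the rest from the remote one tree level at a time.
func fetchMissing(root Hash, hasLocal func(Hash) bool, getRemote func([]Hash) []Chunk) {
	level := []Hash{root}
	for len(level) > 0 {
		// Local queries to decide what actually needs fetching.
		var need []Hash
		for _, h := range level {
			if !hasLocal(h) {
				need = append(need, h)
			}
		}
		// One or more RPCs per level, instead of one bulk download.
		chunks := getRemote(need)
		// Collect the next level's addresses from the chunks we just fetched.
		level = level[:0]
		for _, c := range chunks {
			level = append(level, c.Refs...)
		}
	}
}
```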
There are lots of things we can do to improve the performance of `fetch`. In general, we have experimental rewrites which allow us to structure pulling as more of a pipeline, and we could iterate on those to get better parallelism and control over the concurrency of downloads and chunk ref walks, for example.
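
A rough sketch of what a pipelined pull could look like, with hypothetical stage functions and types (not the experimental rewrites themselves): ref walking, address resolution, and downloading each run in their own goroutines, connected by channels, so each stage's concurrency can be tuned independently and a slow download doesn't stall the ref walk.

```go
// Sketch only: hypothetical pipeline stages for a pull.
package sketch

import "sync"

type Hash string

type location struct {
	file   string
	offset uint64
	length uint32
}

// pullPipeline wires walk -> resolve -> download together. Resolvers and
// downloaders each fan out to nWorkers goroutines.
func pullPipeline(
	walk func(out chan<- Hash),
	resolve func(Hash) location,
	download func(location),
	nWorkers int,
) {
	addrs := make(chan Hash, 1024)
	locs := make(chan location, 1024)

	// Stage 1: walk chunk refs and emit addresses that need fetching.
	go func() {
		defer close(addrs)
		walk(addrs)
	}()

	// Stage 2: resolve addresses to physical locations concurrently.
	var wgResolve sync.WaitGroup
	for i := 0; i < nWorkers; i++ {
		wgResolve.Add(1)
		go func() {
			defer wgResolve.Done()
			for a := range addrs {
				locs <- resolve(a)
			}
		}()
	}
	go func() {
		wgResolve.Wait()
		close(locs)
	}()

	// Stage 3: download resolved locations concurrently.
	var wgDownload sync.WaitGroup
	for i := 0; i < nWorkers; i++ {
		wgDownload.Add(1)
		go func() {
			defer wgDownload.Done()
			for l := range locs {
				download(l)
			}
		}()
	}
	wgDownload.Wait()
}
```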
One thing we know we need to do, but have not gotten around to, is to make I/O scheduling better on the pull path. As pulling proceeds, it gets to the leaf layers of the prolly trees, which have very high fan-out, and it ends up with a large number of chunk addresses for the chunks it needs to fetch. Currently the only interface available to it is to call into `(ChunkStore).GetManyCompressed` with a subset of those addresses and to download a batch at a time. In reality, you want to concurrently resolve all in-flight chunk addresses to physical locations, and you want to batch resolved chunk locations into large fetches by coalescing the fetches of (near-adjacent) chunks with each other. The current `ChunkStore` implementation does coalesce, but because we group the chunk addresses into fixed-size batches in a location-agnostic way, it misses out on lots of opportunities to actually batch. Similarly, because the interface is just `GetManyCompressed(addrs []hash.Hash)`, we do not resolve the addresses to physical locations until we go into that call, and we do it in a sequential fashion – we resolve all the addresses in the batch, batch the locations all up into downloads as best we can, and then send off the concurrent downloads. Similarly, the `GetManyCompressed` interface only returns after all of the chunks have been fetched, at which point we can use that concurrency slot for another batch of chunk addresses. The end result is that fetch performance suffers a lot from this batch-at-a-time structure.
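
Here is a sketch of the coalescing idea described above, assuming a hypothetical `chunkLocation` type that a resolved address maps to (this is not the `ChunkStore` API): sort resolved locations by file and offset, then merge near-adjacent ranges into a small number of large reads, trading a little wasted transfer for far fewer requests.

```go
// Sketch only: coalescing resolved chunk locations into large range reads.
package sketch

import "sort"

type chunkLocation struct {
	file   string
	offset uint64
	length uint64
}

type byteRange struct {
	file       string
	start, end uint64 // [start, end)
}

// coalesce merges locations in the same file whose gaps are at most maxGap
// bytes, producing a handful of large fetches instead of many small ones.
func coalesce(locs []chunkLocation, maxGap uint64) []byteRange {
	sort.Slice(locs, func(i, j int) bool {
		if locs[i].file != locs[j].file {
			return locs[i].file < locs[j].file
		}
		return locs[i].offset < locs[j].offset
	})
	var out []byteRange
	for _, l := range locs {
		end := l.offset + l.length
		if n := len(out); n > 0 &&
			out[n-1].file == l.file &&
			l.offset <= out[n-1].end+maxGap {
			// Near-adjacent to the previous range in the same file: extend it.
			if end > out[n-1].end {
				out[n-1].end = end
			}
			continue
		}
		out = append(out, byteRange{file: l.file, start: l.offset, end: end})
	}
	return out
}
```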
If you fix this latter issue, the next thing to tackle is download concurrency and scheduling in general. The remote store implementation currently has some pretty simple hard-coded concurrency parameters – how many concurrent small fetches we make, how many concurrent large fetches we make, what our threshold for hedging a download is, etc. Better adaptive algorithms there would allow us to better utilize the available bandwidth and to perform well in a variety of different network conditions.
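
For reference, a minimal sketch of download hedging, using a hypothetical helper rather than the actual remote store code: if the first attempt hasn't completed within a threshold, start a second identical request and take whichever finishes first.

```go
// Sketch only: a hedged fetch racing two attempts after a latency threshold.
package sketch

import (
	"context"
	"time"
)

// hedgedGet runs fetch once; if it is still outstanding after threshold, it
// launches a second copy. The first result wins and the loser is cancelled.
func hedgedGet(ctx context.Context, threshold time.Duration,
	fetch func(context.Context) ([]byte, error)) ([]byte, error) {

	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	type result struct {
		data []byte
		err  error
	}
	results := make(chan result, 2)
	run := func() {
		data, err := fetch(ctx)
		results <- result{data, err}
	}

	go run()
	timer := time.NewTimer(threshold)
	defer timer.Stop()

	select {
	case r := <-results:
		// First attempt finished before the hedging threshold.
		return r.data, r.err
	case <-timer.C:
		go run() // hedge: second attempt racing the first
	}
	r := <-results
	return r.data, r.err
}
```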
It looks like we do lots of parallel downloads for `clone` but for `fetch` we proceed serially. This makes a large difference for large databases. For instance:
https://www.dolthub.com/repositories/dolthub/transparency-in-pricing
I can clone it (65GB) in 15 minutes.
But adding a fork and fetching it has been running for 2 hours and I have just under 10GB.
We should do the same parallel thing we do on `clone` for `fetch` if we can.