To document as a starting point:
Currently `fetch` is slower than `clone` because `clone` downloads all the table files in bulk, while `fetch` starts from the DAG and downloads the missing chunks, walking the DAG as it goes. `fetch` does try to use concurrency, and in theory it could be much better, but it's inherently more expensive because it's doing lots of little fetches, making RPCs to compute the locations of the fetches it needs to do, making local queries to see if it needs to make those fetches, etc.
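
As an illustration of the asymmetry (this is not the actual Dolt code; the types and functions are hypothetical), a DAG-walking fetch looks roughly like the loop below: each level of the tree requires local has-checks and another round of remote requests before the next level's addresses are even known, whereas clone just pulls table files down in bulk.

```go
// Sketch only: hypothetical types standing in for Dolt's chunk machinery.
package sketch

type Hash string

// Chunk is a hypothetical decoded chunk that knows its child references.
type Chunk struct {
	Addr Hash
	Refs []Hash
}

// fetchMissing walks from the root, asking the local store what it already
// has, then fetching the rest from the remote one tree level at a time.
func fetchMissing(root Hash, hasLocal func(Hash) bool, getRemote func([]Hash) []Chunk) {
	level := []Hash{root}
	for len(level) > 0 {
		// Local queries to decide what actually needs fetching.
		var need []Hash
		for _, h := range level {
			if !hasLocal(h) {
				need = append(need, h)
			}
		}
		// One or more RPCs per level, instead of one bulk download.
		chunks := getRemote(need)
		// Collect the next level's addresses from the chunks we just fetched.
		level = level[:0]
		for _, c := range chunks {
			level = append(level, c.Refs...)
		}
	}
}
```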
There are lots of things we can do to improve the performance of `fetch`. In general, we have experimental rewrites which allow us to structure pulling as more of a pipeline, and we could iterate on those to get better parallelism and control over the concurrency of downloads and chunk ref walks, for example.
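
A rough sketch of what a pipelined pull could look like, with hypothetical stage functions and types (not the experimental rewrites themselves): ref walking, address resolution, and downloading each run in their own goroutines, connected by channels, so each stage's concurrency can be tuned independently and a slow download doesn't stall the ref walk.

```go
// Sketch only: hypothetical pipeline stages for a pull.
package sketch

import "sync"

type Hash string

type location struct {
	file   string
	offset uint64
	length uint32
}

// pullPipeline wires walk -> resolve -> download together. Resolvers and
// downloaders each fan out to nWorkers goroutines.
func pullPipeline(
	walk func(out chan<- Hash),
	resolve func(Hash) location,
	download func(location),
	nWorkers int,
) {
	addrs := make(chan Hash, 1024)
	locs := make(chan location, 1024)

	// Stage 1: walk chunk refs and emit addresses that need fetching.
	go func() {
		defer close(addrs)
		walk(addrs)
	}()

	// Stage 2: resolve addresses to physical locations concurrently.
	var wgResolve sync.WaitGroup
	for i := 0; i < nWorkers; i++ {
		wgResolve.Add(1)
		go func() {
			defer wgResolve.Done()
			for a := range addrs {
				locs <- resolve(a)
			}
		}()
	}
	go func() {
		wgResolve.Wait()
		close(locs)
	}()

	// Stage 3: download resolved locations concurrently.
	var wgDownload sync.WaitGroup
	for i := 0; i < nWorkers; i++ {
		wgDownload.Add(1)
		go func() {
			defer wgDownload.Done()
			for l := range locs {
				download(l)
			}
		}()
	}
	wgDownload.Wait()
}
```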
One thing we know we need to do, but have not gotten around to, is to make I/O scheduling better on the pull path. As pulling proceeds, it gets to the leaf layers of the prolly trees, which have very high fan-out, and it ends up with a large number of chunk addresses for the chunks it needs to fetch. Currently the only interface available to it is to call into `(ChunkStore).GetManyCompressed` with a subset of those addresses and to download a batch at a time. In reality, you want to concurrently resolve all in-flight chunk addresses to physical locations, and you want to batch resolved chunk locations into large fetches by coalescing the fetches of (near-adjacent) chunks with each other. The current `ChunkStore` implementation does coalesce, but because we group the chunk addresses into fixed-size batches in a location-agnostic way, it misses out on lots of opportunities to actually batch. Similarly, because the interface is just `GetManyCompressed(addrs []hash.Hash)`, we do not resolve the addresses to physical locations until we go into that call, and we do it in a sequential fashion – we resolve all the addresses in the batch, batch the locations all up into downloads as best we can, and then send off the concurrent downloads. Similarly, the `GetManyCompressed` interface only returns after all of the chunks have been fetched, at which point we can use that concurrency slot for another batch of chunk addresses. The end result is that fetch performance suffers a lot from this batch-at-a-time structure.
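
Here is a sketch of the coalescing idea described above, assuming a hypothetical `chunkLocation` type that a resolved address maps to (this is not the `ChunkStore` API): sort resolved locations by file and offset, then merge near-adjacent ranges into a small number of large reads, trading a little wasted transfer for far fewer requests.

```go
// Sketch only: coalescing resolved chunk locations into large range reads.
package sketch

import "sort"

type chunkLocation struct {
	file   string
	offset uint64
	length uint64
}

type byteRange struct {
	file       string
	start, end uint64 // [start, end)
}

// coalesce merges locations in the same file whose gaps are at most maxGap
// bytes, producing a handful of large fetches instead of many small ones.
func coalesce(locs []chunkLocation, maxGap uint64) []byteRange {
	sort.Slice(locs, func(i, j int) bool {
		if locs[i].file != locs[j].file {
			return locs[i].file < locs[j].file
		}
		return locs[i].offset < locs[j].offset
	})
	var out []byteRange
	for _, l := range locs {
		end := l.offset + l.length
		if n := len(out); n > 0 &&
			out[n-1].file == l.file &&
			l.offset <= out[n-1].end+maxGap {
			// Near-adjacent to the previous range in the same file: extend it.
			if end > out[n-1].end {
				out[n-1].end = end
			}
			continue
		}
		out = append(out, byteRange{file: l.file, start: l.offset, end: end})
	}
	return out
}
```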
If you fix this latter issue, the next thing to tackle is download concurrency and scheduling in general. The remote store implementation currently has some pretty simple hard-coded concurrency parameters – how many concurrent small fetches we make, how many concurrent large fetches we make, what our threshold for hedging a download is, etc. Better adaptive algorithms there would allow us to better utilize the available bandwidth and to perform well in a variety of different network conditions.
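
For reference, a minimal sketch of download hedging, using a hypothetical helper rather than the actual remote store code: if the first attempt hasn't completed within a threshold, start a second identical request and take whichever finishes first.

```go
// Sketch only: a hedged fetch racing two attempts after a latency threshold.
package sketch

import (
	"context"
	"time"
)

// hedgedGet runs fetch once; if it is still outstanding after threshold, it
// launches a second copy. The first result wins and the loser is cancelled.
func hedgedGet(ctx context.Context, threshold time.Duration,
	fetch func(context.Context) ([]byte, error)) ([]byte, error) {

	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	type result struct {
		data []byte
		err  error
	}
	results := make(chan result, 2)
	run := func() {
		data, err := fetch(ctx)
		results <- result{data, err}
	}

	go run()
	timer := time.NewTimer(threshold)
	defer timer.Stop()

	select {
	case r := <-results:
		// First attempt finished before the hedging threshold.
		return r.data, r.err
	case <-timer.C:
		go run() // hedge: second attempt racing the first
	}
	r := <-results
	return r.data, r.err
}
```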
It looks like we do lots of parallel downloads for `clone` but for `fetch` we proceed serially. This makes a large difference for large databases. For instance:
https://www.dolthub.com/repositories/dolthub/transparency-in-pricing
I can clone it (65GB) in 15 minutes.
But adding a fork and fetching it has been running for 2 hours and I have just under 10GB.
We should do the same parallel thing we do on `clone` for `fetch` if we can.