Unload big partitions: automatically tune schema splits and ops per second on retry for timeouts

datastax / dsbulk

DataStax Bulk Loader (DSBulk) is an open-source, Apache-licensed, unified tool for loading into and unloading from Apache Cassandra(R), DataStax Astra and DataStax Enterprise (DSE)

Unload big partitions: automatically tune schema splits and ops per second on retry for timeouts #449

Open phact opened 2 years ago

phact commented 2 years ago

Users dumping entire tables often hit timeouts when they reach large partitions. The current workaround is to manually tune splits and throughput until the unload succeeds, but this is very time-consuming and error-prone.

Would be great if dsbulk could handle this common scenario by itself.
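
To make the manual workaround concrete, here is a minimal sketch of the retry schedule a wrapper script might follow today: each attempt doubles the number of splits and halves the throughput cap. The function names, the doubling/halving policy, and the default values are illustrative assumptions, not anything dsbulk does itself; the sketch also assumes the `--schema.splits` and `--executor.maxPerSecond` options are the relevant knobs.

```python
from typing import Iterator, List, Tuple


def retry_plan(splits: int, max_per_second: int, attempts: int = 4,
               min_per_second: int = 100) -> Iterator[Tuple[int, int]]:
    """Yield (splits, max_per_second) for each attempt.

    Hypothetical policy: double the splits and halve the throughput cap
    after every failed attempt, never going below min_per_second.
    """
    for _ in range(attempts):
        yield splits, max_per_second
        splits *= 2
        max_per_second = max(min_per_second, max_per_second // 2)


def unload_command(keyspace: str, table: str, out_dir: str,
                   splits: int, max_per_second: int) -> List[str]:
    """Build the argument list for one unload attempt (assumed option names)."""
    return [
        "dsbulk", "unload",
        "-k", keyspace, "-t", table, "-url", out_dir,
        "--schema.splits", str(splits),
        "--executor.maxPerSecond", str(max_per_second),
    ]
```

A wrapper could run each command in turn (for example with `subprocess.run`) and stop at the first attempt that succeeds; the request in this issue is for dsbulk to do the equivalent internally.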

adutra commented 2 years ago

Related: #448.

adutra commented 2 years ago

I don't think tuning splits would make a big difference, and by the way, it's nearly impossible: the splits determine how many token ranges are going to be read, so this is decided at a very early phase.

But tuning throughput, yes, definitely. Probably based on latencies, and probably governed by a high/low watermark system.
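
As a rough illustration of the high/low watermark idea, the sketch below lowers the allowed rate multiplicatively when an observed latency exceeds the high watermark and raises it additively when latency drops below the low watermark. The class name, the thresholds, and the additive-increase/multiplicative-decrease policy are all assumptions made for the example; nothing here reflects an existing dsbulk implementation.

```python
class WatermarkThrottle:
    """Adjust an ops-per-second cap from observed latencies (hypothetical sketch)."""

    def __init__(self, rate: float, low_ms: float, high_ms: float,
                 min_rate: float = 100.0, max_rate: float = 50_000.0,
                 backoff: float = 0.5, increment: float = 50.0) -> None:
        if low_ms >= high_ms:
            raise ValueError("low_ms must be below high_ms")
        self.rate = float(rate)
        self.low_ms = low_ms
        self.high_ms = high_ms
        self.min_rate = min_rate
        self.max_rate = max_rate
        self.backoff = backoff      # multiplicative decrease factor
        self.increment = increment  # additive increase step

    def observe(self, latency_ms: float) -> float:
        """Record one latency sample and return the updated rate cap."""
        if latency_ms > self.high_ms:
            # Above the high watermark: back off quickly.
            self.rate = max(self.min_rate, self.rate * self.backoff)
        elif latency_ms < self.low_ms:
            # Below the low watermark: probe upward slowly.
            self.rate = min(self.max_rate, self.rate + self.increment)
        # Between the watermarks: leave the rate unchanged.
        return self.rate
```

The band between the two watermarks acts as hysteresis, so the rate does not oscillate on every sample. In practice the input would more likely be a percentile over a window of requests than a single latency.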

phact commented 2 years ago

> I don't think tuning splits would make a big difference

It does. This is how I've had to do things many times when dsbulk unload fails.

The reason is usually a big partition; smaller splits can help the unload actually finish. Sometimes, if that doesn't do the trick, we end up having to bisect the token range around the big partition and then throttle.
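
The bisection step described above could be sketched as follows. The function repeatedly halves a token range, setting aside the half that does not contain the problematic token, until the remaining range is narrow enough to be read on its own with a low throughput cap. The function name and the `max_width` parameter are hypothetical, and the sketch assumes a non-wrapping range that is exclusive at `start` and inclusive at `end`.

```python
from typing import List, Tuple

TokenRange = Tuple[int, int]  # (start, end], start exclusive, end inclusive


def split_around(start: int, end: int, token: int,
                 max_width: int) -> Tuple[List[TokenRange], TokenRange]:
    """Bisect (start, end] until the sub-range holding `token` is small.

    Returns all resulting ranges in order, plus the narrow "hot" range
    that contains the token of the big partition.
    """
    if not (start < token <= end):
        raise ValueError("token must lie inside (start, end]")
    if max_width < 1:
        raise ValueError("max_width must be at least 1")
    done: List[TokenRange] = []
    while end - start > max_width:
        mid = start + (end - start) // 2
        if token <= mid:
            done.append((mid, end))   # upper half is safe to read normally
            end = mid
        else:
            done.append((start, mid))  # lower half is safe to read normally
            start = mid
    hot = (start, end)
    return sorted(done + [hot]), hot
```

For example, `split_around(0, 16, 5, 4)` yields the ranges `(0, 4]`, `(4, 8]` and `(8, 16]`, with `(4, 8]` as the hot range. The safe ranges could then be unloaded at normal speed and only the hot range throttled.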