bulk: audit existing cluster settings and update documentation

msbutler commented 2 years ago

Bulk jobs interact with many tunable cluster settings. Some of these have public and/or internal tuning advice. This documentation may be outdated and should be audited and updated. Further, some cluster settings may need to be set to private or removed altogether. Below is an attempt to list all tunable cluster settings the DR team should consider auditing (at least for 22.2):

[ ] bulkio.backup.checkpoint_interval
[ ] bulkio.backup.file_size
[ ] bulkio.backup.merge_file_buffer_size
[ ] bulkio.backup.read_retry_delay
[ ] bulkio.backup.read_timeout
[ ] bulkio.column_backfill.batch_size
[ ] bulkio.column_backfill.update_chunk_size_threshold_bytes
[ ] bulkio.import.processors_per_node
[ ] bulkio.import.reader_parallelism
[ ] bulkio.import.replan_flow_frequency
[ ] bulkio.import.replan_flow_threshold
[ ] bulkio.index_backfill.batch_size
[ ] bulkio.index_backfill.checkpoint_interval
[ ] bulkio.index_backfill.merge_batch_bytes
[ ] bulkio.index_backfill.merge_batch_size
[ ] bulkio.index_backfill.merge_num_workers
[ ] bulkio.ingest.flush_delay
[ ] bulkio.ingest.sender_concurrency_limit
[ ] bulkio.restore.replan_flow_frequency
[ ] bulkio.restore.replan_flow_threshold
[ ] kv.bulk_ingest.batch_size
[ ] kv.bulk_ingest.index_buffer_size
[ ] kv.bulk_ingest.max_index_buffer_size
[ ] kv.bulk_ingest.max_pk_buffer_size
[ ] kv.bulk_ingest.pk_buffer_size
[ ] kv.bulk_ingest.stream_external_ssts.suffix_cache_size
[ ] kv.bulk_io_write.concurrent_addsstable_as_writes_requests
[ ] kv.bulk_io_write.concurrent_addsstable_requests
[ ] kv.bulk_io_write.concurrent_export_requests
[ ] kv.bulk_io_write.max_rate
[ ] kv.bulk_io_write.restore_node_concurrency
[ ] kv.bulk_io_write.small_write_size
[ ] kv.bulk_sst.max_allowed_overage
[ ] kv.bulk_sst.max_request_time
[ ] kv.bulk_sst.sync_size
[ ] kv.bulk_sst.target_size
[ ] rocksdb.ingest_backpressure.l0_file_count_threshold
[ ] rocksdb.ingest_backpressure.max_delay
[ ] rocksdb.min_wal_sync_interval
[ ] schemachanger.backfiller.buffer_size
[ ] schemachanger.backfiller.max_buffer_size
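
One way to start the audit is to group the settings above by prefix and generate a `SHOW CLUSTER SETTING` statement for each, so current values can be recorded alongside the docs. The following is only a sketch: the helper names (`group_by_prefix`, `show_statements`) are made up for illustration, and `SETTINGS` is an abbreviated sample of the checklist, not the full list.

```python
from collections import defaultdict

# Abbreviated sample of the settings in the checklist; extend as needed.
SETTINGS = [
    "bulkio.backup.file_size",
    "bulkio.backup.read_timeout",
    "bulkio.import.reader_parallelism",
    "kv.bulk_io_write.concurrent_addsstable_requests",
    "kv.bulk_io_write.restore_node_concurrency",
    "kv.bulk_sst.target_size",
    "rocksdb.min_wal_sync_interval",
]


def group_by_prefix(names, depth=2):
    """Group setting names by their first `depth` dot-separated components."""
    groups = defaultdict(list)
    for name in names:
        prefix = ".".join(name.split(".")[:depth])
        groups[prefix].append(name)
    return dict(groups)


def show_statements(names):
    """Build one SHOW CLUSTER SETTING statement per setting name."""
    return [f"SHOW CLUSTER SETTING {name};" for name in names]


if __name__ == "__main__":
    # Print the statements grouped under a SQL comment per prefix.
    for prefix, names in sorted(group_by_prefix(SETTINGS).items()):
        print(f"-- {prefix}")
        for stmt in show_statements(names):
            print(stmt)
```

The output can be pasted into a SQL shell; grouping by prefix keeps related settings (for example, everything under `kv.bulk_io_write`) together when reviewing them.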

Notes from Matt:

Cluster settings abound!

I think bool settings are a good place to start. Check these searches:

Rough criteria are:

- Has defaulted to true for a long time, say 2 major versions.
  - Ditto for false!
- We don’t see a reason that a user would or should ever change the setting moving forward.

We do want to keep settings around for new functionality that needs maturing (i.e. feature flags), or known cases of a client needing to do things differently than default.
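
The rough criteria above could be sketched as a simple filter. This is an illustration only: the `BoolSetting` record, its fields, and the example setting names are hypothetical, and the "no reason to change it" judgment still has to come from a person, which is modeled here as the `known_override` flag.

```python
from dataclasses import dataclass

# "Say 2 major versions", per the criteria above.
MIN_STABLE_RELEASES = 2


@dataclass
class BoolSetting:
    name: str
    # Consecutive major versions in which the default (true or false) was unchanged.
    releases_with_same_default: int
    # Guards new functionality that still needs maturing.
    is_feature_flag: bool = False
    # A user is known to need a non-default value.
    known_override: bool = False


def is_removal_candidate(setting: BoolSetting) -> bool:
    """Return True if the setting meets the rough criteria for retirement."""
    if setting.is_feature_flag or setting.known_override:
        return False
    return setting.releases_with_same_default >= MIN_STABLE_RELEASES


if __name__ == "__main__":
    # Hypothetical settings, not real CockroachDB names.
    examples = [
        BoolSetting("example.old_stable_toggle", 3),
        BoolSetting("example.new_feature.enabled", 1, is_feature_flag=True),
        BoolSetting("example.customer_tuned_toggle", 4, known_override=True),
    ]
    for s in examples:
        print(s.name, is_removal_candidate(s))
```

Settings flagged by such a filter are candidates to call out and debate in the comments, not automatic removals.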

Let’s call out / debate individual settings as comments.

Jira issue: CRDB-19292

blathers-crl[bot] commented 2 years ago

cc @cockroachdb/bulk-io

shermanCRL commented 2 years ago

(Out of scope but I’d love to see a Docs page for every setting -- why you would use it, how it interacts with other settings, risks & trade-offs. cc @kathancox)

msbutler commented 1 year ago

fwiw, I just ran our restore tpccInc roachtest on 23.1. i.e.: "RESTORE DATABASE tpcc FROM '/2022/09/07-000000.00' IN 'gs://cockroach-fixtures/tpcc-incrementals-22.2?AUTH=implicit' AS OF SYSTEM TIME '2022-09-07 12:15:00' WITH detached"

On a cluster with the following topology: roachprod create $CLUSTER -n 4 --gce-machine-type="n1-standard-8" --gce-pd-volume-size=1000 --local-ssd=false

Increasing kv.bulk_io_write.concurrent_addsstable_requests and kv.bulk_io_write.restore_node_concurrency from 1 to 5 had no measurable effect on throughput.
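
For anyone repeating this experiment, the setting changes could be generated as below. The comment above does not show the exact statements that were run, so this is an assumed sketch: the helper names are hypothetical, and only the two setting names and the target value of 5 come from the thread.

```python
# Settings changed in the experiment and the value they were raised to.
OVERRIDES = {
    "kv.bulk_io_write.concurrent_addsstable_requests": 5,
    "kv.bulk_io_write.restore_node_concurrency": 5,
}


def set_statements(overrides):
    """Build SET CLUSTER SETTING statements to apply before the restore."""
    return [f"SET CLUSTER SETTING {name} = {value};" for name, value in overrides.items()]


def reset_statements(overrides):
    """Build RESET CLUSTER SETTING statements to restore the defaults afterwards."""
    return [f"RESET CLUSTER SETTING {name};" for name in overrides]


if __name__ == "__main__":
    for stmt in set_statements(OVERRIDES) + reset_statements(OVERRIDES):
        print(stmt)
```

Running the restore once with the defaults and once after applying the `SET` statements gives the two throughput numbers to compare.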

cockroachdb / cockroach

bulk: audit existing cluster settings and update documentation #87356