Update drtprod script for drt-large and workload-large #126807

Closed: nameisbhaskar closed this 1 month ago

nameisbhaskar commented 3 months ago

Currently, operating on drt-large and workload-large requires many manual steps and hand-supplied parameters. This ticket tracks adding those steps to the drtprod script to reduce manual intervention.

drt-large

  1. start - the parameters should be baked into the script: ./scripts/drtprod start drt-large --binary ./cockroach --args=--log="file-defaults: {dir: 'logs', max-group-size: 1GiB}" --store-count=16 --restart=false
  2. As part of start, the following cronjob should be added: ./scripts/drtprod run drt-large -- "sudo systemctl unmask cron.service ; sudo systemctl enable cron.service ; echo \"crontab -l ; echo '@reboot sleep 100 && ~/cockroach.sh' | crontab -\" > t.sh ; sh t.sh ; rm t.sh"
  3. load-balancer - create the load balancer: ./scripts/drtprod load-balancer create drt-large
  4. Alter range replicas for the running workload (this should be a separate command):
    ALTER RANGE timeseries CONFIGURE ZONE USING num_replicas=5;
    ALTER RANGE timeseries CONFIGURE ZONE USING num_voters=5;
    ALTER RANGE default CONFIGURE ZONE USING num_replicas=5;
    ALTER RANGE default CONFIGURE ZONE USING num_voters=5;
  5. All the steps to enable drt-large should be wrapped in a single command, e.g. setup, covering create, start, load-balancer creation, and the range-replica changes (see the sketch after this list).
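
A minimal sketch of what that combined setup command could look like, composed from the commands already listed above. The wrapper itself is an assumption, as is the omission of the cluster create invocation, which is not shown in this issue:

    #!/usr/bin/env bash
    # Hypothetical "setup" wrapper for drt-large; each command below is taken
    # verbatim from steps 1-4 above. A cluster "create" step would normally
    # run first, but its invocation is not shown in this issue.
    set -euo pipefail
    CLUSTER=drt-large

    # Step 1: start with the standard arguments baked in.
    ./scripts/drtprod start "$CLUSTER" --binary ./cockroach \
        --args=--log="file-defaults: {dir: 'logs', max-group-size: 1GiB}" \
        --store-count=16 --restart=false

    # Step 2: install the @reboot cronjob on every node.
    ./scripts/drtprod run "$CLUSTER" -- "sudo systemctl unmask cron.service ; sudo systemctl enable cron.service ; echo \"crontab -l ; echo '@reboot sleep 100 && ~/cockroach.sh' | crontab -\" > t.sh ; sh t.sh ; rm t.sh"

    # Step 3: create the load balancer.
    ./scripts/drtprod load-balancer create "$CLUSTER"

    # Step 4: widen replication for the timeseries and default ranges.
    ./scripts/drtprod sql "$CLUSTER":1 -- -e "
        ALTER RANGE timeseries CONFIGURE ZONE USING num_replicas=5;
        ALTER RANGE timeseries CONFIGURE ZONE USING num_voters=5;
        ALTER RANGE default CONFIGURE ZONE USING num_replicas=5;
        ALTER RANGE default CONFIGURE ZONE USING num_voters=5;"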

workload

  1. tpcc workload init - should take the workload cluster name as input and automatically target drt-<large/chaos> based on that name. The IP should be extracted via drtprod pgurl drt-<large/chaos>:1 rather than hardcoded (see the sketch after the command below). Parameters like warehouses and regions should be constants.
nohup ./workload init tpcc \
    --data-loader=IMPORT \
    --partitions=3 \
    --warehouses=150000 \
    --survival-goal region \
    --regions=northamerica-northeast2,us-east5,us-central1 \
    'postgres://roachprod:cockroachdb@10.188.0.60:26257?sslcert=certs%2Fclient.roachprod.crt&sslkey=certs%2Fclient.roachprod.key&sslmode=verify-full&sslrootcert=certs%2Fca.crt' &
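
A rough sketch of that wiring, assuming the workload-name-to-cluster mapping described in item 1; the exact output format of drtprod pgurl is also an assumption:

    #!/usr/bin/env bash
    # Hypothetical helper: derive the target CRDB cluster from the workload
    # cluster name and fetch its pgurl instead of hardcoding 10.188.0.60.
    WORKLOAD_CLUSTER=$1                    # e.g. workload-large or workload-chaos
    CRDB_CLUSTER="drt-${WORKLOAD_CLUSTER#workload-}"

    # Constants, per the issue.
    WAREHOUSES=150000
    PARTITIONS=3
    REGIONS=northamerica-northeast2,us-east5,us-central1

    # pgurl for node 1; strip any quoting around the printed URL (the exact
    # quoting used by drtprod pgurl is an assumption here).
    PGURL=$(./scripts/drtprod pgurl "${CRDB_CLUSTER}":1 | tr -d "'")

    nohup ./workload init tpcc \
        --data-loader=IMPORT \
        --partitions=$PARTITIONS \
        --warehouses=$WAREHOUSES \
        --survival-goal region \
        --regions=$REGIONS \
        "$PGURL" &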
  2. Disable constraint validation: drtprod sql drt-large:1 -- -e "SET CLUSTER SETTING bulkio.import.constraint_validation.unsafe.enabled=false" (JIRA ticket: https://cockroachlabs.atlassian.net/browse/CRDB-40145)
  3. Run the tpcc workload: fix the attached script for running the tpcc workload, especially the part that creates the workload run scripts on each node (see the note after the script):

    
    for NODE in $(seq 1 $NUM_REGIONS)
    do
    NODE_OFFSET=$(( (NODE - 1) * NODES_PER_REGION + 1 ))
    LAST_NODE_IN_REGION=$(( NODE_OFFSET + NODES_PER_REGION - 1 ))
    
    # Since we're running a number of workers much smaller than the number of
    # warehouses, we have to do some strange math here. Workers are assigned to
    # warehouses in order (i.e. worker 1 will target warehouse 1). The
    # complication is that when we're partitioning the workload such that workers in
    # region 1 should only target warehouses in region 1, the workload binary will
    # not assign a worker if the warehouse is not in the specified region. As a
    # result, we must pass in a number of workers that is large enough to allow
    #  us to reach the specified region, and then add the actual number of workers
    #  we want to run.
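    # Worked example (assuming TPCC_WAREHOUSES=150000, NUM_REGIONS=3, and a
    # hypothetical NUM_WORKERS=1000): region 1 gets 0*50000+1000 = 1000
    # workers, region 2 gets 1*50000+1000 = 51000, and region 3 gets
    # 2*50000+1000 = 101000, so the worker count always reaches into the
    # target region's warehouse range.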
    EFFECTIVE_NUM_WORKERS=$(( TPCC_WAREHOUSES / NUM_REGIONS * (NODE - 1) + NUM_WORKERS ))
    
    PGURLS_REGION=$(./bin/roachprod pgurl --secure $CLUSTER:$NODE_OFFSET-$LAST_NODE_IN_REGION --external)
    
    cat <<EOF >/tmp/tpcc_run.sh
    #!/usr/bin/env bash

    j=0
    while true; do
      echo ">> Starting tpcc workload"
      ((j++))
      LOG=./tpcc_\$j.txt
      ./workload run tpcc \
        --ramp=10m \
        --conns=$NUM_CONNECTIONS \
        --workers=$EFFECTIVE_NUM_WORKERS \
        --warehouses=$TPCC_WAREHOUSES \
        --max-rate=$MAX_RATE \
        --duration=$RUN_DURATION \
        --wait=false \
        --partitions=3 \
        --partition-affinity=$(($NODE-1)) \
        --tolerate-errors \
        $PGURLS_REGION \
        --survival-goal region \
        --regions=$REGIONS | tee \$LOG
    done
    EOF
    done
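
Note that as written, the loop only generates /tmp/tpcc_run.sh on the machine driving the setup, overwriting it on each iteration; nothing copies it to the workload nodes. A hedged sketch of the missing distribution step, to run inside the loop after the heredoc, assuming one workload node per region in a cluster named $WORKLOAD_CLUSTER and standard roachprod put/ssh semantics:

    # Hypothetical: ship this region's script to its workload node and start it.
    # $WORKLOAD_CLUSTER and the one-node-per-region layout are assumptions.
    ./bin/roachprod put $WORKLOAD_CLUSTER:$NODE /tmp/tpcc_run.sh tpcc_run.sh
    ./bin/roachprod ssh $WORKLOAD_CLUSTER:$NODE -- \
        "chmod +x tpcc_run.sh && nohup ./tpcc_run.sh > tpcc_run.out 2>&1 &"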


[drt-setup-spot-chaos.txt](https://github.com/user-attachments/files/16123784/drt-setup-spot-chaos.txt)

Jira issue: CRDB-40147
vidit-bhat commented 3 months ago

Point 3 under drt-large is not needed separately. Could go into the setup command though.