Operating drt-large and workload-large currently requires many manual steps and hand-supplied parameters. This ticket tracks folding those steps into the drtprod script to reduce manual intervention.
drt-large
start - the following parameters should be baked into the script:
./scripts/drtprod start drt-large --binary ./cockroach --args=--log="file-defaults: {dir: 'logs', max-group-size: 1GiB}" --store-count=16 --restart=false
As part of start, the following cron job should be added:
./scripts/drtprod run drt-large -- "sudo systemctl unmask cron.service ; sudo systemctl enable cron.service ; echo \"crontab -l ; echo '@reboot sleep 100 && ~/cockroach.sh' | crontab -\" > t.sh ; sh t.sh ; rm t.sh"
Alter range replicas for running workload (should be a separate command):
ALTER RANGE timeseries CONFIGURE ZONE USING num_replicas=5;
ALTER RANGE timeseries CONFIGURE ZONE USING num_voters=5;
ALTER RANGE default CONFIGURE ZONE USING num_replicas=5;
ALTER RANGE default CONFIGURE ZONE USING num_voters=5;
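One way the new drtprod subcommand could generate these statements is a small helper parameterized on the replica count. This is a sketch; `build_zone_sql` and the piping into `drtprod sql` are assumptions for illustration, not existing drtprod API:

```shell
# Hypothetical helper: emit the ALTER RANGE zone-config statements for a
# given replica count, covering both the timeseries and default ranges.
build_zone_sql() {
  local n="$1"
  local range
  for range in timeseries default; do
    echo "ALTER RANGE ${range} CONFIGURE ZONE USING num_replicas=${n};"
    echo "ALTER RANGE ${range} CONFIGURE ZONE USING num_voters=${n};"
  done
}

# Emits the four statements listed above for n=5.
build_zone_sql 5
```

The subcommand could then feed each emitted statement to the cluster, e.g. `build_zone_sql 5 | while read -r stmt; do ./scripts/drtprod sql drt-large:1 -- -e "$stmt"; done` (the exact invocation is an assumption).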
All the steps to enable drt-large should be a single command, e.g. setup, covering create, start, load-balancer creation, and the alter range replicas step.
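A dry-run sketch of what such a combined setup command could chain together. The `drtprod` stub below just echoes its arguments, since the real script is not assumed to be on PATH; the subcommand names mirror the steps listed above, and `setup_drt_large` is a hypothetical name:

```shell
# Stub standing in for ./scripts/drtprod so the sketch runs anywhere (dry run).
drtprod() { echo "drtprod $*"; }

# Hypothetical single entry point chaining the individual steps in order.
setup_drt_large() {
  drtprod create drt-large
  drtprod start drt-large --binary ./cockroach --store-count=16 --restart=false
  drtprod load-balancer create drt-large
  drtprod sql drt-large:1 -- -e "ALTER RANGE default CONFIGURE ZONE USING num_replicas=5;"
}

setup_drt_large
```

With the stub in place this prints the four underlying commands in order, which is a cheap way to review the sequencing before wiring it to the real script.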
workload
tpcc workload init - should take the workload cluster name as input and automatically target drt-<large/chaos> based on that name. The node IP should be extracted from drtprod pgurl drt-<large/chaos>:1. Parameters such as warehouses and regions should be constants.
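The name mapping and IP extraction could look like the following sketch. Both helper names are hypothetical, and the pgurl parsing assumes the usual `postgres://user@host:port?...` shape; in practice the input string would come from `drtprod pgurl drt-<large/chaos>:1`:

```shell
# Hypothetical: map a workload cluster name to its drt cluster,
# e.g. workload-large -> drt-large, workload-chaos -> drt-chaos.
drt_cluster_for() {
  echo "drt-${1#workload-}"
}

# Hypothetical: pull the host/IP out of a pgurl string such as
# postgres://root@10.0.0.5:26257?sslmode=require
host_from_pgurl() {
  echo "$1" | sed -E 's|.*@([^:/?]+).*|\1|'
}

drt_cluster_for workload-large                                     # -> drt-large
host_from_pgurl "postgres://root@10.0.0.5:26257?sslmode=require"   # -> 10.0.0.5
```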
Run tpcc workload: fix the attached tpcc run script (especially the creation of the workload run scripts on each node):
for NODE in $(seq 1 "$NUM_REGIONS")
do
  NODE_OFFSET=$(( (NODE - 1) * NODES_PER_REGION + 1 ))
  LAST_NODE_IN_REGION=$(( NODE_OFFSET + NODES_PER_REGION - 1 ))
  # Since we're running a number of workers much smaller than the number of
  # warehouses, we have to do some strange math here. Workers are assigned to
  # warehouses in order (i.e. worker 1 will target warehouse 1). The
  # complication is that when we're partitioning the workload such that workers
  # in region 1 should only target warehouses in region 1, the workload binary
  # will not assign a worker if the warehouse is not in the specified region.
  # As a result, we must pass in a number of workers that is large enough to
  # allow us to reach the specified region, and then add the actual number of
  # workers we want to run.
  EFFECTIVE_NUM_WORKERS=$(( TPCC_WAREHOUSES / NUM_REGIONS * (NODE - 1) + NUM_WORKERS ))
  PGURLS_REGION=$(./bin/roachprod pgurl --secure "$CLUSTER:$NODE_OFFSET-$LAST_NODE_IN_REGION" --external)
  cat <<EOF >/tmp/tpcc_run.sh
#!/usr/bin/env bash
j=0
while true; do
  echo ">> Starting tpcc workload"
  ((j++))
  LOG=./tpcc_\$j.txt
  ./workload run tpcc \
    --ramp=10m \
    --conns=$NUM_CONNECTIONS \
    --workers=$EFFECTIVE_NUM_WORKERS \
    --warehouses=$TPCC_WAREHOUSES \
    --max-rate=$MAX_RATE \
    --duration=$RUN_DURATION \
    --wait=false \
    --partitions=3 \
    --partition-affinity=$(($NODE - 1)) \
    --survival-goal=region \
    --regions=$REGIONS \
    --tolerate-errors \
    $PGURLS_REGION | tee \$LOG
done
EOF
  # The generated /tmp/tpcc_run.sh still needs to be copied to, and launched
  # on, the workload nodes for this region; that distribution step is part of
  # what this ticket should fix.
done
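As a worked example of the worker math in the comment above, with illustrative numbers rather than the real constants: 9000 warehouses split across 3 regions, 100 real workers per region. For region index NODE=2, the worker count must first cover the 3000 warehouses owned by region 1 before region 2's own 100 workers start matching:

```shell
# Illustrative values only, not the production constants.
TPCC_WAREHOUSES=9000 NUM_REGIONS=3 NUM_WORKERS=100 NODE=2

# 9000/3 warehouses per region, skip (NODE-1)=1 region, then add 100 workers.
echo $(( TPCC_WAREHOUSES / NUM_REGIONS * (NODE - 1) + NUM_WORKERS ))  # prints 3100
```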
Load balancer (part of setup):
./scripts/drtprod load-balancer create drt-large
Cluster setting to be applied:
drtprod sql drt-large:1 -- -e "SET CLUSTER SETTING bulkio.import.constraint_validation.unsafe.enabled=false"
JIRA ticket - https://cockroachlabs.atlassian.net/browse/CRDB-40145