Update drtprod script for drt-large and workload-large #126807

Closed: nameisbhaskar closed this 1 month ago

nameisbhaskar commented 3 months ago

Currently, operating on drt-large and workload-large requires many manual steps and hand-supplied parameters. This ticket tracks adding those steps to the drtprod script to reduce manual intervention.

drt-large

  1. start - the parameters should be baked into the script: ./scripts/drtprod start drt-large --binary ./cockroach --args=--log="file-defaults: {dir: 'logs', max-group-size: 1GiB}" --store-count=16 --restart=false
  2. As part of start, the following cronjob should be added: ./scripts/drtprod run drt-large -- "sudo systemctl unmask cron.service ; sudo systemctl enable cron.service ; echo \"crontab -l ; echo '@reboot sleep 100 && ~/cockroach.sh' | crontab -\" > t.sh ; sh t.sh ; rm t.sh"
  3. load-balancer - create the load balancer: ./scripts/drtprod load-balancer create drt-large
  4. Alter range replicas for the running workload (this should be a separate command):
    ALTER RANGE timeseries CONFIGURE ZONE USING num_replicas=5;
    ALTER RANGE timeseries CONFIGURE ZONE USING num_voters=5;
    ALTER RANGE default CONFIGURE ZONE USING num_replicas=5;
    ALTER RANGE default CONFIGURE ZONE USING num_voters=5;
  5. All the steps to enable drt-large should be wrapped in a single command, e.g. setup, covering create, start, load-balancer creation, and the range-replica changes (see the sketch after this list).
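
A minimal sketch of what that combined setup command could look like, composed from the commands already listed above. The wrapper itself is an assumption, as is the omission of the cluster create invocation, which is not shown in this issue:

    #!/usr/bin/env bash
    # Hypothetical "setup" wrapper for drt-large; each command below is taken
    # verbatim from steps 1-4 above. A cluster "create" step would normally
    # run first, but its invocation is not shown in this issue.
    set -euo pipefail
    CLUSTER=drt-large

    # Step 1: start with the standard arguments baked in.
    ./scripts/drtprod start "$CLUSTER" --binary ./cockroach \
        --args=--log="file-defaults: {dir: 'logs', max-group-size: 1GiB}" \
        --store-count=16 --restart=false

    # Step 2: install the @reboot cronjob on every node.
    ./scripts/drtprod run "$CLUSTER" -- "sudo systemctl unmask cron.service ; sudo systemctl enable cron.service ; echo \"crontab -l ; echo '@reboot sleep 100 && ~/cockroach.sh' | crontab -\" > t.sh ; sh t.sh ; rm t.sh"

    # Step 3: create the load balancer.
    ./scripts/drtprod load-balancer create "$CLUSTER"

    # Step 4: widen replication for the timeseries and default ranges.
    ./scripts/drtprod sql "$CLUSTER":1 -- -e "
        ALTER RANGE timeseries CONFIGURE ZONE USING num_replicas=5;
        ALTER RANGE timeseries CONFIGURE ZONE USING num_voters=5;
        ALTER RANGE default CONFIGURE ZONE USING num_replicas=5;
        ALTER RANGE default CONFIGURE ZONE USING num_voters=5;"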

workload

  1. tpcc workload init - should take the workload cluster name as input and automatically target drt-<large/chaos> based on that name. The IP should be extracted via drtprod pgurl drt-<large/chaos>:1 rather than hardcoded (see the sketch after the command below). Parameters like warehouses and regions should be constants.
nohup ./workload init tpcc \
    --data-loader=IMPORT \
    --partitions=3 \
    --warehouses=150000 \
    --survival-goal region \
    --regions=northamerica-northeast2,us-east5,us-central1 \
    'postgres://roachprod:cockroachdb@10.188.0.60:26257?sslcert=certs%2Fclient.roachprod.crt&sslkey=certs%2Fclient.roachprod.key&sslmode=verify-full&sslrootcert=certs%2Fca.crt' &
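
A rough sketch of that wiring, assuming the workload-name-to-cluster mapping described in item 1; the exact output format of drtprod pgurl is also an assumption:

    #!/usr/bin/env bash
    # Hypothetical helper: derive the target CRDB cluster from the workload
    # cluster name and fetch its pgurl instead of hardcoding 10.188.0.60.
    WORKLOAD_CLUSTER=$1                    # e.g. workload-large or workload-chaos
    CRDB_CLUSTER="drt-${WORKLOAD_CLUSTER#workload-}"

    # Constants, per the issue.
    WAREHOUSES=150000
    PARTITIONS=3
    REGIONS=northamerica-northeast2,us-east5,us-central1

    # pgurl for node 1; strip any quoting around the printed URL (the exact
    # quoting used by drtprod pgurl is an assumption here).
    PGURL=$(./scripts/drtprod pgurl "${CRDB_CLUSTER}":1 | tr -d "'")

    nohup ./workload init tpcc \
        --data-loader=IMPORT \
        --partitions=$PARTITIONS \
        --warehouses=$WAREHOUSES \
        --survival-goal region \
        --regions=$REGIONS \
        "$PGURL" &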
  2. Disable constraint validation: drtprod sql drt-large:1 -- -e "SET CLUSTER SETTING bulkio.import.constraint_validation.unsafe.enabled=false" (JIRA ticket: https://cockroachlabs.atlassian.net/browse/CRDB-40145)
  3. Run the tpcc workload: fix the attached script for running the tpcc workload, especially the part that creates the workload run scripts on each node (see the note after the script):

    
    for NODE in $(seq 1 $NUM_REGIONS)
    do
    NODE_OFFSET=$(( (NODE - 1) * NODES_PER_REGION + 1 ))
    LAST_NODE_IN_REGION=$(( NODE_OFFSET + NODES_PER_REGION - 1 ))
    
    # Since we're running a number of workers much smaller than the number of
    # warehouses, we have to do some strange math here. Workers are assigned to
    # warehouses in order (i.e. worker 1 will target warehouse 1). The
    # complication is that when we're partitioning the workload such that workers in
    # region 1 should only target warehouses in region 1, the workload binary will
    # not assign a worker if the warehouse is not in the specified region. As a
    # result, we must pass in a number of workers that is large enough to allow
    #  us to reach the specified region, and then add the actual number of workers
    #  we want to run.
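    # Worked example (assuming TPCC_WAREHOUSES=150000, NUM_REGIONS=3, and a
    # hypothetical NUM_WORKERS=1000): region 1 gets 0*50000+1000 = 1000
    # workers, region 2 gets 1*50000+1000 = 51000, and region 3 gets
    # 2*50000+1000 = 101000, so the worker count always reaches into the
    # target region's warehouse range.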
    EFFECTIVE_NUM_WORKERS=$(( TPCC_WAREHOUSES / NUM_REGIONS * (NODE - 1) + NUM_WORKERS ))
    
    PGURLS_REGION=$(./bin/roachprod pgurl --secure $CLUSTER:$NODE_OFFSET-$LAST_NODE_IN_REGION --external)
    
    cat <<EOF >/tmp/tpcc_run.sh
    #!/usr/bin/env bash

    j=0
    while true; do
      echo ">> Starting tpcc workload"
      ((j++))
      LOG=./tpcc_\$j.txt
      ./workload run tpcc \
        --ramp=10m \
        --conns=$NUM_CONNECTIONS \
        --workers=$EFFECTIVE_NUM_WORKERS \
        --warehouses=$TPCC_WAREHOUSES \
        --max-rate=$MAX_RATE \
        --duration=$RUN_DURATION \
        --wait=false \
        --partitions=3 \
        --partition-affinity=$(($NODE-1)) \
        --tolerate-errors \
        $PGURLS_REGION \
        --survival-goal region \
        --regions=$REGIONS | tee \$LOG
    done
    EOF
    done
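
Note that as written, the loop only generates /tmp/tpcc_run.sh on the machine driving the setup, overwriting it on each iteration; nothing copies it to the workload nodes. A hedged sketch of the missing distribution step, to run inside the loop after the heredoc, assuming one workload node per region in a cluster named $WORKLOAD_CLUSTER and standard roachprod put/ssh semantics:

    # Hypothetical: ship this region's script to its workload node and start it.
    # $WORKLOAD_CLUSTER and the one-node-per-region layout are assumptions.
    ./bin/roachprod put $WORKLOAD_CLUSTER:$NODE /tmp/tpcc_run.sh tpcc_run.sh
    ./bin/roachprod ssh $WORKLOAD_CLUSTER:$NODE -- \
        "chmod +x tpcc_run.sh && nohup ./tpcc_run.sh > tpcc_run.out 2>&1 &"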


[drt-setup-spot-chaos.txt](https://github.com/user-attachments/files/16123784/drt-setup-spot-chaos.txt)

Jira issue: CRDB-40147
vidit-bhat commented 3 months ago

Point 3 under drt-large is not needed separately. Could go into the setup command though.