cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

YCSB workload init CLI fails while the import job continues and finishes successfully #94335

Open eivanov89 opened 1 year ago

eivanov89 commented 1 year ago

I run the following command, and it fails with an error after about 10 minutes:

~/cockroach/cockroach workload init ycsb --data-loader=IMPORT --drop --insert-count 300000000 'postgresql://root@localhost:26257?sslmode=disable' --concurrency 512
I221226 17:42:14.197406 1 ccl/workloadccl/fixture.go:318 [-] 1 starting import of 1 tables
Error: importing fixture: importing table usertable: pq: relation "usertable" is offline: importing

In the web UI I can see that the import job keeps running and, after some more time (an extra 15 minutes in my case), finishes successfully. This happens in 22.2.0 and did not happen in the previous version.
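For anyone reproducing this, the job state can also be checked from the command line instead of the web UI. The following is a minimal sketch, assuming the `cockroach` binary is reachable at the path in `COCKROACH_BIN` and the cluster at `DB_URL`. Both variables and both function names are hypothetical helpers; only `cockroach sql` and `SHOW JOBS` come from CockroachDB itself.

```shell
#!/bin/sh
# Sketch: list recent IMPORT jobs after `workload init` exits with an error.
# COCKROACH_BIN, DB_URL and both function names are hypothetical helpers;
# `cockroach sql` and SHOW JOBS are the only CockroachDB pieces used.
COCKROACH_BIN="${COCKROACH_BIN:-./cockroach}"
DB_URL="${DB_URL:-postgresql://root@localhost:26257?sslmode=disable}"

# Prints the SQL that selects the five most recently created IMPORT jobs.
import_jobs_query() {
  printf '%s' "SELECT job_id, status, fraction_completed, error FROM [SHOW JOBS] WHERE job_type = 'IMPORT' ORDER BY created DESC LIMIT 5"
}

# Runs the query through the cockroach SQL shell and prints CSV rows.
show_import_jobs() {
  "$COCKROACH_BIN" sql --url "$DB_URL" --format=csv -e "$(import_jobs_query)"
}
```

Calling `show_import_jobs` right after the CLI error would be expected to list the job as still `running` if the behavior matches what the web UI shows.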

Jira issue: CRDB-22838

blathers-crl[bot] commented 1 year ago

Hello, I am Blathers. I am here to help you get the issue triaged.

It looks like you have not filled out the issue in the format of any of our templates. To best assist you, we advise you to use one of these templates.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/test-eng

jayshrivastava commented 1 year ago

@eivanov89 I added a tag so this issue gets triaged by the appropriate team.

jayshrivastava commented 1 year ago

I'm not sure which team owns workloads. I tagged test eng for now. Please feel free to tag a more appropriate team :) @cockroachdb/test-eng

srosenberg commented 1 year ago

Judging by the error message, it appears to be a transient (cluster) issue during the load. As a sanity check, I tried to reproduce it, but the load completed successfully. @eivanov89 Is this something that's happened sporadically or are you able to reproduce it?

Create a 15-node cluster, running 22.2.0

roachprod create -n 15 --gce-machine-type n2-standard-32 stan-test
roachprod stage stan-test release v22.2.0
roachprod start stan-test --store-count 1

Run the workload from n15

roachprod ssh stan-test:15

./cockroach workload init ycsb --data-loader=IMPORT --drop --insert-count 300000000 'postgresql://root@localhost:26257?sslmode=disable' --concurrency 512
I221227 22:45:16.531737 1 ccl/workloadccl/fixture.go:318  [-] 1  starting import of 1 tables
I221227 23:28:27.630623 37 ccl/workloadccl/fixture.go:481  [-] 2  imported 388 GiB in usertable table (300003400 rows, 0 index entries, took 43m11.075598062s, 153.30 MiB/s)
I221227 23:28:27.630877 1 ccl/workloadccl/fixture.go:326  [-] 3  imported 388 GiB bytes in 1 tables (took 43m11.099031787s, 153.30 MiB/s)

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/disaster-recovery

eivanov89 commented 1 year ago

> Is this something that's happened sporadically or are you able to reproduce it?

@srosenberg, it happens pretty often, but not every time.

srosenberg commented 1 year ago

> > Is this something that's happened sporadically or are you able to reproduce it?

> @srosenberg, it happens pretty often, but not every time.

@eivanov89 Are there any other workloads (including another instance of ycsb) running concurrently? Would you be able to describe how and where you're running the workload(s) to help us reproduce the issue?

eivanov89 commented 1 year ago

> > > Is this something that's happened sporadically or are you able to reproduce it?

> > @srosenberg, it happens pretty often, but not every time.

> @eivanov89 Are there any other workloads (including another instance of ycsb) running concurrently? Would you be able to describe how and where you're running the workload(s) to help us reproduce the issue?

@srosenberg, during the init (load) phase I run only a single instance of ycsb. I have a bare-metal cluster of 8 nodes and a separate server that runs ycsb init and hosts an HAProxy instance. There is no other concurrent activity at all. On the previous version of cockroach I ran ycsb many times without this issue; it started happening only after I switched to the newer version. I will try to check the logs.

eivanov89 commented 1 year ago

It is still reproducible on recent 22.2.6.

srosenberg commented 1 year ago

> It is still reproducible on recent 22.2.6.

Thanks for reporting! Unfortunately, we can't reproduce it in our environment, and we run many different ycsb workloads nightly against all the supported release branches. Is there any chance you could share a gist of your HAProxy configuration and as much of the rest of your configuration as possible? The closer we can get to reproducing your environment, the higher the chance of us reproducing the error and fixing it.

eivanov89 commented 1 year ago

Sorry, I should have provided more information including logs.

In https://github.com/cockroachdb/cockroach/issues/98438 I describe my configuration.

Here is my HAProxy config:

global
  maxconn 250000

defaults
    mode                tcp
    maxconn             250000

    retries             2
    timeout connect     5s

    timeout client      10m
    timeout server      10m

    option              clitcpka

listen psql
    bind :26257
    mode tcp
    balance roundrobin
    option httpchk GET /health?ready=1

    # please note that I removed 'check port', though in previous setup
    # it was here and same issue was happening
    server cockroach1 vla-dev04-000:26257
    server cockroach2 vla-dev04-002:26258
    server cockroach3 vla-dev04-000:26259
    server cockroach4 vla-dev04-000:26258
    server cockroach5 vla-dev04-001:26260
    server cockroach6 vla-dev04-001:26257
    server cockroach7 vla-dev04-002:26257
    server cockroach8 vla-dev04-002:26259
    server cockroach9 vla-dev04-001:26259
    server cockroach10 vla-dev04-002:26260
    server cockroach11 vla-dev04-001:26258
    server cockroach12 vla-dev04-000:26260

I last reproduced it yesterday:

Fri 10 Mar 2023 10:40:28 PM UTC: Loading data: /home/eivanov89/cockroach/cockroach workload init ycsb --data-loader=IMPORT --drop --insert-count 2000000000 --insert-hash=false postgresql://root@localhost:26257?sslmode=disable --concurrency 512 --workload a
I230310 22:40:30.762877 1 ccl/workloadccl/fixture.go:318  [-] 1  starting import of 1 tables
Error: importing fixture: importing table usertable: pq: relation "usertable" is offline: importing

I attach the logs, which contain the import job ID. Please also note that it happens regularly on big imports (300M or 2B rows), while I have never seen it on small ones.
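Until the cause is understood, a load script could treat this specific error as non-fatal and wait for the job itself to finish. The following is a rough sketch of that workaround, not a fix. It assumes the most recently created IMPORT job is the one started by `workload init`, and that `COCKROACH_BIN` and `DB_URL` point at the binary and cluster. All function and variable names are hypothetical; only `cockroach sql`, `SHOW JOBS`, and the error text come from this issue or CockroachDB.

```shell
#!/bin/sh
# Sketch of a workaround: detect the reported error, then poll the job.
# All names below are hypothetical helpers, not part of CockroachDB.
COCKROACH_BIN="${COCKROACH_BIN:-./cockroach}"
DB_URL="${DB_URL:-postgresql://root@localhost:26257?sslmode=disable}"

# Returns 0 when the text on stdin contains the error reported in this issue.
is_offline_importing_error() {
  grep -q 'is offline: importing'
}

# Prints the status of the most recently created IMPORT job.
# With --format=csv the first output line is the header, so keep the last line.
latest_import_status() {
  "$COCKROACH_BIN" sql --url "$DB_URL" --format=csv -e \
    "SELECT status FROM [SHOW JOBS] WHERE job_type = 'IMPORT' ORDER BY created DESC LIMIT 1" \
    | tail -n 1
}

# Polls until the job succeeds (return 0), ends otherwise (return 1),
# or the attempt budget is exhausted (return 2).
# $1 = seconds between polls (default 30), $2 = max attempts (default 120).
wait_for_import() {
  interval="${1:-30}"
  attempts="${2:-120}"
  while [ "$attempts" -gt 0 ]; do
    status="$(latest_import_status)"
    case "$status" in
      succeeded) return 0 ;;
      failed|canceled) echo "import ended with status: $status" >&2; return 1 ;;
    esac
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  echo "gave up waiting for import" >&2
  return 2
}
```

A wrapper could capture the `workload init` output, pipe it through `is_offline_importing_error`, and call `wait_for_import` only when that check matches, so that any other failure still aborts the load.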

cockroach_import_issue_full.gz