eivanov89 opened this issue 1 year ago (status: Open)
Hello, I am Blathers. I am here to help you get the issue triaged.
It looks like you have not filled out the issue in the format of any of our templates. To best assist you, we advise you to use one of these templates.
I have CC'd a few people who may be able to assist you:
If we have not gotten back to your issue within a few business days, you can try the following:
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
cc @cockroachdb/test-eng
@eivanov89 I added a tag so this issue gets triaged by the appropriate team.
I'm not sure which team owns workloads, so I tagged test-eng for now. Please feel free to tag a more appropriate team :) @cockroachdb/test-eng
Judging by the error message, it appears to be a transient (cluster) issue during the load. As a sanity check, I tried to reproduce it, but the load completed successfully. @eivanov89 Is this something that's happened sporadically or are you able to reproduce it?
roachprod create -n 15 --gce-machine-type n2-standard-32 stan-test
roachprod stage stan-test release v22.2.0
roachprod start stan-test --store-count 1
Then, from the workload node (n15):
roachprod ssh stan-test:15
./cockroach workload init ycsb --data-loader=IMPORT --drop --insert-count 300000000 'postgresql://root@localhost:26257?sslmode=disable' --concurrency 512
I221227 22:45:16.531737 1 ccl/workloadccl/fixture.go:318 [-] 1 starting import of 1 tables
I221227 23:28:27.630623 37 ccl/workloadccl/fixture.go:481 [-] 2 imported 388 GiB in usertable table (300003400 rows, 0 index entries, took 43m11.075598062s, 153.30 MiB/s)
I221227 23:28:27.630877 1 ccl/workloadccl/fixture.go:326 [-] 3 imported 388 GiB bytes in 1 tables (took 43m11.099031787s, 153.30 MiB/s)
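Since the failure is sporadic, one way to hunt for it is to loop the init command and stop at the first failure. A minimal sketch (the helper name and retry count are mine, not part of this thread):

```python
import subprocess
import sys

def run_until_failure(cmd, runs=10):
    """Run `cmd` up to `runs` times; return the 1-based attempt number of the
    first non-zero exit, or 0 if every run succeeded."""
    for attempt in range(1, runs + 1):
        print(f"attempt {attempt}", file=sys.stderr)
        if subprocess.run(cmd).returncode != 0:
            return attempt
    return 0

# Hypothetical usage with the repro command from above:
# run_until_failure(
#     ["./cockroach", "workload", "init", "ycsb", "--data-loader=IMPORT",
#      "--drop", "--insert-count", "300000000", "--concurrency", "512",
#      "postgresql://root@localhost:26257?sslmode=disable"])
```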
cc @cockroachdb/disaster-recovery
Is this something that's happened sporadically or are you able to reproduce it?
@srosenberg, it happens pretty often, but not every time.
@eivanov89 Are there any other workloads (including another instance of ycsb) running concurrently? Would you be able to describe how/where you're running the workload(s) to help us reproduce the issue?
@srosenberg, during the init (load) phase I run the only instance of ycsb. I have a bare-metal cluster of 8 nodes, plus a separate server to run ycsb init, which has an HAProxy instance running. There is no other concurrent activity at all. Also, on a previous version of cockroach I ran ycsb many times and didn't have this issue; it only started to happen after switching to the newer version. I will try to check the logs.
It is still reproducible on recent 22.2.6.
Thanks for reporting! Unfortunately, we can't reproduce it in our environment, and we run many different ycsb workloads nightly against all the supported release branches. Is there any chance you could share a gist of your HAProxy configuration and, as much as possible, the rest of your setup? The closer we can get to reproducing your environment, the higher the chance of us reproducing the error and fixing it.
Sorry, I should have provided more information including logs.
In https://github.com/cockroachdb/cockroach/issues/98438 I describe my configuration.
Here is my HAProxy config:
global
    maxconn 250000

defaults
    mode tcp
    maxconn 250000
    retries 2
    timeout connect 5s
    timeout client 10m
    timeout server 10m
    option clitcpka

listen psql
    bind :26257
    mode tcp
    balance roundrobin
    option httpchk GET /health?ready=1
    # please note that I removed 'check port', though in the previous setup
    # it was here and the same issue was happening
    server cockroach1 vla-dev04-000:26257
    server cockroach2 vla-dev04-002:26258
    server cockroach3 vla-dev04-000:26259
    server cockroach4 vla-dev04-000:26258
    server cockroach5 vla-dev04-001:26260
    server cockroach6 vla-dev04-001:26257
    server cockroach7 vla-dev04-002:26257
    server cockroach8 vla-dev04-002:26259
    server cockroach9 vla-dev04-001:26259
    server cockroach10 vla-dev04-002:26260
    server cockroach11 vla-dev04-001:26258
    server cockroach12 vla-dev04-000:26260
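For comparison, the configuration that `cockroach gen haproxy` emits enables active health checking on each backend, so a server line would look roughly like this (a sketch; port 8080 is CockroachDB's default HTTP port and is an assumption about this cluster):

```
server cockroach1 vla-dev04-000:26257 check port 8080
```

Note that without `check` on the server lines, the `option httpchk` directive has no effect and HAProxy keeps routing to a backend even when `/health?ready=1` would report it not ready, which may be relevant to transient errors during the load.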
Last time I reproduced it yesterday:
Fri 10 Mar 2023 10:40:28 PM UTC: Loading data: /home/eivanov89/cockroach/cockroach workload init ycsb --data-loader=IMPORT --drop --insert-count 2000000000 --insert-hash=false postgresql://root@localhost:26257?sslmode=disable --concurrency 512 --workload a
I230310 22:40:30.762877 1 ccl/workloadccl/fixture.go:318 [-] 1 starting import of 1 tables
Error: importing fixture: importing table usertable: pq: relation "usertable" is offline: importing
I attach the logs, which contain the import job ID. Also please note that it consistently happens on big imports like 300M or 2B rows, while I never see it on small ones.
I run the command and it finishes with an error in around 10 minutes:
~/cockroach/cockroach workload init ycsb --data-loader=IMPORT --drop --insert-count 300000000 'postgresql://root@localhost:26257?sslmode=disable' --concurrency 512
I221226 17:42:14.197406 1 ccl/workloadccl/fixture.go:318 [-] 1 starting import of 1 tables
Error: importing fixture: importing table usertable: pq: relation "usertable" is offline: importing
In the web UI I see that the import job continues and, after some more time (in my case an extra 15 minutes), successfully ends. This happens in 22.2.0 and didn't happen in the previous version.
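Given that the error appears to be client-side while the IMPORT job keeps running to completion, a possible workaround is to poll the job status instead of treating the error as fatal. A sketch under that assumption; the `poll` callable would wrap a SQL query such as `SELECT status FROM [SHOW JOBS] WHERE job_type = 'IMPORT' ORDER BY created DESC LIMIT 1`:

```python
import time

# Terminal job states as reported by CockroachDB's SHOW JOBS.
TERMINAL = {"succeeded", "failed", "canceled"}

def wait_for_import(poll, timeout_s=3600, interval_s=30):
    """Call `poll()` (which returns the job's current status string) until the
    job reaches a terminal state; return that status, or raise on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = poll()
        if status in TERMINAL:
            return status
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError("import job did not reach a terminal state")
        time.sleep(interval_s)
```

This only papers over the client-side error, of course; the underlying "relation is offline" behavior still deserves a fix.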
Jira issue: CRDB-22838