Looks weird, thanks for reporting. Could you set S3_DEBUG=1 or add
s3:
debug: true
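e.g. as a one-off via the environment (a sketch; <backup_name> is a placeholder):
S3_DEBUG=1 clickhouse-backup upload <backup_name>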
Moreover, if you run clickhouse-backup server, could you add
api:
enable_pprof: true
listen: 0.0.0.0:7171
and try to get a pprof profile when it gets stuck?
go tool pprof http://clickhouse-backup-ip:7171/debug/pprof/profile?seconds=30 > cpuprofile.log
and share cpuprofile.log?
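As a quick sanity check that pprof is enabled, the standard Go net/http/pprof index should respond (assuming clickhouse-backup mounts it at the usual path):
curl http://clickhouse-backup-ip:7171/debug/pprof/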
Ok, here is a stuck upload. See the attached files.
clickhouse-backup list remote | grep 08:31
clickhouse_28-02-2022_08:31:19 ??? 01/01/0001 00:00:00 remote broken (can't stat metadata.json)
curl clickhouse:7171/backup/actions
{"command":"create clickhouse_28-02-2022_08:31:19","status":"success","start":"2022-02-28 08:31:19","finish":"2022-02-28 08:31:26"}
{"command":"upload clickhouse_28-02-2022_08:31:19","status":"in progress","start":"2022-02-28 08:31:27"}
clickhuse-backup.log.zip pprof.clickhouse-backup.samples.cpu.003.pb.gz
@vanyasvl you are the man! Thanks a lot, I will try to figure it out.
hmm, the pb.gz contains only 2 samples
are you sure you captured 30 seconds?
maybe strace -p $(pgrep clickhouse-backup) will help?
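If you do, something like this keeps the whole trace in a file (a sketch; -f follows all threads, which matters for a multithreaded Go binary):
strace -f -p $(pgrep clickhouse-backup) -o strace.log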
Ok, I'll try to reproduce it tomorrow and make a new capture and strace.
Thanks a lot, your debugging is really helpful.
go tool pprof "http://clickhouse:7171/debug/pprof/profile?seconds=300"
Fetching profile over HTTP from http://clickhouse:7171/debug/pprof/profile?seconds=300
Saved profile in pprof.clickhouse-backup.samples.cpu.008.pb.gz
File: clickhouse-backup
Type: cpu
Time: Mar 2, 2022 at 9:15am (EET)
Duration: 300s, Total samples = 20ms (0.0067%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) exit
Could you try (the CPU profile has almost no samples, so the process looks blocked rather than busy; a goroutine profile should show where it waits):
go tool pprof "http://clickhouse:7171/debug/pprof/goroutine?seconds=300"
According to the shared strace.zip,
let's run strace -p $(pgrep clickhouse-backup)
again when it's stuck,
find something like
write(5, "\0", 1) = 1
inside the strace output, and run
lsof -d <number_of_file_descriptor_from_strace>
We need to figure out why the golang goroutines are "stuck".
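An alternative shortcut (not from the thread) that resolves the descriptor for this process only, instead of listing fd 5 for every process on the host:
ls -l /proc/$(pgrep clickhouse-backup)/fd/5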
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5, "\0", 1) = 1
# lsof -d 5
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
systemd 1 root 5u a_inode 0,14 0 13847 [signalfd]
systemd-j 789 root 5u unix 0xffff8900546f0400 0t0 33749 /run/systemd/journal/socket type=DGRAM
systemd-u 837 root 5u unix 0xffff89005707b800 0t0 30886 type=DGRAM
systemd-n 953 systemd-network 5u a_inode 0,14 0 13847 [timerfd]
systemd-r 1148 systemd-resolve 5u a_inode 0,14 0 13847 [signalfd]
systemd-t 1149 systemd-timesync 5u a_inode 0,14 0 13847 [signalfd]
accounts- 1173 root 5u unix 0xffff88f85404dc00 0t0 65604 type=STREAM
dbus-daem 1175 messagebus 5u netlink 0t0 49597 AUDIT
irqbalanc 1182 root 5u unix 0xffff88f856b40c00 0t0 44071 /run/irqbalance//irqbalance1182.sock type=STREAM
networkd- 1183 root 5u unix 0xffff88f853df2c00 0t0 54293 type=STREAM
systemd-l 1192 root 5u a_inode 0,14 0 13847 [signalfd]
polkitd 1211 root 5u a_inode 0,14 0 13847 [eventfd]
container 81501 root 5u unix 0xffff88f840e9b800 0t0 2525210 /run/containerd/containerd.sock type=STREAM
dockerd 81995 root 5u a_inode 0,14 0 13847 [eventpoll]
docker 83587 root 5u unix 0xffff88f7f24a3800 0t0 2503563 type=STREAM
container 83703 root 5u a_inode 0,14 0 13847 [eventpoll]
dmesg_exp 83769 root 5u IPv4 2528308 0t0 TCP localhost:29091 (LISTEN)
docker 102243 root 5u unix 0xffff88f70a871400 0t0 3244093 type=STREAM
container 102270 root 5u a_inode 0,14 0 13847 [eventpoll]
node_expo 102291 nobody 5r FIFO 0,13 0t0 3235283 pipe
rsyslogd 136793 syslog 5r REG 0,5 0 4026532095 /proc/kmsg
atopacctd 396335 root 5u netlink 0t0 19273936 GENERIC
clckhouse 620042 clickhouse 5w REG 253,1 6355466 60820183 /var/log/clickhouse-server/clickhouse-server.log
clickhous 620052 clickhouse 5w REG 253,1 6355466 60820183 /var/log/clickhouse-server/clickhouse-server.log
systemd 653940 sid 5u a_inode 0,14 0 13847 [signalfd]
clickhous 752490 sid 5u IPv4 50377592 0t0 TCP localhost:53754->localhost:9000 (CLOSE_WAIT)
docker 755293 root 5u unix 0xffff88fd95f24000 0t0 54323128 type=STREAM
container 755322 root 5u a_inode 0,14 0 13847 [eventpoll]
docker 766980 root 5u unix 0xffff88f7f0fce800 0t0 72400250 type=STREAM
container 767008 root 5u a_inode 0,14 0 13847 [eventpoll]
nginx 767030 root 5w FIFO 0,13 0t0 72414436 pipe
nginx 767072 systemd-timesync 5w FIFO 0,13 0t0 72414436 pipe
journalct 970631 root 5r REG 253,1 134217728 60820337 /var/log/journal/a9e3b97df44946de90b17a91111b9cc1/system@531c7752133f4d57b137be6b69f4a8f1-0000000000199d60-0005d90ff9001522.journal
clickhous 1022448 clickhouse 5w FIFO 0,13 0t0 115363772 pipe
atop 1026739 root 5u a_inode 0,14 0 13847 [perf_event]
sshd 1029361 root 5u CHR 5,2 0t0 89 /dev/ptmx
sshd 1029498 sid 5u unix 0xffff88f00dd76000 0t0 116404217 type=STREAM
sshd 1030829 root 5u CHR 5,2 0t0 89 /dev/ptmx
sshd 1030948 sid 5u unix 0xffff88f101b39400 0t0 116581885 type=STREAM
lsof 1030981 root 5w FIFO 0,13 0t0 116581020 pipe
You can message me on Telegram (@vanyasvl) for realtime troubleshooting.
And strace with the -y flag:
strace -p 1022448 -y
strace: Process 1022448 attached
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5<pipe:[115363772]>, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5<pipe:[115363772]>, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5<pipe:[115363772]>, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5<pipe:[115363772]>, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5<pipe:[115363772]>, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5<pipe:[115363772]>, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(5<pipe:[115363772]>, "\0", 1) = 1
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1e04950, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(10<socket:[115423663]>, "HTTP/1.1 200 OK\r\nContent-Encodin"..., 2443) = 2443
futex(0xc000680150, FUTEX_WAKE_PRIVATE, 1) = 1
read(10<socket:[115423663]>, 0xc0003ca000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
And active TCP sockets in the ss output.
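(The exact ss invocation isn't in the thread; something like this would show that process's TCP sockets:)
ss -tnp | grep clickhouse-backup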
Looks like there is a strange pipe; I don't know what this pipe means or why it looks broken:
clickhous 1022448 clickhouse 5w FIFO 0,13 0t0 115363772 pipe
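(For context, not a conclusion from this thread: Go's runtime wakes its network poller by writing a single zero byte to an internal pipe, so repeated write(5, "\0", 1) = 1 calls can be normal scheduler activity rather than the hang itself.)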
We see the same situation. Our config:
general:
remote_storage: s3
max_file_size: 1073741824
disable_progress_bar: true
backups_to_keep_local: 6
backups_to_keep_remote: 6
log_level: info
allow_empty_backups: false
download_concurrency: 12
upload_concurrency: 12
restore_schema_on_cluster: ""
upload_by_part: true
download_by_part: true
clickhouse:
username: gc_rw
password: *******
host: 127.0.0.1
port: 9000
disk_mapping: {}
skip_tables:
- system.*
timeout: 5m
freeze_by_part: false
secure: false
skip_verify: false
sync_replicated_tables: true
log_sql_queries: false
config_dir: /etc/clickhouse-server/
restart_command: systemctl restart clickhouse-server
ignore_not_exists_error_during_freeze: true
debug: false
s3:
access_key: *****
secret_key: ******
bucket: cephs3-front-1
endpoint: ***/clickhouse-backup
region: default
acl: private
assume_role_arn: ""
force_path_style: false
path: ch-1
disable_ssl: true
compression_level: 1
compression_format: zstd
sse: ""
disable_cert_verification: false
storage_class: STANDARD
concurrency: 1
part_size: 0
debug: false
Hey guys, we have faced the same situation, tried 2 times... it gets stuck at the end. Any workaround so far?
Our config:
general:
remote_storage: s3
max_file_size: 1073741824
disable_progress_bar: true
backups_to_keep_local: 0
backups_to_keep_remote: 0
log_level: info
allow_empty_backups: false
download_concurrency: 16
upload_concurrency: 16
restore_schema_on_cluster: ""
upload_by_part: true
download_by_part: true
clickhouse:
username: default
password: ""
host: localhost
port: 9000
disk_mapping: {}
skip_tables:
- system.*
- INFORMATION_SCHEMA.*
- information_schema.*
timeout: 5m
freeze_by_part: false
secure: false
skip_verify: false
sync_replicated_tables: false
log_sql_queries: false
config_dir: /etc/clickhouse-server/
restart_command: systemctl restart clickhouse-server
ignore_not_exists_error_during_freeze: true
debug: false
s3:
access_key: ***
secret_key: ***
bucket: ***
endpoint: fra1.digitaloceanspaces.com
region: us-east-1
acl: private
assume_role_arn: ""
force_path_style: false
path: /backup
disable_ssl: false
compression_level: 1
compression_format: tar
sse: ""
disable_cert_verification: false
storage_class: STANDARD
concurrency: 2
part_size: 0
debug: false
@dkokot could you remove
general:
max_file_size: 1073741824
from your config and try the altinity/clickhouse-backup:latest docker image?
That's the default value; which value should I use to override it?
Currently I'm running it like this:
docker run --rm -it --network host -v "/var/lib/clickhouse:/var/lib/clickhouse" \
-e REMOTE_STORAGE="s3" \
-e DISABLE_PROGRESS_BAR="true" \
-e S3_ENDPOINT="fra1.digitaloceanspaces.com" \
-e S3_BUCKET="***" \
-e S3_ACCESS_KEY="***" \
-e S3_SECRET_KEY="***" \
-e S3_PATH="/backup" \
-e S3_CONCURRENCY="2" \
alexakulov/clickhouse-backup
I also tried master and version 1.3.0.
Tried to compile from source (using the master branch) with no luck, same freeze. My DB is quite big (around 50k files in the backup folder, 950 GB). Can I somehow help to debug?
Hello there
After carefully checking the DigitalOcean Spaces limitations, I found the main reason for the freeze: their API doesn't fully support ListObjectsV2, which is used when querying S3 storage.
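A quick way to test this against an endpoint (a sketch; assumes the aws CLI is installed and configured with the Spaces keys, and <bucket> is a placeholder):
aws s3api list-objects-v2 --bucket <bucket> --endpoint-url https://fra1.digitaloceanspaces.com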
Anyway, I found version 1.2.4 from Altinity more stable and am now using it with the FTP backend from Hetzner. This setup works like a charm in our situation (2 replicated servers, around 1 TB of raw data in CH).
I also found the documentation on incremental backups really poor and will be happy to add some notes to the README soon. Hope it will be helpful for others.
BR, Dmytro
@dkokot thanks a lot for clarifying.
"I also found the documentation on incremental backups really poor and will be happy to add some notes to the README soon."
Which part of the documentation exactly did you miss? Did you see https://github.com/Altinity/clickhouse-backup/blob/master/Examples.md#how-do-incremental-backups-work-to-remote-storage?
Anyway, feel free to make a Pull Request; let's make clickhouse-backup better together.
Hello. I have an issue with the upload freezing to S3 on the new 1.3.1. Nothing interesting in the logs,
and no new messages in several hours after that; no network activity from clickhouse-backup, it's just stuck. 1.3.0 worked fine for upload.
If I enable the progress bar, it shows one file's upload progress.