Altinity / clickhouse-backup

Tool for easy backup and restore for ClickHouse® using object storage for backup files.
https://altinity.com
Other
1.28k stars 226 forks source link

Cannot restore backups: `error can't acquire semaphore during downloadTableData: context canceled` #442

Closed fzyzcjy closed 2 years ago

fzyzcjy commented 2 years ago

Logs:


2022/05/17 02:02:42.466903 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 02:02:42.588974 warn remote file backup extension 2022-05-16T19-00-08/shadow/yplusplus_prod_op_main/request_info/default_202204_286934_286938_1.tar.gz not equal with gzip
2022/05/17 02:02:42.621763 error one of Download go-routine return error: one of downloadTableData go-routine return error: handling file: /202204_28_97490_21423/remote_ip.bin: unexpected EOF
Slach commented 2 years ago

Could you try 1.4.0 and share LOG_LEVEL=debug clickhouse-backup download 2022-05-16T19-00-08 results?

fzyzcjy commented 2 years ago

Sure. I will do that and report it.

fzyzcjy commented 2 years ago

2022/05/17 09:13:58.048872 debug start download from 2022-05-16T19-00-08/shadow/yplusplus_prod_op_main/keycloak_log/default_202109_7_7_1014.tar.gz
2022/05/17 09:13:58.083370  warn remote file backup extension 2022-05-16T19-00-08/shadow/yplusplus_prod_op_main/keycloak_log/default_202109_7_7_1014.tar.gz not equal with gzip
2022/05/17 09:13:59.106082 debug finish download from 2022-05-16T19-00-08/shadow/yplusplus_prod_op_main/keycloak_log/default_202109_7_7_1014.tar.gz
2022/05/17 09:13:59.106156 debug start download from 2022-05-16T19-00-08/shadow/yplusplus_prod_op_main/keycloak_log/default_202202_17_17_61.tar.gz
2022/05/17 09:13:59.135966  warn remote file backup extension 2022-05-16T19-00-08/shadow/yplusplus_prod_op_main/keycloak_log/default_202202_17_17_61.tar.gz not equal with gzip
2022/05/17 09:13:59.144870 debug finish download from 2022-05-16T19-00-08/shadow/yplusplus_prod_op_main/keycloak_log/default_202202_17_17_61.tar.gz
2022/05/17 09:14:02.417926 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:14:02.417990 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:14:02.417996 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:14:02.418002 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:14:02.418012 error can't acquire semaphore during downloadTableData: context canceled

about 900000 lines of this ===> "error can't acquire semaphore during downloadTableData: context canceled"

...

2022/05/17 09:14:07.734238 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:14:07.734243 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:14:07.734247 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:14:07.734252 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:14:07.734259 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:14:07.734263 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:14:07.734316 debug start download from 2022-05-16T19-00-08/shadow/yplusplus_prod_op_main/request_info/default_202106_2_2_1.tar.gz
2022/05/17 09:14:07.764562  warn remote file backup extension 2022-05-16T19-00-08/shadow/yplusplus_prod_op_main/request_info/default_202106_2_2_1.tar.gz not equal with gzip
2022/05/17 09:14:09.481551 error can't acquire semaphore during Download: context canceled backup=2022-05-16T19-00-08 operation=download
2022/05/17 09:14:09.481617 debug start downloadTableData yplusplus_prod_in_main.request_info with concurrency=2 len(table.Files[...])=56
2022/05/17 09:14:09.481649 debug start download from 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202108_21_21_1.tar.gz
2022/05/17 09:14:09.482097 debug start download from 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202105_11_11_5906.tar.gz
2022/05/17 09:14:09.519474  warn remote file backup extension 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202108_21_21_1.tar.gz not equal with gzip

...

2022/05/17 09:14:23.690612 debug start download from 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202204_62265_267158_23809.tar.gz
2022/05/17 09:14:23.780185  warn remote file backup extension 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202204_62265_267158_23809.tar.gz not equal with gzip
2022/05/17 09:14:29.943696 debug finish download from 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202112_32_32_35752.tar.gz
2022/05/17 09:14:29.943773 debug start download from 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202205_455081_472207_7107.tar.gz
2022/05/17 09:14:29.998111  warn remote file backup extension 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202205_455081_472207_7107.tar.gz not equal with gzip
2022/05/17 09:14:32.323227 debug finish download from 2022-05-16T19-00-08/shadow/yplusplus_prod_op_main/keycloak_log/default_202111_12_12_7041.tar.gz
2022/05/17 09:14:32.323276 debug start                     operation=downloadDiffParts table=yplusplus_prod_op_main.keycloak_log
2022/05/17 09:14:32.323358  info done                      diff_parts=0 duration=0s operation=downloadDiffParts
2022/05/17 09:14:32.323379  info done                      backup=2022-05-16T19-00-08 duration=38.429s operation=download_data size=23.23MiB table=yplusplus_prod_op_main.keycloak_log
2022/05/17 09:14:32.798363 debug finish download from 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202205_455081_472207_7107.tar.gz
2022/05/17 09:14:32.798436 debug start download from 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202103_8_8_19779.tar.gz
2022/05/17 09:14:32.853390  warn remote file backup extension 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202103_8_8_19779.tar.gz not equal with gzip
2022/05/17 09:15:05.018054 debug finish download from 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202103_8_8_19779.tar.gz
2022/05/17 09:15:05.018118 debug start download from 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202107_17_17_1.tar.gz

...

2022/05/17 09:21:32.807784  warn remote file backup extension 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202203_42_42_8242.tar.gz not equal with gzip
2022/05/17 09:21:52.956816 debug finish download from 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202109_22_22_47339.tar.gz
2022/05/17 09:21:52.956906 debug start download from 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202202_38_38_10076.tar.gz
2022/05/17 09:21:53.018275  warn remote file backup extension 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202202_38_38_10076.tar.gz not equal with gzip
2022/05/17 09:22:21.367415 debug finish download from 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202203_42_42_8242.tar.gz
2022/05/17 09:22:27.134402 debug finish download from 2022-05-16T19-00-08/shadow/yplusplus_prod_in_main/request_info/default_202202_38_38_10076.tar.gz
2022/05/17 09:22:27.134445 debug start                     operation=downloadDiffParts table=yplusplus_prod_in_main.request_info
2022/05/17 09:22:27.134674  info done                      diff_parts=0 duration=0s operation=downloadDiffParts
2022/05/17 09:22:27.134696  info done                      backup=2022-05-16T19-00-08 duration=8m17.653s operation=download_data size=1.77GiB table=yplusplus_prod_in_main.request_info
2022/05/17 09:22:27.134731 error one of Download go-routine return error: one of downloadTableData go-routine return error: handling file: /202110_11_11_3243/event_time.bin: unexpected EOF
fzyzcjy commented 2 years ago

Do not know what is happening... Notice that, the whole backup is <2GB, so it should be quite small.

But indeed it is taking minutes to download (and finally failed):

bash-5.1# LOG_LEVEL=debug clickhouse-backup download 2022-05-16T19-00-08 > /tmp/a.log 2>&1
bash-5.1# echo $?
1
fzyzcjy commented 2 years ago

tried another file, same problem

bash-5.1# LOG_LEVEL=debug clickhouse-backup download 2022-05-15T19-00-07 > /tmp/a.log 2>&1
bash-5.1# echo $?
1

---
the log:

2022/05/17 09:27:33.680814 debug start download from 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/static_linker_log/default_202203_25_25_20053.tar.gz
2022/05/17 09:27:33.952826  warn remote file backup extension 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/static_linker_log/default_202203_25_25_20053.tar.gz not equal with gzip
2022/05/17 09:27:41.034210 debug finish download from 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/static_linker_log/default_202203_25_25_20053.tar.gz
2022/05/17 09:27:41.034291 debug start download from 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/static_linker_log/default_202204_97491_157166_21905.tar.gz
2022/05/17 09:27:41.102117  warn remote file backup extension 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/static_linker_log/default_202204_97491_157166_21905.tar.gz not equal with gzip
2022/05/17 09:27:41.285526 debug finish download from 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/static_linker_log/default_202204_97491_157166_21905.tar.gz
2022/05/17 09:27:41.285578 debug start download from 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/static_linker_log/default_202108_8_8_0.tar.gz
2022/05/17 09:27:41.317110  warn remote file backup extension 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/static_linker_log/default_202108_8_8_0.tar.gz not equal with gzip
2022/05/17 09:27:41.350590 debug finish download from 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/static_linker_log/default_202108_8_8_0.tar.gz
2022/05/17 09:27:41.350640 debug start download from 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/static_linker_log/default_202112_17_17_28524.tar.gz
2022/05/17 09:27:41.421570  warn remote file backup extension 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/static_linker_log/default_202112_17_17_28524.tar.gz not equal with gzip
2022/05/17 09:27:45.976110 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:27:45.976133 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:27:45.976139 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:27:45.976144 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:27:45.976149 error can't acquire semaphore during downloadTableData: context canceled

...

2022/05/17 09:27:48.719482 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:27:48.719487 error can't acquire semaphore during downloadTableData: context canceled
2022/05/17 09:27:48.719528 debug start download from 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/request_info/default_202205_472829_472829_0.tar.gz
2022/05/17 09:27:48.750560  warn remote file backup extension 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/request_info/default_202205_472829_472829_0.tar.gz not equal with gzip
2022/05/17 09:27:50.115112 error can't acquire semaphore during Download: context canceled backup=2022-05-15T19-00-07 operation=download
2022/05/17 09:27:50.115195 debug start downloadTableData yplusplus_prod_op_main.keycloak_log with concurrency=2 len(table.Files[...])=20
2022/05/17 09:27:50.115277 debug start download from 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/keycloak_log/default_202108_5_5_3066.tar.gz
2022/05/17 09:27:50.115333 debug start download from 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/keycloak_log/default_202106_2_2_126.tar.gz
2022/05/17 09:27:50.161254  warn remote file backup extension 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/keycloak_log/default_202106_2_2_126.tar.gz not equal with gzip

...
2022/05/17 09:28:07.867117 debug start download from 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/static_linker_log/default_202204_28_97490_21423.tar.gz
2022/05/17 09:28:07.969864  warn remote file backup extension 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/static_linker_log/default_202204_28_97490_21423.tar.gz not equal with gzip
2022/05/17 09:28:13.469481 debug finish download from 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/static_linker_log/default_202204_28_97490_21423.tar.gz
2022/05/17 09:28:18.403449 debug finish download from 2022-05-15T19-00-07/shadow/yplusplus_prod_op_main/static_linker_log/default_202112_17_17_28524.tar.gz
2022/05/17 09:28:18.403498 debug start                     operation=downloadDiffParts table=yplusplus_prod_op_main.static_linker_log
2022/05/17 09:28:18.403610  info done                      diff_parts=0 duration=0s operation=downloadDiffParts
2022/05/17 09:28:18.403629  info done                      backup=2022-05-15T19-00-07 duration=1m44.843s operation=download_data size=266.13MiB table=yplusplus_prod_op_main.static_linker_log
2022/05/17 09:28:18.403667 error one of Download go-routine return error: one of downloadTableData go-routine return error: handling file: /202110_11_11_3243/trace_id.bin: unexpected EOF
fzyzcjy commented 2 years ago

Btw the error can't acquire semaphore during downloadTableData: context canceled is repeated 500000 times in this experiment...

Slach commented 2 years ago

@fzyzcjy could you try with https://github.com/AlexAkulov/clickhouse-backup/releases/tag/v1.3.2 ? it shall download backup which created with 1.4.0

fzyzcjy commented 2 years ago

I just upgraded from it... seems also fails

Slach commented 2 years ago

with the same error? do you have enough free space on destination disk?

Slach commented 2 years ago

could you share df -h on destination server?

fzyzcjy commented 2 years ago

I will check it and reply :)

fzyzcjy commented 2 years ago

a lot of free space indeed.


bash-5.1# df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  97.9G     41.2G     51.7G  44% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                     7.8G         0      7.8G   0% /sys/fs/cgroup
shm                      64.0M         0     64.0M   0% /dev/shm
/dev/vda1                39.1G     13.6G     23.5G  37% /etc/hosts
/dev/vda1                39.1G     13.6G     23.5G  37% /dev/termination-log
/dev/vdb1                97.9G     41.2G     51.7G  44% /etc/hostname
/dev/vdb1                97.9G     41.2G     51.7G  44% /etc/resolv.conf
/dev/longhorn/pvc-44371947-dbdb-4bf7-9379-49addf9aff9d
                         14.7G      2.8G     11.9G  19% /var/lib/clickhouse
tmpfs                     7.8G     12.0K      7.8G   0% /run/secrets/kubernetes.io/serviceaccount
tmpfs                     7.8G         0      7.8G   0% /proc/acpi
tmpfs                    64.0M         0     64.0M   0% /proc/kcore
tmpfs                    64.0M         0     64.0M   0% /proc/keys
tmpfs                    64.0M         0     64.0M   0% /proc/timer_list
tmpfs                    64.0M         0     64.0M   0% /proc/sched_debug
tmpfs                     7.8G         0      7.8G   0% /proc/scsi
tmpfs                     7.8G         0      7.8G   0% /sys/firmware
fzyzcjy commented 2 years ago

Observation: error every time is on different files. e.g.

the experiment before:

2022/05/17 09:22:27.134731 error one of Download go-routine return error: one of downloadTableData go-routine return error: handling file: /202110_11_11_3243/event_time.bin: unexpected EOF

try again

2022/05/19 09:47:43.831474 error one of Download go-routine return error: one of downloadTableData go-routine return error: handling file: /202202_20_20_19935/trace_id.bin: unexpected EOF

try again

2022/05/19 09:53:44.154727 error one of Download go-routine return error: one of downloadTableData go-routine return error: handling file: /202110_12_12_3242/remote_addr.bin: unexpected EOF

fzyzcjy commented 2 years ago

By the way, can we know more from the error logs? I come from Java/Kotlin/Dart/C++/Rust... world where an error usually follows with a stacktrace which is quite useful.

fzyzcjy commented 2 years ago

Could you please provide some hints on what to do? This makes the clickhouse-backup completely non-useable since it cannot restore data :(

fzyzcjy commented 2 years ago

Seem to see the cause: The downloading bandwidth was very low before (~1MB/s), so maybe something timeout (not sure). Switching to ~40MB/s bandwidth, it works.

Slach commented 2 years ago

So, the root cause slow bandwidth?

Which S3 backup do you use ? AWS S3 ? or some custom backend like minio, ceph, swift?

could you also switch clickhouse-backup print-config again

I need ensure we have timeout on S3 but all timeout shall pass to high level and after any error download shall stop

fzyzcjy commented 2 years ago

So, the root cause slow bandwidth?

I guess so (not sure)

Which S3 backup do you use ? AWS S3 ? or some custom backend like minio, ceph, swift?

aws-s3-compatible storage, a provider on a small cloud.

could you also switch clickhouse-backup print-config again

bash-5.1# clickhouse-backup print-config
general:
  remote_storage: s3
  max_file_size: 0
  disable_progress_bar: true
  backups_to_keep_local: 0
  backups_to_keep_remote: 0
  log_level: info
  allow_empty_backups: false
  download_concurrency: 2
  upload_concurrency: 2
  restore_schema_on_cluster: ""
  upload_by_part: true
  download_by_part: true
clickhouse:
  username: default
  password: ..
  host: clickhouse
  port: 9000
  disk_mapping: {}
  skip_tables:
  - system.*
  - tutorial.*
  timeout: 5m
  freeze_by_part: false
  secure: false
  skip_verify: false
  sync_replicated_tables: false
  log_sql_queries: true
  config_dir: /etc/clickhouse-server/
  restart_command: systemctl restart clickhouse-server
  ignore_not_exists_error_during_freeze: true
  tls_key: ""
  tls_cert: ""
  tls_ca: ""
  debug: false
s3:
  access_key: ..
  secret_key: ..
  bucket: ..
  endpoint: ..
  region: ..
  acl: private
  assume_role_arn: ""
  force_path_style: false
  path: clickhouse
  disable_ssl: false
  compression_level: 1
  compression_format: gzip
  sse: ""
  disable_cert_verification: false
  storage_class: STANDARD
  concurrency: 1
  part_size: 0
  max_parts_count: 10000
  allow_multipart_download: false
  debug: false
gcs:
  credentials_file: ""
  credentials_json: ""
  bucket: ""
  path: ""
  compression_level: 1
  compression_format: tar
  debug: false
  endpoint: ""
cos:
  url: ""
  timeout: 2m
  secret_id: ""
  secret_key: ""
  path: ""
  compression_format: tar
  compression_level: 1
  debug: false
api:
  listen: 0.0.0.0:7171
  enable_metrics: true
  enable_pprof: false
  username: ""
  password: ""
  secure: false
  certificate_file: ""
  private_key_file: ""
  create_integration_tables: false
  allow_parallel: false
ftp:
  address: ""
  timeout: 2m
  username: ""
  password: ""
  tls: false
  path: ""
  compression_format: tar
  compression_level: 1
  concurrency: 2
  debug: false
sftp:
  address: ""
  port: 22
  username: ""
  password: ""
  key: ""
  path: ""
  compression_format: tar
  compression_level: 1
  concurrency: 1
  debug: false
azblob:
  endpoint_suffix: core.windows.net
  account_name: ""
  account_key: ""
  sas: ""
  use_managed_identity: false
  container: ""
  path: ""
  compression_level: 1
  compression_format: tar
  sse_key: ""
  buffer_size: 0
  buffer_count: 3
  max_parts_count: 0
Slach commented 2 years ago

issue still relevant for you?

could you try wih

s3:
 max_parts_count: 5000

in your backup config?

fzyzcjy commented 2 years ago

It works well now after the bandwidth is enlarged. Thanks for your replies :)

k0t3n commented 2 years ago

An error still exists, reproduced on multiple S3-like storages (including classic AWS) with a large table (~1000 partitions).

Steps to reproduce:

  1. Create large table with ~1000 partitions
  2. Backup to remote
  3. Create one more backup with --diff-from-remote
  4. Try to download the latest one

I didn't see sources, but I guess there is an issue with S3 files getting (not implemented pagination?)