Altinity / clickhouse-backup

Tool for easy backup and restore for ClickHouse® using object storage for backup files.
https://altinity.com

Why does backup duration sometimes stretch to hours, with no errors? #880

Closed: hueiyuan closed this issue 8 months ago

hueiyuan commented 8 months ago

Description

Hi, everyone. We found that our backup duration is sometimes very fast, while some backups are slow, taking up to 6 hours. During these backups we do not see any errors. Does anyone have an idea that could explain this?

For context: our table TTL keeps data within 1 day stored on local disk, and data older than 1 day (up to 30 days) is stored on AWS S3. clickhouse-backup runs as a sidecar in the clickhouse-server pod, with a resource request of 1 core / 1 GB and a limit of 2 cores / 2 GB.

Additional questions

Slach commented 8 months ago

do you have an object disk like s3 or azure?

could you share the output of

SELECT * FROM system.disks
SELECT * FROM system.storage_policies

?

According to the non-empty

s3:
  object_disk_path: tiered-backup

it looks like yes.

so this is not actually a "pause"; this is server-side CopyObject execution, which allows you to restore your data after DROP TABLE ... SYNC / DROP DATABASE ... SYNC

try to change /etc/clickhouse-backup/config.yml

general:
  log_level: debug

and share the logs

hueiyuan commented 8 months ago

@Slach Thanks for your assistance. The requested output is below:

* `SELECT * FROM system.disks`

Row 2:
──────
name:             s3_tier_cold
path:             /var/lib/clickhouse/disks/s3_tier_cold/
free_space:       18446744073709551615
total_space:      18446744073709551615
unreserved_space: 18446744073709551615
keep_free_space:  0
type:             s3
is_encrypted:     0
is_read_only:     0
is_write_once:    0
is_remote:        1
is_broken:        0
cache_path:

2 rows in set. Elapsed: 0.002 sec.


* `SELECT * FROM system.storage_policies`

Row 1:
──────
policy_name:                default
volume_name:                default
volume_priority:            1
disks:                      ['default']
volume_type:                JBOD
max_data_part_size:         0
move_factor:                0
prefer_not_to_merge:        0
perform_ttl_move_on_insert: 1
load_balancing:             ROUND_ROBIN

Row 2:
──────
policy_name:                move_from_local_disks_to_s3
volume_name:                cold
volume_priority:            1
disks:                      ['s3_tier_cold']
volume_type:                JBOD
max_data_part_size:         0
move_factor:                0.1
prefer_not_to_merge:        0
perform_ttl_move_on_insert: 1
load_balancing:             ROUND_ROBIN

Row 3:
──────
policy_name:                move_from_local_disks_to_s3
volume_name:                hot
volume_priority:            2
disks:                      ['default']
volume_type:                JBOD
max_data_part_size:         0
move_factor:                0.1
prefer_not_to_merge:        0
perform_ttl_move_on_insert: 1
load_balancing:             ROUND_ROBIN

3 rows in set. Elapsed: 0.003 sec.



* The path `tiered-backup` in AWS S3 is not empty and has objects.
Slach commented 8 months ago

During backup creation, for all tables with SETTINGS storage_policy='move_from_local_disks_to_s3', clickhouse-backup will execute s3:CopyObject into the tiered-backup path in your backup bucket
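The duration therefore scales with the number of objects on the object disk, since each one needs its own server-side copy. A rough back-of-the-envelope sketch (the bucket name, prefix, and per-object latency below are assumptions, not values taken from this issue):

```shell
# Rough sketch: backup time for an object disk grows with the number of
# objects, because each one requires its own server-side s3:CopyObject.
# The bucket/prefix below and the per-object latency are assumptions.

# Count objects under the object-disk prefix (requires the aws CLI):
#   objects=$(aws s3 ls --recursive s3://my-backup-bucket/tiered-backup/ | wc -l)

# Estimate the total copy time in seconds from an object count and an
# assumed per-object CopyObject latency in milliseconds.
estimate_copy_seconds() {
  objects="$1"
  per_object_ms="$2"
  echo $(( objects * per_object_ms / 1000 ))
}

# 100,000 objects at ~200 ms per copy comes out to 20000 s, about 5.5 hours,
# which is in the same ballpark as the multi-hour backups described above.
estimate_copy_seconds 100000 200
```

The copies run sequentially per object on the server side, which is why a table with many small parts on the object disk can take far longer to back up than its byte size suggests.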

we will improve the speed of CopyObject execution for object disk data in incremental backups in v2.5

Slach commented 8 months ago

check

SELECT 
  count() AS parts, database, 
  uniqExact(table) AS tables, active, disk_name, 
  formatReadableSize(sum(bytes_on_disk)) 
FROM system.parts 
GROUP BY database, active, disk_name 
FORMAT Vertical
hueiyuan commented 8 months ago

@Slach Thanks for the answer. I would like to confirm: what is the ETA for v2.5?

hueiyuan commented 8 months ago

check

SELECT 
  count() AS parts, database, 
  uniqExact(table) AS tables, active, disk_name, 
  formatReadableSize(sum(bytes_on_disk)) 
FROM system.parts 
GROUP BY database, active, disk_name 
FORMAT Vertical

@Slach For your information:

Row 1:
──────
parts:                                  38
database:                               otel
tables:                                 4
active:                                 1
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 1.86 GiB

Row 2:
──────
parts:                                  10462
database:                               otel
tables:                                 3
active:                                 0
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 2.98 GiB

Row 3:
──────
parts:                                  439
database:                               system
tables:                                 5
active:                                 0
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 128.27 MiB

Row 4:
──────
parts:                                  218
database:                               otel
tables:                                 4
active:                                 1
disk_name:                              s3_tier_cold
formatReadableSize(sum(bytes_on_disk)): 37.44 GiB

Row 5:
──────
parts:                                  234
database:                               system
tables:                                 7
active:                                 1
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 7.48 GiB

5 rows in set. Elapsed: 0.023 sec. Processed 11.39 thousand rows, 721.45 KB (488.22 thousand rows/s., 30.92 MB/s.)
Peak memory usage: 15.72 KiB.

--> Note that the above is our dev environment's data size.

For reference, our production size:

Row 1:
──────
parts:                                  204
database:                               otel
tables:                                 5
active:                                 1
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 334.00 GiB

Row 2:
──────
parts:                                  11862
database:                               otel
tables:                                 3
active:                                 0
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 3.19 GiB

Row 3:
──────
parts:                                  571
database:                               system
tables:                                 7
active:                                 0
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 275.28 MiB

Row 4:
──────
parts:                                  220
database:                               otel
tables:                                 3
active:                                 1
disk_name:                              s3_tier_cold
formatReadableSize(sum(bytes_on_disk)): 444.90 GiB

Row 5:
──────
parts:                                  343
database:                               system
tables:                                 7
active:                                 1
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 11.09 GiB

5 rows in set. Elapsed: 0.023 sec. Processed 13.20 thousand rows, 822.72 KB (565.34 thousand rows/s., 35.24 MB/s.)
Peak memory usage: 20.98 KiB.
Slach commented 8 months ago

@Slach Thanks for the answer. I would like to confirm: what is the ETA for v2.5?

subscribe to https://github.com/Altinity/clickhouse-backup/pull/843 and watch the progress

hueiyuan commented 7 months ago

@Slach Can v2.5 automatically re-execute the watch CLI after watch has stopped because of errors?

Slach commented 7 months ago

Can v2.5 automatically re-execute the watch CLI after watch has stopped because of errors?

it resolves the issue with reconnecting to clickhouse-server, but if backups keep failing for longer than the full watch period allows a full backup to be stored, the watch command sequence will still stop, because you need to sort out your configuration before continuing watch. Maybe we should change this behavior, but please create a new issue in that case.
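Until that behavior changes, one workaround is to supervise the watch process externally and restart it when it exits with an error. A minimal sketch (the retry limit and back-off interval are arbitrary choices, not clickhouse-backup options):

```shell
# Sketch of an external supervisor that restarts a command such as
# `clickhouse-backup watch` after a failure, since the watch sequence
# stops when backups keep failing. Retry limit and back-off are assumptions.

run_with_retry() {
  max_attempts="$1"; shift
  attempt=1
  while true; do
    if "$@"; then
      return 0            # command succeeded, stop retrying
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1            # persistent failure: inspect configuration
    fi
    attempt=$((attempt + 1))
    sleep 1               # back off before retrying (tune for real use)
  done
}

# Real usage would be something like:
#   run_with_retry 5 clickhouse-backup watch
```

A persistent failure still surfaces as a non-zero exit after the retry budget is spent, so genuine configuration problems are not silently masked.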