Altinity / clickhouse-backup

Tool for easy backup and restore for ClickHouse® using object storage for backup files.
https://altinity.com

Why does backup duration sometimes stretch to hours, with no errors? #880

Closed: hueiyuan closed this issue 8 months ago

hueiyuan commented 8 months ago

Description

Hi, everyone. We found that our backup duration is sometimes very fast, while some backups are slow, taking up to 6 hours. During these backups we do not see any errors. Does anyone have an idea that could explain this?

For context: our table TTL keeps data within 1 day stored on local disk, and data older than 1 day (up to 30 days) is stored on AWS S3. clickhouse-backup runs as a sidecar in the clickhouse-server pod, with a resource request of 1 core / 1 GB and a limit of 2 cores / 2 GB.

Additional questions

Slach commented 8 months ago

do you have an object disk like s3 or azure?

could you share the output of

SELECT * FROM system.disks
SELECT * FROM system.storage_policies

?

According to the non-empty

s3:
  object_disk_path: tiered-backup

it looks like yes.

so this is not actually a "pause"; this is server-side CopyObject execution, which allows you to restore your data after DROP TABLE ... SYNC / DROP DATABASE ... SYNC

try to change /etc/clickhouse-backup/config.yml

general:
  log_level: debug

and share the logs

hueiyuan commented 8 months ago

@Slach Thanks for your assistance. The requested output is below:

* `SELECT * FROM system.disks`

Row 2:
──────
name:             s3_tier_cold
path:             /var/lib/clickhouse/disks/s3_tier_cold/
free_space:       18446744073709551615
total_space:      18446744073709551615
unreserved_space: 18446744073709551615
keep_free_space:  0
type:             s3
is_encrypted:     0
is_read_only:     0
is_write_once:    0
is_remote:        1
is_broken:        0
cache_path:

2 rows in set. Elapsed: 0.002 sec.


* `SELECT * FROM system.storage_policies`

Row 1:
──────
policy_name:                default
volume_name:                default
volume_priority:            1
disks:                      ['default']
volume_type:                JBOD
max_data_part_size:         0
move_factor:                0
prefer_not_to_merge:        0
perform_ttl_move_on_insert: 1
load_balancing:             ROUND_ROBIN

Row 2:
──────
policy_name:                move_from_local_disks_to_s3
volume_name:                cold
volume_priority:            1
disks:                      ['s3_tier_cold']
volume_type:                JBOD
max_data_part_size:         0
move_factor:                0.1
prefer_not_to_merge:        0
perform_ttl_move_on_insert: 1
load_balancing:             ROUND_ROBIN

Row 3:
──────
policy_name:                move_from_local_disks_to_s3
volume_name:                hot
volume_priority:            2
disks:                      ['default']
volume_type:                JBOD
max_data_part_size:         0
move_factor:                0.1
prefer_not_to_merge:        0
perform_ttl_move_on_insert: 1
load_balancing:             ROUND_ROBIN

3 rows in set. Elapsed: 0.003 sec.



* The path `tiered-backup` in AWS S3 is not empty and has objects.
Slach commented 8 months ago

During backup creation, for all tables with SETTINGS storage_policy='move_from_local_disks_to_s3', clickhouse-backup will execute s3:CopyObject into the tiered-backup path in your backup bucket
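The duration therefore scales with the number of objects on the object disk, since each one needs its own server-side copy. A rough back-of-the-envelope sketch (the bucket name, prefix, and per-object latency below are assumptions, not values taken from this issue):

```shell
# Rough sketch: backup time for an object disk grows with the number of
# objects, because each one requires its own server-side s3:CopyObject.
# The bucket/prefix below and the per-object latency are assumptions.

# Count objects under the object-disk prefix (requires the aws CLI):
#   objects=$(aws s3 ls --recursive s3://my-backup-bucket/tiered-backup/ | wc -l)

# Estimate the total copy time in seconds from an object count and an
# assumed per-object CopyObject latency in milliseconds.
estimate_copy_seconds() {
  objects="$1"
  per_object_ms="$2"
  echo $(( objects * per_object_ms / 1000 ))
}

# 100,000 objects at ~200 ms per copy comes out to 20000 s, about 5.5 hours,
# which is in the same ballpark as the multi-hour backups described above.
estimate_copy_seconds 100000 200
```

The copies run sequentially per object on the server side, which is why a table with many small parts on the object disk can take far longer to back up than its byte size suggests.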

we will improve the speed of CopyObject execution for object disk data in incremental backups in v2.5

Slach commented 8 months ago

check

SELECT 
  count() AS parts, database, 
  uniqExact(table) AS tables, active, disk_name, 
  formatReadableSize(sum(bytes_on_disk)) 
FROM system.parts 
GROUP BY database, active, disk_name 
FORMAT Vertical
hueiyuan commented 8 months ago

@Slach Thanks for the answer. I would like to confirm: what is the ETA for v2.5?

hueiyuan commented 8 months ago

check

SELECT 
  count() AS parts, database, 
  uniqExact(table) AS tables, active, disk_name, 
  formatReadableSize(sum(bytes_on_disk)) 
FROM system.parts 
GROUP BY database, active, disk_name 
FORMAT Vertical

@Slach For your information:

Row 1:
──────
parts:                                  38
database:                               otel
tables:                                 4
active:                                 1
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 1.86 GiB

Row 2:
──────
parts:                                  10462
database:                               otel
tables:                                 3
active:                                 0
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 2.98 GiB

Row 3:
──────
parts:                                  439
database:                               system
tables:                                 5
active:                                 0
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 128.27 MiB

Row 4:
──────
parts:                                  218
database:                               otel
tables:                                 4
active:                                 1
disk_name:                              s3_tier_cold
formatReadableSize(sum(bytes_on_disk)): 37.44 GiB

Row 5:
──────
parts:                                  234
database:                               system
tables:                                 7
active:                                 1
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 7.48 GiB

5 rows in set. Elapsed: 0.023 sec. Processed 11.39 thousand rows, 721.45 KB (488.22 thousand rows/s., 30.92 MB/s.)
Peak memory usage: 15.72 KiB.

--> Note that the above is our dev environment's data size.

For reference, our production size:

Row 1:
──────
parts:                                  204
database:                               otel
tables:                                 5
active:                                 1
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 334.00 GiB

Row 2:
──────
parts:                                  11862
database:                               otel
tables:                                 3
active:                                 0
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 3.19 GiB

Row 3:
──────
parts:                                  571
database:                               system
tables:                                 7
active:                                 0
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 275.28 MiB

Row 4:
──────
parts:                                  220
database:                               otel
tables:                                 3
active:                                 1
disk_name:                              s3_tier_cold
formatReadableSize(sum(bytes_on_disk)): 444.90 GiB

Row 5:
──────
parts:                                  343
database:                               system
tables:                                 7
active:                                 1
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 11.09 GiB

5 rows in set. Elapsed: 0.023 sec. Processed 13.20 thousand rows, 822.72 KB (565.34 thousand rows/s., 35.24 MB/s.)
Peak memory usage: 20.98 KiB.
Slach commented 8 months ago

@Slach Thanks for the answer. I would like to confirm: what is the ETA for v2.5?

subscribe to https://github.com/Altinity/clickhouse-backup/pull/843 and watch the progress

hueiyuan commented 7 months ago

@Slach Can v2.5 automatically re-execute the watch CLI after watch has stopped because of errors?

Slach commented 7 months ago

Can v2.5 automatically re-execute the watch CLI after watch has stopped because of errors?

it resolves the issue with reconnecting to clickhouse-server, but if backups keep failing for longer than the full watch period allows a full backup to be stored, the watch command sequence will still stop, because you need to sort out your configuration before continuing watch. Maybe we should change this behavior, but please create a new issue in that case.
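Until that behavior changes, one workaround is to supervise the watch process externally and restart it when it exits with an error. A minimal sketch (the retry limit and back-off interval are arbitrary choices, not clickhouse-backup options):

```shell
# Sketch of an external supervisor that restarts a command such as
# `clickhouse-backup watch` after a failure, since the watch sequence
# stops when backups keep failing. Retry limit and back-off are assumptions.

run_with_retry() {
  max_attempts="$1"; shift
  attempt=1
  while true; do
    if "$@"; then
      return 0            # command succeeded, stop retrying
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1            # persistent failure: inspect configuration
    fi
    attempt=$((attempt + 1))
    sleep 1               # back off before retrying (tune for real use)
  done
}

# Real usage would be something like:
#   run_with_retry 5 clickhouse-backup watch
```

A persistent failure still surfaces as a non-zero exit after the retry budget is spent, so genuine configuration problems are not silently masked.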