Altinity / clickhouse-backup

Tool for easy backup and restore for ClickHouse® using object storage for backup files.
https://altinity.com

Not able to restore backups - not found after download backup #950

Closed: LukaszMarchewka closed this issue 4 months ago

LukaszMarchewka commented 4 months ago

The restore_remote operation isn't deterministic and sometimes fails with errors like: one of Download go-routine return error: /bitnami/clickhouse/data/backup/incremental_2024-07-10T10-00-00/shadow/otel/otel_traces/default/20240710_2207_2526_27 not found after download backup
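For reference, the operation is a plain remote restore, i.e. essentially the following (backup name taken from the logs below; any additional flags omitted):

clickhouse-backup restore_remote incremental_2024-07-10T10-00-00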

2024/07/11 13:39:49.685739  info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2024/07/11 13:39:49.698327  info clickhouse connection success: tcp://localhost:9000 logger=clickhouse
2024/07/11 13:39:49.698359  info SELECT value FROM `system`.`build_options` where name='VERSION_INTEGER' logger=clickhouse
2024/07/11 13:39:49.699908  info SELECT countIf(name='type') AS is_disk_type_present, countIf(name='object_storage_type') AS is_object_storage_type_present, countIf(name='free_space') AS is_free_space_present, countIf(name='disks') AS is_storage_policy_present FROM system.columns WHERE database='system' AND table IN ('disks','storage_policies')  logger=clickhouse
2024/07/11 13:39:49.782025  info SELECT d.path, any(d.name) AS name, any(d.type) AS type, min(d.free_space) AS free_space, groupUniqArray(s.policy_name) AS storage_policies FROM system.disks AS d  LEFT JOIN (SELECT policy_name, arrayJoin(disks) AS disk FROM system.storage_policies) AS s ON s.disk = d.name GROUP BY d.path logger=clickhouse
2024/07/11 13:39:49.787164  info SELECT max(toInt64(bytes_on_disk * 1.02)) AS max_file_size FROM system.parts logger=clickhouse
2024/07/11 13:39:49.790511  info SELECT count() AS is_macros_exists FROM system.tables WHERE database='system' AND name='macros'  SETTINGS empty_result_for_aggregation_by_empty_set=0 logger=clickhouse
2024/07/11 13:39:49.792822  info SELECT macro, substitution FROM system.macros logger=clickhouse
2024/07/11 13:39:49.794467  info SELECT count() AS is_macros_exists FROM system.tables WHERE database='system' AND name='macros'  SETTINGS empty_result_for_aggregation_by_empty_set=0 logger=clickhouse
2024/07/11 13:39:49.797423  info SELECT macro, substitution FROM system.macros logger=clickhouse
2024/07/11 13:39:50.388097  info done                      backup=incremental_2024-07-10T10-00-00 duration=5ms logger=backuper operation=download size=485B table_metadata=otel.otel_migrations
2024/07/11 13:39:50.392838  info done                      backup=incremental_2024-07-10T10-00-00 duration=5ms logger=backuper operation=download size=1007B table_metadata=otel.otel_resources
2024/07/11 13:39:50.397182  info done                      backup=incremental_2024-07-10T10-00-00 duration=4ms logger=backuper operation=download size=4.56KiB table_metadata=otel.otel_traces
2024/07/11 13:39:50.400900  info done                      backup=incremental_2024-07-10T10-00-00 duration=4ms logger=backuper operation=download size=1.37KiB table_metadata=otel.otel_traces_field_values
2024/07/11 13:39:50.405545  info done                      backup=incremental_2024-07-10T10-00-00 duration=5ms logger=backuper operation=download size=1.45KiB table_metadata=otel.otel_traces_trace_id_ts
2024/07/11 13:39:50.410344  info done                      backup=incremental_2024-07-10T10-00-00 duration=5ms logger=backuper operation=download size=650B table_metadata=otel.otel_traces_field_values_fields_mv
2024/07/11 13:39:50.414693  info done                      backup=incremental_2024-07-10T10-00-00 duration=4ms logger=backuper operation=download size=586B table_metadata=otel.otel_traces_field_values_resource_attr_mv
2024/07/11 13:39:50.418784  info done                      backup=incremental_2024-07-10T10-00-00 duration=4ms logger=backuper operation=download size=567B table_metadata=otel.otel_traces_field_values_span_attr_mv
2024/07/11 13:39:50.423108  info done                      backup=incremental_2024-07-10T10-00-00 duration=4ms logger=backuper operation=download size=453B table_metadata=otel.otel_traces_trace_id_ts_mv
2024/07/11 13:39:50.987886  info done                      backupName=incremental_2024-07-10T09-50-00 duration=3ms logger=backuper operation=downloadTableMetadataIfNotExists size=485B table_metadata_diff=otel.otel_migrations
2024/07/11 13:39:51.491250  info done                      backupName=incremental_2024-07-10T09-40-00 duration=4ms logger=backuper operation=downloadTableMetadataIfNotExists size=485B table_metadata_diff=otel.otel_migrations
2024/07/11 13:39:51.988591  info done                      backupName=incremental_2024-07-10T09-30-00 duration=3ms logger=backuper operation=downloadTableMetadataIfNotExists size=485B table_metadata_diff=otel.otel_migrations
2024/07/11 13:39:52.491152  info done                      backupName=incremental_2024-07-10T09-20-00 duration=3ms logger=backuper operation=downloadTableMetadataIfNotExists size=485B table_metadata_diff=otel.otel_migrations
2024/07/11 13:39:53.084388  info done                      backupName=incremental_2024-07-10T09-10-00 duration=97ms logger=backuper operation=downloadTableMetadataIfNotExists size=485B table_metadata_diff=otel.otel_migrations
2024/07/11 13:39:53.587790  info done                      backupName=full_2024-07-10T09-05-00 duration=3ms logger=backuper operation=downloadTableMetadataIfNotExists size=527B table_metadata_diff=otel.otel_migrations
2024/07/11 13:39:53.598338  info done                      diff_parts=1 duration=3.175s logger=backuper operation=downloadDiffParts
2024/07/11 13:39:53.598376  info done                      backup=incremental_2024-07-10T10-00-00 duration=3.175s logger=backuper operation=download_data progress=1/9 size=301B table=otel.otel_migrations version=2.5.20
2024/07/11 13:39:53.690554  info done                      diff_parts=0 duration=0s logger=backuper operation=downloadDiffParts
2024/07/11 13:39:53.690588  info done                      backup=incremental_2024-07-10T10-00-00 duration=92ms logger=backuper operation=download_data progress=2/9 size=39.79KiB table=otel.otel_resources version=2.5.20
2024/07/11 13:39:55.089715  info done                      backupName=incremental_2024-07-10T09-50-00 duration=3ms logger=backuper operation=downloadTableMetadataIfNotExists size=4.33KiB table_metadata_diff=otel.otel_traces
2024/07/11 13:39:55.690702  info done                      backupName=incremental_2024-07-10T09-40-00 duration=3ms logger=backuper operation=downloadTableMetadataIfNotExists size=4.64KiB table_metadata_diff=otel.otel_traces
2024/07/11 13:39:56.190481  info done                      backupName=incremental_2024-07-10T09-30-00 duration=4ms logger=backuper operation=downloadTableMetadataIfNotExists size=4.24KiB table_metadata_diff=otel.otel_traces
2024/07/11 13:39:56.786005  info done                      backupName=incremental_2024-07-10T09-20-00 duration=4ms logger=backuper operation=downloadTableMetadataIfNotExists size=4.25KiB table_metadata_diff=otel.otel_traces
2024/07/11 13:39:57.289419  info done                      backupName=incremental_2024-07-10T09-10-00 duration=3ms logger=backuper operation=downloadTableMetadataIfNotExists size=4.62KiB table_metadata_diff=otel.otel_traces
2024/07/11 13:39:57.885684  info done                      backupName=full_2024-07-10T09-05-00 duration=3ms logger=backuper operation=downloadTableMetadataIfNotExists size=4.82KiB table_metadata_diff=otel.otel_traces
2024/07/11 13:42:37.584951  info done                      diff_parts=0 duration=0s logger=backuper operation=downloadDiffParts
2024/07/11 13:42:37.584980  info done                      backup=incremental_2024-07-10T10-00-00 duration=0s logger=backuper operation=download_data progress=6/9 size=0B table=otel.otel_traces_field_values_fields_mv version=2.5.20
2024/07/11 13:42:37.585015  info done                      diff_parts=0 duration=0s logger=backuper operation=downloadDiffParts
2024/07/11 13:42:37.585028  info done                      backup=incremental_2024-07-10T10-00-00 duration=0s logger=backuper operation=download_data progress=7/9 size=0B table=otel.otel_traces_field_values_resource_attr_mv version=2.5.20
2024/07/11 13:42:37.585068  info done                      diff_parts=0 duration=0s logger=backuper operation=downloadDiffParts
2024/07/11 13:42:37.585083  info done                      backup=incremental_2024-07-10T10-00-00 duration=0s logger=backuper operation=download_data progress=8/9 size=0B table=otel.otel_traces_field_values_span_attr_mv version=2.5.20
2024/07/11 13:42:37.585126  info done                      diff_parts=0 duration=0s logger=backuper operation=downloadDiffParts
2024/07/11 13:42:37.585142  info done                      backup=incremental_2024-07-10T10-00-00 duration=0s logger=backuper operation=download_data progress=9/9 size=0B table=otel.otel_traces_trace_id_ts_mv version=2.5.20
2024/07/11 13:42:37.585200  info clickhouse connection closed logger=clickhouse
2024/07/11 13:42:37.585215 error one of Download go-routine return error: /bitnami/clickhouse/data/backup/incremental_2024-07-10T10-00-00/shadow/otel/otel_traces/default/20240710_2207_2526_27 not found after download backup
command terminated with exit code 1

The missing file exists on the remote storage but not on the local file system (see the attached screenshots).

I executed the restore operation again a few hours later (the same backup) and everything was fine; the missing directory was created on local storage (screenshot attached).

I analyzed the logs and can see that the failed execution starts a new download but never completes it (screenshot attached). Link to the source code that starts these downloads: https://github.com/Altinity/clickhouse-backup/blob/be523f5ba78046df70eeee0578c3d18d5c0a6e8b/pkg/backup/download.go#L644

Slach commented 4 months ago

Let's figure out what's going on with your setup.

Could you share

clickhouse-backup print-config

(without sensitive credentials)?
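For example, something like this strips the obvious secrets (a rough sketch; adjust the patterns as needed):

clickhouse-backup print-config | grep -viE 'key|password|credential'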

Slach commented 4 months ago

It looks like you're using a Bitnami Docker container. Is it the Bitnami ClickHouse Helm chart, or just the container?

Slach commented 4 months ago

Could you share the output of

cat /bitnami/clickhouse/data/backup/incremental_2024-07-10T10-00-00/metadata/otel/otel_traces.json

LukaszMarchewka commented 4 months ago

My configuration:

general:
    remote_storage: s3
    max_file_size: 1073741824
    backups_to_keep_local: -1
    backups_to_keep_remote: 308
    log_level: info
    allow_empty_backups: false
    download_concurrency: 1
    upload_concurrency: 1
    upload_max_bytes_per_second: 0
    download_max_bytes_per_second: 0
    object_disk_server_side_copy_concurrency: 32
    use_resumable_state: true
    restore_schema_on_cluster: default
    upload_by_part: true
    download_by_part: true
    restore_database_mapping: {}
    restore_table_mapping: {}
    retries_on_failure: 3
    retries_pause: 30s
    watch_interval: 1h
    full_interval: 24h
    watch_backup_name_template: shard{shard}-{type}-{time:20060102150405}
    sharded_operation_mode: none
    cpu_nice_priority: 15
    io_nice_priority: idle
    rbac_backup_always: true
    rbac_conflict_resolution: recreate
    retriesduration: 30s
    watchduration: 1h0m0s
    fullduration: 24h0m0s
clickhouse:
    username: admin
    password: admin
    host: localhost
    port: 9000
    disk_mapping: {}
    skip_tables:
        - system.*
        - INFORMATION_SCHEMA.*
        - information_schema.*
    skip_table_engines: []
    timeout: 5m
    freeze_by_part: false
    freeze_by_part_where: ""
    use_embedded_backup_restore: false
    embedded_backup_disk: ""
    backup_mutations: true
    restore_as_attach: false
    check_parts_columns: true
    secure: false
    skip_verify: false
    sync_replicated_tables: true
    log_sql_queries: true
    config_dir: /bitnami/clickhouse/etc
    restart_command: exec:systemctl restart clickhouse-server
    ignore_not_exists_error_during_freeze: true
    check_replicas_before_attach: true
    tls_key: ""
    tls_cert: ""
    tls_ca: ""
    max_connections: 1
    debug: false
s3:
    access_key: xxx
    secret_key: xxx
    bucket: xxx
    endpoint: xxx
    region: us-east-1
    acl: private
    assume_role_arn: ""
    force_path_style: false
    path: ""
    object_disk_path: ""
    disable_ssl: true
    compression_level: 1
    compression_format: tar
    sse: ""
    sse_kms_key_id: ""
    sse_customer_algorithm: ""
    sse_customer_key: ""
    sse_customer_key_md5: ""
    sse_kms_encryption_context: ""
    disable_cert_verification: false
    use_custom_storage_class: false
    storage_class: STANDARD
    custom_storage_class_map: {}
    concurrency: 1
    part_size: 0
    max_parts_count: 10000
    allow_multipart_download: false
    object_labels: {}
    request_payer: ""
    check_sum_algorithm: ""
    debug: false
gcs:
    credentials_file: ""
    credentials_json: ""
    credentials_json_encoded: ""
    embedded_access_key: ""
    embedded_secret_key: ""
    skip_credentials: false
    bucket: ""
    path: ""
    object_disk_path: ""
    compression_level: 1
    compression_format: tar
    debug: false
    force_http: false
    endpoint: ""
    storage_class: STANDARD
    object_labels: {}
    custom_storage_class_map: {}
    client_pool_size: 6
    chunk_size: 0
cos:
    url: ""
    timeout: 2m
    secret_id: ""
    secret_key: ""
    path: ""
    compression_format: tar
    compression_level: 1
    debug: false
api:
    listen: localhost:7171
    enable_metrics: true
    enable_pprof: false
    username: ""
    password: ""
    secure: false
    certificate_file: ""
    private_key_file: ""
    ca_cert_file: ""
    ca_key_file: ""
    create_integration_tables: false
    integration_tables_host: ""
    allow_parallel: false
    complete_resumable_after_restart: true
    watch_is_main_process: false
ftp:
    address: ""
    timeout: 2m
    username: ""
    password: ""
    tls: false
    skip_tls_verify: false
    path: ""
    object_disk_path: ""
    compression_format: tar
    compression_level: 1
    concurrency: 6
    debug: false
sftp:
    address: ""
    port: 22
    username: ""
    password: ""
    key: ""
    path: ""
    object_disk_path: ""
    compression_format: tar
    compression_level: 1
    concurrency: 6
    debug: false
azblob:
    endpoint_schema: https
    endpoint_suffix: core.windows.net
    account_name: ""
    account_key: ""
    sas: ""
    use_managed_identity: false
    container: ""
    path: ""
    object_disk_path: ""
    compression_level: 1
    compression_format: tar
    sse_key: ""
    buffer_size: 0
    buffer_count: 3
    max_parts_count: 256
    timeout: 4h
    debug: false
custom:
    upload_command: ""
    download_command: ""
    list_command: ""
    delete_command: ""
    command_timeout: 4h
    commandtimeoutduration: 4h0m0s

Helm chart: Bitnami; image: Altinity.

@Slach I see your comment on https://github.com/Altinity/clickhouse-backup/issues/949 . Is it possible that the error is caused by concurrent execution? I run backups in the background.

Slach commented 4 months ago

I run backups in the background.

How exactly do you run the backup? Via the REST API / SQL, or by directly executing the clickhouse-backup binary inside the pod with clickhouse-server?
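(e.g. a background watch loop like

nohup clickhouse-backup watch &

or a cron job that calls create_remote periodically?)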

Slach commented 4 months ago

Could you share the output of

cat /bitnami/clickhouse/data/backup/incremental_2024-07-10T10-00-00/metadata/otel/otel_traces.json ?

LukaszMarchewka commented 4 months ago

I use the clickhouse-backup binary tool.
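It is executed inside the ClickHouse pod, roughly like this (namespace and pod name are placeholders):

kubectl exec -n <namespace> <clickhouse-pod> -- clickhouse-backup restore_remote incremental_2024-07-10T10-00-00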

The metadata file (I hope this is the one you asked about):

{
  "files": {
   "default": [
    "default_20240710_2207_2526_27.tar",
    "default_20240710_2527_2765_26.tar",
    "default_20240710_2766_2771_1.tar",
    "default_20240710_2772_2772_0.tar",
    "default_20240710_2773_2773_0.tar",
    "default_20240710_2774_2774_0.tar"
   ]
  },
  "table": "otel_traces",
  "database": "otel",
  "parts": {
   "default": [
    {
     "name": "20240703_0_0_0",
     "required": true
    },
    {
     "name": "20240703_10_10_0",
     "required": true
    },
    {
     "name": "20240703_1_1_0",
     "required": true
    },
    {
     "name": "20240703_2_2_0",
     "required": true
    },
    {
     "name": "20240703_3_3_0",
     "required": true
    },
    {
     "name": "20240703_4_4_0",
     "required": true
    },
    {
     "name": "20240703_5_5_0",
     "required": true
    },
    {
     "name": "20240703_6_6_0",
     "required": true
    },
    {
     "name": "20240703_7_7_0",
     "required": true
    },
    {
     "name": "20240703_8_8_0",
     "required": true
    },
    {
     "name": "20240703_9_9_0",
     "required": true
    },
    {
     "name": "20240704_0_0_0",
     "required": true
    },
    {
     "name": "20240704_10_10_0",
     "required": true
    },
    {
     "name": "20240704_1_1_0",
     "required": true
    },
    {
     "name": "20240704_2_2_0",
     "required": true
    },
    {
     "name": "20240704_3_3_0",
     "required": true
    },
    {
     "name": "20240704_4_4_0",
     "required": true
    },
    {
     "name": "20240704_5_5_0",
     "required": true
    },
    {
     "name": "20240704_6_6_0",
     "required": true
    },
    {
     "name": "20240704_7_7_0",
     "required": true
    },
    {
     "name": "20240704_8_8_0",
     "required": true
    },
    {
     "name": "20240704_9_9_0",
     "required": true
    },
    {
     "name": "20240705_0_0_0",
     "required": true
    },
    {
     "name": "20240705_1_1_0",
     "required": true
    },
    {
     "name": "20240705_2_2_0",
     "required": true
    },
    {
     "name": "20240705_3_3_0",
     "required": true
    },
    {
     "name": "20240705_4_4_0",
     "required": true
    },
    {
     "name": "20240705_5_5_0",
     "required": true
    },
    {
     "name": "20240705_6_6_0",
     "required": true
    },
    {
     "name": "20240705_7_7_0",
     "required": true
    },
    {
     "name": "20240710_0_1282_29",
     "required": true
    },
    {
     "name": "20240710_1283_2206_29",
     "required": true
    },
    {
     "name": "20240710_2207_2526_27"
    },
    {
     "name": "20240710_2527_2765_26"
    },
    {
     "name": "20240710_2766_2771_1"
    },
    {
     "name": "20240710_2772_2772_0"
    },
    {
     "name": "20240710_2773_2773_0"
    },
    {
     "name": "20240710_2774_2774_0"
    }
   ]
  },
  "query": "CREATE TABLE otel.otel_traces UUID '1ad4ab1f-6762-3ca3-b540-5d86aa3dca82' (`Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)), `ResourceRef` UUID CODEC(ZSTD(1)), `TraceId` String CODEC(ZSTD(1)), `SpanId` String CODEC(ZSTD(1)), `ParentSpanId` String CODEC(ZSTD(1)), `TraceState` String CODEC(ZSTD(1)), `SpanName` LowCardinality(String) CODEC(ZSTD(1)), `SpanKind` LowCardinality(String) CODEC(ZSTD(1)), `ServiceName` LowCardinality(String) CODEC(ZSTD(1)), `ScopeName` String CODEC(ZSTD(1)), `ScopeVersion` String CODEC(ZSTD(1)), `SpanAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)), `Duration` Int64 CODEC(ZSTD(1)), `StatusCode` LowCardinality(String) CODEC(ZSTD(1)), `StatusMessage` String CODEC(ZSTD(1)), `SpanParentType` LowCardinality(String) CODEC(ZSTD(1)), `Events.Timestamp` Array(DateTime64(9)) CODEC(ZSTD(1)), `Events.Name` Array(LowCardinality(String)) CODEC(ZSTD(1)), `Events.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)), `Links.TraceId` Array(String) CODEC(ZSTD(1)), `Links.SpanId` Array(String) CODEC(ZSTD(1)), `Links.TraceState` Array(String) CODEC(ZSTD(1)), `Links.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)), INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1, INDEX idx_span_attr_key mapKeys(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1, INDEX idx_span_attr_value mapValues(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1, INDEX idx_duration Duration TYPE minmax GRANULARITY 1, INDEX idx_scope_name ScopeName TYPE bloom_filter(0.001) GRANULARITY 1, INDEX idx_scope_version ScopeVersion TYPE bloom_filter(0.001) GRANULARITY 1) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') PARTITION BY toDate(Timestamp) ORDER BY (ServiceName, ResourceRef, SpanName, toUnixTimestamp(Timestamp), TraceId) TTL toDateTime(Timestamp) + toIntervalDay(7) SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1",
  "size": {
   "default": 11470627
  },
  "total_bytes": 1803692416,
  "metadata_only": false
}

Slach commented 4 months ago

Hm, OK, this looks weird: 20240710_2207_2526_27 should be present after the download.
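In the metadata above, 20240710_2207_2526_27 is listed without "required": true, which means its data should be carried by this backup's own archive (default_20240710_2207_2526_27.tar in the "files" list) rather than fetched from an earlier backup in the chain. A quick way to list the parts that this download must write locally (assuming jq is available):

jq -r '.parts.default[] | select(.required != true) | .name' /bitnami/clickhouse/data/backup/incremental_2024-07-10T10-00-00/metadata/otel/otel_traces.json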

OK, let's try from scratch. Please share the full logs as text:

clickhouse-backup delete local incremental_2024-07-10T10-00-00
LOG_LEVEL=debug clickhouse-backup download incremental_2024-07-10T10-00-00
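and then check whether the part directory showed up locally, e.g.:

ls -la /bitnami/clickhouse/data/backup/incremental_2024-07-10T10-00-00/shadow/otel/otel_traces/default/20240710_2207_2526_27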

LukaszMarchewka commented 4 months ago

It works now (as I mentioned, it only happens sometimes). Maybe it is caused by backups running in the background; I'm going to disable them and test.

Slach commented 4 months ago

Maybe two different instances of clickhouse-backup hit a race condition (e.g. on the *.s3 cache) and one instance dropped the content of the partially downloaded backup ... but I'm not sure.
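If overlapping runs turn out to be the cause, one possible workaround (just a sketch; the lock file path is arbitrary) is to serialize clickhouse-backup invocations with flock:

flock -n /var/lock/clickhouse-backup.lock clickhouse-backup restore_remote incremental_2024-07-10T10-00-00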

OK. Create a new issue if it happens again.