Altinity / clickhouse-backup

Tool for easy backup and restore for ClickHouse® using object storage for backup files.
https://altinity.com

GCS, `compression_format: none`: upload did not upload count.txt, and download failed #810

Closed: adudzic closed this issue 9 months ago

adudzic commented 10 months ago
# clickhouse-backup --version
Version:     2.4.4
Git Commit:  bd22bb41f5278ba75dc6498bf064b825f3195669
Build Date:  2023-11-07
# clickhouse-server --version
ClickHouse server version 23.9.2.56 (official build).

In this case, clickhouse-server is deployed as a cluster on GKE using clickhouse-operator, and clickhouse-backup runs in the pods as a sidecar container.

I am creating a backup to GCS using the create_remote command. clickhouse-backup reports success in the /status API call and shows no errors in the logs:

2024/01/11 00:06:16.136171  info done                      backup=1704931206 duration=6m5.814s operation=upload size=3.65GiB
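
For reference, a minimal sketch of polling the sidecar's status endpoint and decoding entries in the format quoted below; the localhost:7171 address, the /backup/status path, and the one-JSON-object-per-line response format are assumptions about a default REST API setup, not details confirmed in this report:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// statusEntry mirrors the JSON entries quoted in this issue,
// e.g. {"command":"restore_remote ...","status":"error",...}.
type statusEntry struct {
	Command string `json:"command"`
	Status  string `json:"status"`
	Start   string `json:"start"`
	Finish  string `json:"finish"`
	Error   string `json:"error"`
}

func main() {
	// Assumption: the clickhouse-backup sidecar serves its REST API on
	// localhost:7171 and GET /backup/status returns one JSON object per line.
	resp, err := http.Get("http://localhost:7171/backup/status")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	dec := json.NewDecoder(resp.Body)
	for dec.More() {
		var e statusEntry
		if err := dec.Decode(&e); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%-40s %-8s %s\n", e.Command, e.Status, e.Error)
	}
}
```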

When trying to restore the backup, some of the necessary files are missing from the remote storage:

 {"command":"restore_remote -rm 1704931206","status":"error","start":"2024-01-11 06:13:43","finish":"2024-01-11 06:16:37","error":"can't attach data parts for table '<db_name>.<table_name>': code: 226, message: No count.txt in part 1704758400_82912_82912_0"}

In the clickhouse-backup logs, there is a warning for gcs.PutFile:

2024/01/11 00:03:02.901800  warn gcs.PutFile: can't close writer: googleapi: Error 503: We encountered an internal error. Please try again., backendError
2024/01/11 00:03:03.988733  info done                      backup=1704931206 duration=1m25.194s operation=upload size=3.27MiB table=<db_name>.<table_name>

I think this should be retried as a failed upload attempt, but it does not seem to be; I do not see anything in the logs that would indicate a retry.

Sanitized backup config

# clickhouse-backup print-config
general:
    remote_storage: gcs
    max_file_size: 0
    disable_progress_bar: true
    backups_to_keep_local: 0
    backups_to_keep_remote: 30
    log_level: info
    allow_empty_backups: true
    download_concurrency: 2
    upload_concurrency: 2
    use_resumable_state: true
    restore_schema_on_cluster: ""
    upload_by_part: true
    download_by_part: true
    restore_database_mapping: {}
    retries_on_failure: 3
    retries_pause: 30s
    watch_interval: 1h
    full_interval: 24h
    watch_backup_name_template: shard{shard}-{type}-{time:20060102150405}
    sharded_operation_mode: ""
    cpu_nice_priority: 15
    io_nice_priority: idle
    retriesduration: 30s
    watchduration: 1h0m0s
    fullduration: 24h0m0s
clickhouse:
    username: xxx
    password: xxx
    host: localhost
    port: 9000
    disk_mapping: {}
    skip_tables:
        - system.*
        - INFORMATION_SCHEMA.*
        - information_schema.*
        - _temporary_and_external_tables.*
    skip_table_engines:
        - Kafka
        - MaterializedView
    timeout: 5m
    freeze_by_part: false
    freeze_by_part_where: ""
    use_embedded_backup_restore: false
    embedded_backup_disk: ""
    backup_mutations: true
    restore_as_attach: false
    check_parts_columns: true
    secure: false
    skip_verify: false
    sync_replicated_tables: false
    log_sql_queries: true
    config_dir: /etc/clickhouse-server/
    restart_command: exec:systemctl restart clickhouse-server
    ignore_not_exists_error_during_freeze: true
    check_replicas_before_attach: true
    tls_key: ""
    tls_cert: ""
    tls_ca: ""
    debug: false
gcs:
    credentials_file: ""
    credentials_json: |
        {
          ...
        }
    credentials_json_encoded: ""
    bucket: xxx
    path: xxx/{cluster}/shard-{shard}
    object_disk_path: ""
    compression_level: 1
    compression_format: none
    debug: false
    endpoint: ""
    storage_class: STANDARD
    object_labels: {}
    custom_storage_class_map: {}
    client_pool_size: 6
custom:
    upload_command: ""
    download_command: ""
    list_command: ""
    delete_command: ""
    command_timeout: 4h
    commandtimeoutduration: 4h0m0s
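
Regarding the retry expectation above: the config sets retries_on_failure: 3 and retries_pause: 30s. As a rough illustration only (not the project's actual retry code), a failed upload honoring those settings would look something like the sketch below, with the final error surfacing instead of a successful "done":

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retry calls fn up to retries+1 times, sleeping pause between attempts.
// Illustrative only; names and structure are not taken from clickhouse-backup.
func retry(ctx context.Context, retries int, pause time.Duration, fn func() error) error {
	var err error
	for attempt := 1; attempt <= retries+1; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		fmt.Printf("attempt %d failed: %v\n", attempt, err)
		if attempt == retries+1 {
			break
		}
		select {
		case <-time.After(pause):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", retries+1, err)
}

func main() {
	// Simulate an upload that keeps failing with a transient error,
	// like the 503 from GCS in this report.
	uploadPart := func() error { return errors.New("googleapi: Error 503: backendError") }

	// The reported config would use 30*time.Second; shortened here for the demo.
	err := retry(context.Background(), 3, time.Second, uploadPart)
	fmt.Println(err)
}
```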
Slach commented 10 months ago

Try to use the latest version, 2.4.17, and `compression_format: tar`.

{"command":"restore_remote -rm 1704931206","status":"error","start":"2024-01-11 06:13:43","finish":"2024-01-11 06:16:37","error":"can't attach data parts for table '.': code: 226, message: No count.txt in part 1704758400_82912_82912_0"}

This means count.txt is not present in the remote GCS bucket.
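
One way to confirm that directly is to list the objects under the part's prefix in the bucket and check for count.txt. A minimal sketch with the official Go GCS client; the bucket name and prefix below are placeholders, not the reporter's real layout:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx) // uses Application Default Credentials
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Placeholders: substitute the real bucket and the prefix under which
	// the part 1704758400_82912_82912_0 was uploaded for this backup.
	bucket := "my-backup-bucket"
	prefix := "xxx/my-cluster/shard-0/1704931206/"

	it := client.Bucket(bucket).Objects(ctx, &storage.Query{Prefix: prefix})
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(attrs.Name, attrs.Size) // look for .../count.txt here
	}
}
```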

2024/01/11 00:03:02.901800 warn gcs.PutFile: can't close writer: googleapi: Error 503: We encountered an internal error. Please try again., backendError

This error should be retried, and it should be handled properly if the retries don't help; see the details in

https://github.com/Altinity/clickhouse-backup/blob/master/pkg/backup/upload.go#L532-L545 and https://github.com/Altinity/clickhouse-backup/blob/master/pkg/storage/general.go#L587-L594

The warning could also just be printed at this line https://github.com/Altinity/clickhouse-backup/blob/master/pkg/storage/general.go#L602, which would mean PutFile succeeded but we couldn't close the descriptor, and that is weird.
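
A note on that last case: with the Go GCS client, an object is only committed once Writer.Close returns nil, so a failed Close usually means the object (here count.txt) was never created even though every Write succeeded. A minimal sketch of the pattern, with hypothetical bucket and object names:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
)

// uploadFile writes data to gs://bucket/name. The object only becomes
// visible once Close succeeds, so a Close error must be treated as a
// failed upload, not just a warning.
func uploadFile(ctx context.Context, client *storage.Client, bucket, name string, data []byte) error {
	w := client.Bucket(bucket).Object(name).NewWriter(ctx)
	if _, err := w.Write(data); err != nil {
		w.Close() // best effort; the upload has already failed
		return err
	}
	if err := w.Close(); err != nil {
		// e.g. "googleapi: Error 503: ... backendError": the object was most
		// likely not committed, which could leave count.txt missing remotely.
		return err
	}
	return nil
}

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Hypothetical bucket/object names, for illustration only.
	if err := uploadFile(ctx, client, "my-backup-bucket", "backup/count.txt", []byte("42\n")); err != nil {
		log.Fatal(err)
	}
}
```

If a Close failure like this were only logged as a warning, it would be consistent with the symptom reported here: the upload logs "done" while the part is missing count.txt on the remote.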

Could you provide the whole upload log? Could you share the result of the following command?

kubectl exec -n <your-name-space> <chi-your-pod-0-0-0> -c clickhouse-backup -- bash -c "GCS_DEBUG=1 LOG_LEVEL=debug clickhouse-backup create_remote test_backup --table=db.table"
Slach commented 10 months ago

@adudzic any news from your side?

Slach commented 10 months ago

@adudzic does the restore still fail for you?

Slach commented 9 months ago

Closing due to inactivity from the topic starter.