
Backup to S3 failed with Unexpected part name error #62600

Open bessonov87 opened 6 months ago

bessonov87 commented 6 months ago

ClickHouse versions: 23.8 and 24.3.

I tried to start a backup using the following query:

BACKUP DATABASE prod TO S3('https://*******************.s3.us-west-2.amazonaws.com/base_backup/', 'AKIAW***********', '**************') SETTINGS s3_storage_class = 'GLACIER_IR' ASYNC

Here is the result:

SELECT *
FROM system.backups
ORDER BY start_time DESC
FORMAT Vertical

Query id: 7946f785-02d2-40fb-8c46-e7ed9311ff85

Row 1:
──────
id:                da7ca47b-9771-41ef-9cd0-8868cbeecbca
name:              S3('https://*************************.s3.us-west-2.amazonaws.com/base_backup/', 'AKIA************', '[HIDDEN]')
base_backup_name:
query_id:          c0c75171-2e96-4d32-9489-346c0580b524
status:            BACKUP_FAILED
error:             Code: 233. DB::Exception: Unexpected part name: 20190704_20190708_93270_163536_1890 for format version: 1: While checking data of table prod.RequestLog. (BAD_DATA_PART_NAME), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000cbcedbb
1. DB::Exception::Exception<String const&, StrongTypedef<unsigned int, DB::MergeTreeDataFormatVersionTag>&>(int, FormatStringHelperImpl<std::type_identity<String const&>::type, std::type_identity<StrongTypedef<unsigned int, DB::MergeTreeDataFormatVersionTag>&>::type>, String const&, StrongTypedef<unsigned int, DB::MergeTreeDataFormatVersionTag>&) @ 0x0000000011ec6aeb
2. DB::MergeTreePartInfo::fromPartName(String const&, StrongTypedef<unsigned int, DB::MergeTreeDataFormatVersionTag>) @ 0x0000000011ec65c4
3. DB::BackupCoordinationReplicatedTables::prepare() const @ 0x000000000fcdbd29
4. DB::BackupCoordinationReplicatedTables::getPartNames(String const&, String const&) const @ 0x000000000fcdbb3d
5. DB::BackupCoordinationLocal::getReplicatedPartNames(String const&, String const&) const @ 0x000000000fce9559
6. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<DB::StorageReplicatedMergeTree::backupData(DB::BackupEntriesCollector&, String const&, std::optional<absl::InlinedVector<std::shared_ptr<DB::IAST>, 7ul, std::allocator<std::shared_ptr<DB::IAST>>>> const&)::$_2, void ()>>(std::__function::__policy_storage const*) @ 0x000000001198fdab
7. DB::BackupEntriesCollector::run() @ 0x000000000fc5d7ce
8. DB::BackupsWorker::doBackup(std::shared_ptr<DB::IBackup>&, std::shared_ptr<DB::ASTBackupQuery> const&, String const&, String const&, DB::BackupInfo const&, DB::BackupSettings, std::shared_ptr<DB::IBackupCoordination>, std::shared_ptr<DB::Context const> const&, std::shared_ptr<DB::Context>) @ 0x000000000fc96f99
9. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<DB::BackupsWorker::startMakingBackup(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context const> const&)::$_1, void ()>>(std::__function::__policy_storage const*) @ 0x000000000fca504c
10. ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>) @ 0x000000000cc7b379
11. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x000000000cc7ebda
12. void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void ThreadPoolImpl<std::thread>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>>(void*) @ 0x000000000cc7d9ed
13. ? @ 0x00007525e9e94ac3
14. ? @ 0x00007525e9f26850
 (version 24.3.2.23 (official build))
start_time:        2024-04-12 11:33:18
end_time:          2024-04-12 11:33:19
num_files:         0
total_size:        0
num_entries:       0
uncompressed_size: 0
compressed_size:   0
files_read:        0
bytes_read:        0
ProfileEvents:     {'Query':1,'InitialQuery':1,'QueriesWithSubqueries':1,'QueryTimeMicroseconds':508,'OtherQueryTimeMicroseconds':508,'NetworkSendElapsedMicroseconds':68,'NetworkSendBytes':1562,'SelectedRows':1,'SelectedBytes':46,'ContextLock':9,'RealTimeMicroseconds':640,'UserTimeMicroseconds':503,'SoftPageFaults':5,'OSCPUWaitMicroseconds':58,'OSCPUVirtualTimeMicroseconds':529}

We have several clusters and I have tested backups on three of them. Everything works fine on one of them, but on the other two the backups fail with the same error described above. All of the clusters were on version 23.8.

What could be the reason for this behavior, and what can be done to fix it?

Here is the information about that part, in case it is needed:

SELECT *
FROM system.parts
WHERE name = '20190704_20190708_93270_163536_1890'
FORMAT Vertical

Query id: 4c51b553-61cf-4ca8-8960-b1bcf8af67da

Row 1:
──────
partition:                             201907
name:                                  20190704_20190708_93270_163536_1890
uuid:                                  00000000-0000-0000-0000-000000000000
part_type:                             Wide
active:                                1
marks:                                 197
rows:                                  1613286
bytes_on_disk:                         78923178
data_compressed_bytes:                 78891264
data_uncompressed_bytes:               106972152
primary_key_size:                      3546
marks_bytes:                           28368
secondary_indices_compressed_bytes:    0
secondary_indices_uncompressed_bytes:  0
secondary_indices_marks_bytes:         0
modification_time:                     2023-05-12 07:29:10
remove_time:                           1970-01-01 00:00:00
refcount:                              1
min_date:                              2019-07-04
max_date:                              2019-07-08
min_time:                              1970-01-01 00:00:00
max_time:                              1970-01-01 00:00:00
partition_id:                          201907
min_block_number:                      93270
max_block_number:                      163536
level:                                 1890
data_version:                          93270
primary_key_bytes_in_memory:           4334
primary_key_bytes_in_memory_allocated: 4844
is_frozen:                             0
database:                              prod
table:                                 RequestLog
engine:                                ReplicatedMergeTree
disk_name:                             default
path:                                  /data/clickhouse/store/bac/bac077ef-bcb1-48e3-bdbb-dfc37e7f7437/20190704_20190708_93270_163536_1890/
hash_of_all_files:                     86d2c069b07d9fb0bbd09611e117e823
hash_of_uncompressed_files:            80dd164bcbc96d9d2ad2619a2e9e34e9
uncompressed_hash_of_compressed_files: 69df4fffcd77f1ba30c3719b37d16dc3
delete_ttl_info_min:                   1970-01-01 00:00:00
delete_ttl_info_max:                   1970-01-01 00:00:00
move_ttl_info.expression:              []
move_ttl_info.min:                     []
move_ttl_info.max:                     []
default_compression_codec:             LZ4
recompression_ttl_info.expression:     []
recompression_ttl_info.min:            []
recompression_ttl_info.max:            []
group_by_ttl_info.expression:          []
group_by_ttl_info.min:                 []
group_by_ttl_info.max:                 []
rows_where_ttl_info.expression:        []
rows_where_ttl_info.min:               []
rows_where_ttl_info.max:               []
projections:                           []
visible:                               1
creation_tid:                          (1,1,'00000000-0000-0000-0000-000000000000')
removal_tid_lock:                      0
removal_tid:                           (0,0,'00000000-0000-0000-0000-000000000000')
creation_csn:                          1
removal_csn:                           0
has_lightweight_delete:                0
last_removal_attempt_time:             1970-01-01 00:00:00
removal_state:                         Cleanup thread hasn't seen this part yet
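
The failing name 20190704_20190708_93270_163536_1890 appears to follow the legacy part naming scheme of MergeTree data format version 0 (min_date_max_date_min_block_max_block_level), while format version 1 names parts as partition_id_min_block_max_block_level, which would explain why MergeTreePartInfo::fromPartName rejects it. As a rough heuristic (the regex below is an assumption and can false-positive on daily-partitioned tables whose block numbers have grown to 8 digits), the active parts that still carry legacy-style names could be listed like this:

-- Heuristic sketch: active parts whose names match the legacy
-- format-version-0 scheme (min_date_max_date_min_block_max_block_level).
SELECT database, table, name
FROM system.parts
WHERE active
  AND match(name, '^\\d{8}_\\d{8}_\\d+_\\d+_\\d+$')
ORDER BY database, table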
bessonov87 commented 5 months ago

While we were waiting for an answer here, we decided to try to figure something out ourselves. What ultimately helped was recreating the affected tables with the new syntax. We noticed that the errors were related to parts of tables created using the old syntax: some of our databases were created more than 6 years ago and there are still quite a lot of old tables in them. So after recreating the tables, changing the old syntax like this

ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/migration_clickhouse', '{replica}', eventDate, (version, apply_time), 8192)

to the current one like this

ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/migration_clickhouse', '{replica}')
PARTITION BY toYYYYMM(eventDate)
ORDER BY (version, apply_time)
SETTINGS index_granularity = 8192

the errors disappeared and backups started to complete successfully.

I'm not sure whether this is a bug, but it is unexpected behavior at the very least.
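
To find the remaining tables that are still defined with the legacy syntax, a rough check against system.tables could look like the sketch below. It assumes that engine_full reproduces the table definition: tables created with the new syntax include an ORDER BY clause there, while legacy definitions pass the date column, sorting key, and index granularity as engine arguments.

-- Rough sketch: Replicated*MergeTree tables still defined with the legacy
-- in-engine arguments (no separate ORDER BY / PARTITION BY clauses).
SELECT database, name, engine_full
FROM system.tables
WHERE engine LIKE 'Replicated%MergeTree'
  AND engine_full NOT ILIKE '%ORDER BY%'
ORDER BY database, name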