Altinity / clickhouse-backup

Tool for easy backup and restore for ClickHouse® using object storage for backup files.
https://altinity.com
Other
1.29k stars 227 forks source link

errors during restore from diff backups #392

Closed vanyasvl closed 2 years ago

vanyasvl commented 2 years ago

Hello. I have an issue to restore data in version 1.3.0 --remote-diff backups. clickhouse-backup can't find partition in the backup.

2022/02/12 05:11:50.115058 debug findDiffFileExist not found part=20211028_124923_124929_1_3232480 tableRemoteFile=backup_11-02-2022_02:10:02/shadow/flows/flows/disk_ssd_20211028_124923_124929_1_3232480.tar tableRemotePath=backup_11-02-2022_02:10:02/shadow/flows/flows/disk_ssd_20211028_124923_124929_1_3232480.tar
2022/02/12 05:11:50.115077 debug findDiffOnePart           database=flows part=20211028_124923_124929_1_3232480 table=flows
2022/02/12 05:11:50.115088 debug findDiffOnePartArchive    database=flows part=20211028_124923_124929_1_3232480 table=flows
2022/02/12 05:11:50.139582 debug findDiffFileExist found   part=20211029_186226_187016_10_3232480 tableRemoteFile=backup_10-02-2022_14:43:25/shadow/flows/flows/disk_hdd_20211029_186226_187016_10_3232480.tar tableRemotePath=backup_10-02-2022_14:43:25/shadow/flows/flows/disk_hdd_20211029_186226_187016_10_3232480.tar
2022/02/12 05:11:50.139601 debug start download from backup_10-02-2022_14:43:25/shadow/flows/flows/disk_hdd_20211029_186226_187016_10_3232480.tar operation=downloadDiffRemoteFile
2022/02/12 05:11:50.155120 debug findDiffFileExist not found part=20211030_187008_261433_14_3232480 tableRemoteFile=backup_11-02-2022_02:10:02/shadow/flows/flows/disk_hdd_20211030_187008_261433_14_3232480.tar tableRemotePath=backup_11-02-2022_02:10:02/shadow/flows/flows/disk_hdd_20211030_187008_261433_14_3232480.tar
2022/02/12 05:11:50.155138 debug findDiffOnePart           database=flows part=20211030_187008_261433_14_3232480 table=flows
2022/02/12 05:11:50.155147 debug findDiffOnePartArchive    database=flows part=20211030_187008_261433_14_3232480 table=flows
2022/02/12 05:11:50.176390 debug findDiffFileExist not found part=20211028_124923_124929_1_3232480 tableRemoteFile=backup_11-02-2022_02:10:02/shadow/flows/flows/default_20211028_124923_124929_1_3232480.tar tableRemotePath=backup_11-02-2022_02:10:02/shadow/flows/flows/default_20211028_124923_124929_1_3232480.tar
2022/02/12 05:11:50.206386 debug findDiffFileExist not found part=20211030_187008_261433_14_3232480 tableRemoteFile=backup_11-02-2022_02:10:02/shadow/flows/flows/default_20211030_187008_261433_14_3232480.tar tableRemotePath=backup_11-02-2022_02:10:02/shadow/flows/flows/default_20211030_187008_261433_14_3232480.tar
2022/02/12 05:11:50.206405 debug findDiffOnePart           database=flows part=20211030_187008_261433_14_3232480 table=flows
2022/02/12 05:11:50.206414 debug findDiffOnePartArchive    database=flows part=20211030_187008_261433_14_3232480 table=flows
 0 / 135278592    0.00%2022/02/12 05:11:50.258211 debug findDiffFileExist not found part=20211030_187008_261433_14_3232480 tableRemoteFile=backup_11-02-2022_02:10:02/shadow/flows/flows/disk_ssd_20211030_187008_261433_14_3232480.tar tableRemotePath=backup_11-02-2022_02:10:02/shadow/flows/flows/disk_ssd_20211030_187008_261433_14_3232480.tar
2022/02/12 05:11:51.276377 debug finish download from backup_10-02-2022_14:43:25/shadow/flows/flows/disk_hdd_20211029_186226_187016_10_3232480.tar operation=downloadDiffRemoteFile
2022/02/12 05:11:51.276433 error one of Download go-routine return error: one of downloadDiffParts go-routine return error: flows.asn_bandwidth 20211027_6933_8455_7 not found on backup_11-02-2022_02:10:02 and all required backups sequence
pam_unix(sudo:session): session closed for user clickhouse
clickhouse-restore-test.service: Main process exited, code=exited, status=1/FAILURE
clickhouse-restore-test.service: Failed with result 'exit-code'.
Failed to start clickhouse-restore-test.
clickhouse-backup list 
10.215.2.76_12-02-2022_02:10:03   ???         12/02/2022 05:10:12   local
10.215.2.76_11-02-2022_02:10:02   ???         12/02/2022 05:10:13   local
10.215.2.76_10-02-2022_14:43:25   ???         12/02/2022 05:10:14   local
....
backup_10-02-2022_14:43:25   2.19TiB     10/02/2022 16:00:18   remote                                      tar
backup_11-02-2022_02:10:02   13.67GiB    11/02/2022 02:10:45   remote   +backup_10-02-2022_14:43:25   tar
backup_12-02-2022_02:10:03   26.92GiB    12/02/2022 02:11:22   remote   +backup_11-02-2022_02:10:02   tar

File default_20211027_6933_8455_7.tar exists on remote storage as well.

Please ask what other information/outputs you need Thanks

Same behavior on multiple servers with the same database scheme

vanyasvl commented 2 years ago

More debug

---[ REQUEST POST-SIGN ]-----------------------------
HEAD /flow-storage-clickhouse-sg-backups/backup_11-02-2022_02%3A10%3A02/shadow/flows/flows/default_20211030_187008_261433_14_3232480.tar HTTP/1.1
Host: storage.files.sin01.cloud.servers.com
User-Agent: aws-sdk-go/1.40.31 (go1.17.6; linux; amd64)
Authorization: AWS4-HMAC-SHA256 Credential=***/20220212/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=b706d9ef86b80daac649c1c6ac615a8449184b5a3b23794481a15fd0b1c46d1e
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20220212T062200Z

-----------------------------------------------------
2022/02/12 06:22:00.191863  info DEBUG: Response s3/HeadObject Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 404 Not Found
Connection: keep-alive
Content-Type: application/xml
Date: Sat, 12 Feb 2022 06:22:00 GMT
Server: nginx
X-Amz-Id-2: txa27e41b1435548ccaab46-0062075208
X-Amz-Request-Id: txa27e41b1435548ccaab46-0062075208
X-Openstack-Request-Id: txa27e41b1435548ccaab46-0062075208
X-Trans-Id: txa27e41b1435548ccaab46-0062075208
Content-Length: 0

-----------------------------------------------------
2022/02/12 06:22:00.191883 debug findDiffFileExist not found part=20211030_187008_261433_14_3232480 tableRemoteFile=backup_11-02-2022_02:10:02/shadow/flows/flows/default_20211030_187008_261433_14_3232480.tar tableRemotePath=backup_11-02-2022_02:10:02/shadow/flows/flows/default_20211030_187008_261433_14_3232480.tar
2022/02/12 06:22:00.251266  info DEBUG: Response s3/GetObject Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 200 OK
Content-Length: 9214307328
Connection: keep-alive
Content-Type: application/x-tar
Date: Sat, 12 Feb 2022 06:22:00 GMT
Etag: "265ce3bc4a80fe5646d5aecb09372c2e"
Last-Modified: Thu, 10 Feb 2022 14:56:57 GMT
Server: nginx
X-Amz-Id-2: tx326f8bcb80544218881cb-0062075208
X-Amz-Request-Id: tx326f8bcb80544218881cb-0062075208
X-Openstack-Request-Id: tx326f8bcb80544218881cb-0062075208
X-Trans-Id: tx326f8bcb80544218881cb-0062075208

-----------------------------------------------------
2022/02/12 06:23:09.534321 debug finish download from backup_10-02-2022_14:43:25/shadow/flows/flows/disk_hdd_20211027_1_31158_12_3232480.tar operation=downloadDiffRemoteFile
2022/02/12 06:23:09.534405 error one of Download go-routine return error: one of downloadDiffParts go-routine return error: flows.asn_bandwidth 20211030_58588_58797_10 not found on backup_11-02-2022_02:10:02 and all required backups sequence
pam_unix(sudo:session): session closed for user clickhouse
clickhouse-restore-test.service: Main process exited, code=exited, status=1/FAILURE
clickhouse-restore-test.service: Failed with result 'exit-code'.
Failed to start clickhouse-restore-test.

As you can see, clickhouse backup asked for file default_20211030_187008_261433_14_3232480.tar for some reason, but then print that part 20211030_58588_58797_10 not found. But file default_20211030_58588_58797_10.tar exists, why clickhouse-backup trying to download non existing file default_20211030_187008_261433_14_3232480.tar instead of existing default_20211030_58588_58797_10.tar ?

vanyasvl commented 2 years ago

See backups metadata in the attach metadata.zip h

vanyasvl commented 2 years ago

And I'm getting this issue only with the latest backup backup_12-02-2022_02:10:03 Pervious backups backup_10-02-2022_14:43:25 (full) and backup_11-02-2022_02:10:02 (diff) downloads without any issues.

vanyasvl commented 2 years ago

Ok, I've looked a little bit into the code. The issue with recursive download, it doesn't work deeper than one level. Condition if requiredPart.Required { is false in function findDiffRecursive, because all partitions in metadata doesn't have required bool flag.

Slach commented 2 years ago

could you share clickhouse-backup list remote ?

vanyasvl commented 2 years ago

Ok, it's seems I found root of the problem: Metadata json files on remote storage has field "required: true" on parts that required older backup in chain. Function downloadTableMetadataIfNotExists check if json locally not exists and call downloadTableMetadata function. Function downloadTableMetadata downloads and parse json, return metadata and save json locally. Here all good, but on the second call, function downloadTableMetadataIfNotExists found previously saved json on local disk, parse it and return metadata from the local file. The issue, that downloadTableMetadata save simplified json on disc, without parts required field and function downloadTableMetadataIfNotExists on the second call returns metadata with required fields false for all partitions.

Patch that fixes this:

diff --git a/pkg/metadata/save.go b/pkg/metadata/save.go
index c47a1d8..3aaab26 100644
--- a/pkg/metadata/save.go
+++ b/pkg/metadata/save.go
@@ -24,6 +24,7 @@ func (tm *TableMetadata) Save(location string, metadataOnly bool) (uint64, error
                for i := range p {
                        newp[i] = Part{
                                Name: p[i].Name,
+                               Required: p[i].Required,
                        }
                }
                parts[disk] = newp

But after that I have another error:

022/02/12 10:11:54.477170 error one of Download go-routine return error: one of downloadDiffParts go-routine return error: can't to add link to exists part /var/lib/clickhouse/backup/backup_12-02-2022_02:10:03/shadow/flows/asn_bandwidth/default/20211031_59771_59858_6 -> /var/lib/clickhouse/backup/backup_11-02-2022_02:10:02/shadow/flows/asn_bandwidth/default/20211031_59771_59858_6 error: open /var/lib/clickhouse/backup/backup_11-02-2022_02:10:02/shadow/flows/asn_bandwidth/default/20211031_59771_59858_6: no such file or directory

clcikhouse-backup tries to make link from the wrong backup directory

vanyasvl commented 2 years ago

To reproduce:

  1. Make and upload full backup
  2. Add more than one partition to the table
  3. Upload --diff-from-remote 1st_backup
  4. Again add more than one partition to the table
  5. Upload --diff-from-remote 2nd_backup

Try to download 3rd backup

Slach commented 2 years ago

I really appreciate to deep debug and steps to reproduce, i will improve integration_test.go for this scenario

vanyasvl commented 2 years ago

And creating hard links doesn't work because existsPath in downloadDiffParts function not calculated recursively.

Quick dirty fix:

diff --git a/pkg/backup/download.go b/pkg/backup/download.go
index 19bb5b6..388bda5 100644
--- a/pkg/backup/download.go
+++ b/pkg/backup/download.go
@@ -400,6 +400,7 @@ func (b *Backuper) downloadDiffParts(remoteBackup metadata.BackupMetadata, table
                                        }

                                        for tableRemoteFile, tableLocalDir := range tableRemoteFiles {
+                                               existsPath = path.Join(tableLocalDir, partForDownload.Name)
                                                err = b.downloadDiffRemoteFile(diffRemoteFilesLock, diffRemoteFilesCache, tableRemoteFile, tableLocalDir)
                                                if err != nil {
                                                        return err
Slach commented 2 years ago

@vanyasvl could you check latest 1.3.1 release? https://github.com/AlexAkulov/clickhouse-backup/releases

As a quick workaround, I would like to suggest use DOWNLOAD_BY_PARTS=false environment variable or config option to return 1.2.x download behavior (full download required backup sequences)

vanyasvl commented 2 years ago

@vanyasvl could you check latest 1.3.1 release? https://github.com/AlexAkulov/clickhouse-backup/releases

As a quick workaround, I would like to suggest use DOWNLOAD_BY_PARTS=false environment variable or config option to return 1.2.x download behavior (full download required backup sequences)

Now it works as expected. Thanks