Altinity / clickhouse-backup

Tool for easy backup and restore for ClickHouse® using object storage for backup files.
https://altinity.com

API: /backup/list #902

Closed: BoweFlex closed this issue 6 months ago

BoweFlex commented 6 months ago

Trying to understand how /backup/list works. I have a cluster with three ClickHouse servers, each of which is running clickhouse-backup as a systemd service with the REST API server. They are configured with the same service file and environment variables pointing them at S3. I have taken a remote backup on the first replica (clickhouse101), and when I list the backups, whether by querying system.backup_list, running clickhouse-backup list, or calling the API endpoint, I see that new backup. However, on the other two nodes (clickhouse102 and 103) I do not see the new backup. Should the second and third nodes be seeing that backup in S3, or am I misunderstanding?
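
For reference, these are the three listing methods being compared on each node; the API address below assumes the default localhost:7171 listen setting:

# 1. CLI: lists local backups plus what clickhouse-backup can see in remote storage
clickhouse-backup list

# 2. REST API exposed by the clickhouse-backup server
curl -s http://localhost:7171/backup/list

# 3. Integration table inside ClickHouse, available when the API server's integration tables are enabled
clickhouse-client -q "SELECT name, created FROM system.backup_list"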

BoweFlex commented 6 months ago

I added the secret line to the remote_servers definition and restarted the service on all three servers, but I still receive the following error:

[jbowe@l-clickhouse101 ~](DEV)$ clickhouse-client -q 'SELECT VERSION()' -h l-clickhouse102.wdn.clarkinc.io
Code: 516. DB::Exception: Received from l-clickhouse102.wdn.clarkinc.io:9000. DB::Exception: default: Authentication failed: password is incorrect, or there is no user with such name.

If you have installed ClickHouse and forgot password you can reset it in the configuration file.
The password for default user is typically located at /etc/clickhouse-server/users.d/default-password.xml
and deleting this file will reset the password.
See also /etc/clickhouse-server/users.xml on the server where ClickHouse is installed.

. (AUTHENTICATION_FAILED)
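
For reference, a minimal sketch of where a cluster secret lives in the remote_servers definition; the cluster name, secret value, and file name here are placeholders, and the shard/replica layout must match the real topology:

cat > /etc/clickhouse-server/config.d/cluster-secret.xml <<'EOF'
<clickhouse>
  <remote_servers>
    <my_cluster> <!-- placeholder cluster name -->
      <secret>change-me</secret> <!-- must be identical on every node -->
      <shard>
        <replica>
          <host>l-clickhouse101.wdn.clarkinc.io</host>
          <port>9000</port>
        </replica>
        <!-- remaining replicas/shards omitted -->
      </shard>
    </my_cluster>
  </remote_servers>
</clickhouse>
EOF
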
Slach commented 6 months ago

Did you add a networks/ip entry for the default user that covers your servers' CIDR (X.X.X.X/Y, your subnetwork)?

Check with: grep -C 20 "X.X.X.X/X" /var/lib/clickhouse/preprocessed_configs/users.xml

Compare it with tcpdump results: sudo tcpdump -i any -w /tmp/clickhouse.pcap host l-clickhouse102.wdn.clarkinc.io and port 9000
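
For context, a minimal sketch of what such a networks restriction for the default user could look like in users.d; the subnet and file name are placeholders and should be replaced with the CIDR that actually covers the three nodes:

cat > /etc/clickhouse-server/users.d/default-networks.xml <<'EOF'
<clickhouse>
  <users>
    <default>
      <networks>
        <ip>10.0.0.0/24</ip> <!-- placeholder subnet -->
      </networks>
    </default>
  </users>
</clickhouse>
EOF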

BoweFlex commented 6 months ago

I updated the default user to allow connections from ::/0, and I can now connect to 102/103 from 101 using clickhouse-client and the default user, and vice versa. However, when I try to run a backup on 101, it seems to hang and make no progress (the longest I've let it sit was 2 hours 45 minutes). Additionally, when I run clickhouse-backup list on 101 I see the following debug message and get a list of backups from our S3 storage:

2024/05/16 13:07:47.872456 debug /tmp/.clickhouse-backup-metadata.cache.S3 load 28 elements logger=s3

But when I run the list command on 102 or 103, I get a debug message that 0 elements are loaded from S3. Configuration is pushed to these servers through Ansible and is identical, so I don't understand why they don't see the same files in our S3 storage.

Slach commented 6 months ago

::/0 is too open. I hope your servers don't allow connections from the public internet. You need to figure this out with your network team; you risk losing your production data.

Slach commented 6 months ago

Use log_level: debug; it will be more informative for watching progress.

> But when I run the list command on 102 or 103 I get a debug message that 0 elements are loaded from s3.

Does your path parameter in the s3 section contain {macro} or something similar?

Are you sure the configurations are identical?
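
One hedged way to compare the effective configuration and macro value on each node (print-config exists in recent clickhouse-backup releases; adjust if your version differs):

# Show the effective log_level and storage paths as clickhouse-backup sees them
clickhouse-backup print-config | grep -E 'log_level|path'

# Show what the {shard} macro resolves to on this particular node
clickhouse-client -q "SELECT getMacro('shard')"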

BoweFlex commented 6 months ago

I understand that ::/0 is too open; I will work on locking that down, but I wanted to quickly rule out networking as the problem first.

log_level is currently set to debug. When the backups take an extremely long time, all of the logging up to that point looks successful, and then nothing else is logged.

You are correct that the path parameter contains {shard}; I apologize, I'm not sure how I missed that before. If I change the configuration on 102 to be "shard-1" instead of "shard-{shard}", I see the backups I was expecting. Given how clickhouse-server and clickhouse-backup work, will I be able to keep a proper full/incremental chain if I remove that shard number from the path and have backups taken on all three nodes? In other words, if a full backup is taken on 101, can an incremental be taken on 102/103, or only on 101?
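
For illustration, the macro substitution is what makes the nodes disagree; assuming clickhouse-backup's env-style configuration, the relevant setting would look roughly like this:

# Each node expands {shard} from its own macros, so nodes whose shard macro
# differs read and write different S3 prefixes and therefore see different backups.
export S3_PATH="backup/shard-{shard}"   # e.g. backup/shard-1 on 101, presumably backup/shard-2 on 102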

Slach commented 6 months ago

If you like, you can store all backups in the same path; just use the shard number in the backup name.
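
A small sketch of what that naming could look like when creating a backup; the name pattern itself is just an example, not a project convention:

# Resolve this node's shard macro and bake it into the remote backup name
SHARD=$(clickhouse-client -q "SELECT getMacro('shard')")
clickhouse-backup create_remote "shard${SHARD}-$(date +%Y-%m-%d-%H-%M-%S)"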

BoweFlex commented 6 months ago

So there wouldn't be any issues with different nodes taking backups in the same chain?

Slach commented 6 months ago

It depends on your backup/restore script.

It will work if you use the shard number in the backup name and check the shard number during restore, with something like: SELECT name FROM system.backup_list WHERE name LIKE concat('shard', getMacro('shard'), '%') ORDER BY created DESC
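
A hedged sketch of the matching restore side, using the same query to pick the newest backup that belongs to this node's shard:

# Newest remote backup whose name starts with this node's shard number
LATEST=$(clickhouse-client -q "SELECT name FROM system.backup_list WHERE name LIKE concat('shard', getMacro('shard'), '%') ORDER BY created DESC LIMIT 1")
clickhouse-backup restore_remote "$LATEST"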

BoweFlex commented 5 months ago

That makes sense. However, I don't think I was very clear with my question. I'm not concerned about "shard2" being able to restore backups from "shard2". I want to confirm that if I take a full backup on "shard1" and then take an incremental backup on "shard2", that will work and can be restored if/when needed.

Slach commented 5 months ago

I don't understand the question. You want to create an incremental backup on shard2 based on the full backup from shard1? Why? That is ridiculous and makes no sense.

shard1 and shard2 contain different parts of your data.

BoweFlex commented 5 months ago

Maybe I am handling my cluster and replication incorrectly, but I don't see the point in having a quorum of servers if I can only automate backups on one of them. If I have an automated backup job running on shard1 and I lose shard1, then backups will no longer be taken. My solution was to run an automated backup job on an external server and point it at my ClickHouse cluster through a load balancer, so that it always reaches a server, though not necessarily the same one. However, this is useless if backups can only be taken on one specific node rather than on any of the three.

If there is a better way to handle this, please let me know.

Slach commented 5 months ago

"shards" means this is isolated data try to figure out how internally works engine=Distributed and engine=ReplicatedMergeTree this is orthogonal components which know nothing about each other.

From my perspective, you are overengineering this.

You can't run clickhouse-backup from a "remote" server. You can try to use the REST API or the system.backup_actions table (which is a wrapper around the REST API).

But pointing it at a whole-cluster load balancer is a bad idea. The load balancer should cover only one shard; then it can work with the --diff-from-remote parameter.
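
A rough sketch of that approach, assuming one load balancer per shard (shard1-lb and the backup names are placeholders), the integration tables enabled, and a version that accepts CLI-style flags inside the action command:

# Ask whichever replica of shard 1 the LB picks to take an incremental remote backup
clickhouse-client -h shard1-lb -q "INSERT INTO system.backup_actions(command) VALUES('create_remote --diff-from-remote=shard1-full-2024-05-16 shard1-incr-2024-05-17')"

# Poll progress/status from the same integration table
clickhouse-client -h shard1-lb -q "SELECT command, status, error FROM system.backup_actions ORDER BY start DESC LIMIT 5"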

BoweFlex commented 5 months ago

That makes sense, thanks for helping me understand this better. Is there a different solution you would recommend to minimize data loss in an emergency?