Closed: BoweFlex closed this issue 6 months ago.
I added that secret line to the remote_servers definition and restarted the service on all three servers, but I still receive the following error:
[jbowe@l-clickhouse101 ~](DEV)$ clickhouse-client -q 'SELECT VERSION()' -h l-clickhouse102.wdn.clarkinc.io
Code: 516. DB::Exception: Received from l-clickhouse102.wdn.clarkinc.io:9000. DB::Exception: default: Authentication failed: password is incorrect, or there is no user with such name.
If you have installed ClickHouse and forgot password you can reset it in the configuration file.
The password for default user is typically located at /etc/clickhouse-server/users.d/default-password.xml
and deleting this file will reset the password.
See also /etc/clickhouse-server/users.xml on the server where ClickHouse is installed.
. (AUTHENTICATION_FAILED)
Did you add a networks/ip entry for the default user with your servers' CIDR (X.X.X.X/Y, your subnetwork)?
Check with:
grep -C 20 "X.X.X.X/X" /var/lib/clickhouse/preprocessed_configs/users.xml
and compare it with the tcpdump results:
sudo tcpdump -i any -w /tmp/clickhouse.pcap host l-clickhouse102.wdn.clarkinc.io and port 9000
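For example, a users.d override that allows the default user only from your subnetwork could look like this (just a sketch; 10.0.0.0/24 is a placeholder for your real CIDR):

<clickhouse>
    <users>
        <default>
            <!-- replace="replace" overrides the default networks instead of merging with ::/0 -->
            <networks replace="replace">
                <ip>10.0.0.0/24</ip>
            </networks>
        </default>
    </users>
</clickhouse>

and check that this is really what ends up in /var/lib/clickhouse/preprocessed_configs/users.xml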
I updated the default user to allow connections from ::/0, and I can now connect to 102/103 from 101 using clickhouse-client and the default user, and vice versa. However, when I try to run a backup on 101 it seems to hang and make no progress (the longest I've let it sit was 2 hours 45 minutes). Additionally, when I run clickhouse-backup list on 101 I see the following debug message and get a list of backups from our S3 storage:
2024/05/16 13:07:47.872456 debug /tmp/.clickhouse-backup-metadata.cache.S3 load 28 elements logger=s3
But when I run the list command on 102 or 103, I get a debug message that 0 elements are loaded from S3. Configuration is pushed to these servers through Ansible and is identical, so I don't understand why they don't see the same files in our S3 storage.
::/0 is too open.
I hope your servers don't allow connections from the public internet.
You need to figure this out with your network team; you risk losing your production.

Use log_level: debug, it could be more informative for watching progress.
About "when I run the list command on 102 or 103 I get a debug message that 0 elements are loaded from s3":
does your path parameter in the s3 section contain {macro} or something similar?
Are you sure the configurations are identical?
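For reference, a clickhouse-backup config where the s3 path contains a macro looks roughly like this (a sketch; bucket, region, and path are placeholders):

general:
  remote_storage: s3
  log_level: debug
s3:
  bucket: my-backup-bucket
  region: us-east-1
  path: backup/shard-{shard}

{shard} is expanded from the ClickHouse macros on each node, so if the macro value differs between servers they will read and write different prefixes in the bucket. You can check the value on every node with SELECT * FROM system.macros.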
I understand ::/0 is too open; I will work on locking that down, but I wanted to quickly rule out networking as the problem first.
log_level is already set to debug. When a backup takes an extremely long time, everything that gets logged looks successful, and then nothing else is logged.
You are correct that the path parameter contains {shard}, I apologize. I'm not sure how I missed that before. If I change the configuration on 102 to be "shard-1" instead of "shard-{shard}", I see the backups I was expecting. With how clickhouse-server and clickhouse-backup work, will I be able to keep a proper full/incremental chain if I remove that shard number from the path and have backups taken on all three nodes? In other words, if a full backup is taken on 101, can an incremental be taken on 102/103, or only on 101?
If you like, you can store all backups in the same path.
You can use the shard number in the backup name.
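Something like this (a sketch; the naming scheme is only an example):

# put the shard number from the ClickHouse macros into the backup name
SHARD=$(clickhouse-client -q "SELECT getMacro('shard')")
clickhouse-backup create_remote "shard${SHARD}-full-$(date +%Y-%m-%d)"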
So there wouldn't be any issues with different nodes taking backups in the same chain?
It depends on your backup/restore script:
if you use the shard number in the backup name and check the shard number during restore,
something like this:
SELECT name FROM system.backup_list WHERE name LIKE concat('shard', getMacro('shard'), '%') ORDER BY created DESC
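and on restore pick the newest backup that matches the local shard, something like (a sketch):

SHARD=$(clickhouse-client -q "SELECT getMacro('shard')")
LATEST=$(clickhouse-client -q "SELECT name FROM system.backup_list WHERE name LIKE 'shard${SHARD}-%' ORDER BY created DESC LIMIT 1")
clickhouse-backup restore_remote "$LATEST"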
That makes sense. However, I don't think I was very clear with my question. I'm not concerned about "shard2" being able to restore backups from "shard2". I want to confirm that if I take a full backup on "shard1" and then take an incremental backup on "shard2", that will work and can be restored if/when needed.
I don't understand the question.
You want to create an incremental backup on shard2 based on the full backup from shard1?
Why? That makes no sense: shard1 and shard2 contain different parts of your data.
Maybe I am handling my cluster and replication incorrectly, but I don't see the point in having a quorum of servers if I can only automate backups on one of those servers. If I have an automated backup job running on shard1, and I lose shard1, then backups will not continue to be taken. My solution to this was to run an automated backup job on an external server, and point that to my clickhouse cluster through a load balancer, so that it will always reach a server, but not necessarily the same one. However, this is useless if backups can't actually be taken on any of the three nodes and can only be taken on one of them.
If there is a better way to handle this please let me know.
"shards" means this is isolated data
try to figure out how internally works engine=Distributed
and engine=ReplicatedMergeTree
this is orthogonal components which know nothing about each other.
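For illustration only (table and cluster names are made up): the usual pattern is a replicated local table per shard plus a Distributed table on top, which stores no data itself and only routes queries across shards:

-- holds this shard's slice of the data, replicated between the shard's replicas
CREATE TABLE events_local ON CLUSTER my_cluster
(
    event_date Date,
    user_id UInt64,
    payload String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
ORDER BY (event_date, user_id);

-- routes reads/writes across all shards, stores nothing itself
CREATE TABLE events ON CLUSTER my_cluster AS events_local
ENGINE = Distributed(my_cluster, currentDatabase(), events_local, rand());

So a backup taken on shard1 simply does not contain shard2's parts.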
From my perspective, you are over-engineering this.
You can't run clickhouse-backup on a "remote" server; you can try the REST API or the system.backup_actions table (which is a wrapper around the REST API).
But pointing it at a load balancer for the whole cluster is a bad idea: the load balancer should cover only one shard. Then it can work with the --diff-from-remote parameter.
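For example (a sketch; backup names are only examples), a full backup followed by an incremental on the same shard:

clickhouse-backup create_remote "shard1-full-2024-05-16"
clickhouse-backup create_remote --diff-from-remote="shard1-full-2024-05-16" "shard1-incr-2024-05-17"

and if you drive it through the API wrapper, the same command string goes into system.backup_actions, something like INSERT INTO system.backup_actions(command) VALUES('create_remote --diff-from-remote=shard1-full-2024-05-16 shard1-incr-2024-05-17').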
That makes sense, thanks for helping me understand this better. Is there a different solution you would recommend to minimize data loss in an emergency?
Trying to understand how /backup/list works: I have a cluster with three ClickHouse servers, each of which is running clickhouse-backup as a systemd service with the REST API server. They are configured with the same service file and environment variables pointing them to S3. I have taken a remote backup on the first replica (clickhouse101), and when I list the backups, whether by querying system.backup_list, running clickhouse-backup list, or calling the API endpoint, I see that new backup. However, on the other two nodes (clickhouse102 and 103) I do not see the new backup. Should the second and third nodes be seeing that backup in S3, or am I misunderstanding?