aws-ia / cfn-ps-clickhouse-cluster


The ClickHouse instances do not seem to be able to connect to the ZooKeeper nodes #43

Closed mohammadQBNL closed 2 months ago

mohammadQBNL commented 3 months ago

Looking at the ClickHouse instance server logs, I get this:

[   59.857328] capability: warning: `clickhouse-serv' uses 32-bit capabilities (legacy support in use)
[   94.464903] cloud-init[2655]: Received exception from server (version 23.3.8):
[   94.467157] cloud-init[2655]: Code: 999. DB::Exception: Received from localhost:9000. Coordination::Exception. Coordination::Exception: Connection loss, path: All connection tries failed while connecting to ZooKeeper. nodes: 172.31.34.64:2181, 172.31.39.171:2181, 172.31.42.56:2181
[   94.471162] cloud-init[2655]: Poco::Exception. Code: 1000, e.code() = 0, Timeout: connect timed out: 172.31.34.64:2181 (version 23.3.8.21 (official build)), 172.31.34.64:2181
[   94.471913] cloud-init[2655]: Poco::Exception. Code: 1000, e.code() = 0, Timeout: connect timed out: 172.31.39.171:2181 (version 23.3.8.21 (official build)), 172.31.39.171:2181
[   94.472074] cloud-init[2655]: Poco::Exception. Code: 1000, e.code() = 0, Timeout: connect timed out: 172.31.42.56:2181 (version 23.3.8.21 (official build)), 172.31.42.56:2181
[   94.472208] cloud-init[2655]: Poco::Exception. Code: 1000, e.code() = 0, Timeout: connect timed out: 172.31.34.64:2181 (version 23.3.8.21 (official build)), 172.31.34.64:2181
[   94.472333] cloud-init[2655]: Poco::Exception. Code: 1000, e.code() = 0, Timeout: connect timed out: 172.31.39.171:2181 (version 23.3.8.21 (official build)), 172.31.39.171:2181
[   94.472456] cloud-init[2655]: Poco::Exception. Code: 1000, e.code() = 0, Timeout: connect timed out: 172.31.42.56:2181 (version 23.3.8.21 (official build)), 172.31.42.56:2181
[   94.472580] cloud-init[2655]: Poco::Exception. Code: 1000, e.code() = 0, Timeout: connect timed out: 172.31.34.64:2181 (version 23.3.8.21 (official build)), 172.31.34.64:2181
[   94.472702] cloud-init[2655]: Poco::Exception. Code: 1000, e.code() = 0, Timeout: connect timed out: 172.31.39.171:2181 (version 23.3.8.21 (official build)), 172.31.39.171:2181
[   94.472822] cloud-init[2655]: Poco::Exception. Code: 1000, e.code() = 0, Timeout: connect timed out: 172.31.42.56:2181 (version 23.3.8.21 (official build)), 172.31.42.56:2181
[   94.472943] cloud-init[2655]: . (KEEPER_EXCEPTION)
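
As a sanity check independent of ClickHouse, raw TCP reachability of the ZooKeeper client port can be probed from a ClickHouse node. The sketch below is only an illustration, not part of the Quick Start; it assumes Python 3 is available on the instance and reuses the node IPs from the exception above.

```python
import socket

# ZooKeeper node IPs taken from the exception above; 2181 is the default client port.
ZK_NODES = ["172.31.34.64", "172.31.39.171", "172.31.42.56"]

for host in ZK_NODES:
    try:
        # Plain TCP connect with a short timeout -- no ZooKeeper protocol involved.
        with socket.create_connection((host, 2181), timeout=3):
            print(f"{host}:2181 reachable")
    except OSError as exc:
        # A timeout here points at security groups / subnet routing rather than ZooKeeper itself.
        print(f"{host}:2181 NOT reachable: {exc}")
```

If these plain connects also time out, the problem is in the security groups or subnet routing rather than in ZooKeeper itself.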

Params used:

| Parameter | Value | Resolved value |
| --- | --- | --- |
| AccessCIDR | 0.0.0.0/0 | - |
| AlarmEmail | mohammad@quantboxtrading.com | - |
| Architecture | X86 | - |
| BastionAMIOS | Amazon-Linux2-HVM | - |
| BastionInstanceType | t2.micro | - |
| ClickHouseDeviceName | /dev/xvdh | - |
| ClickHouseInstanceType | m5.xlarge | - |
| ClickHouseIops | 1000 | - |
| ClickHouseNodeCount | 2 | - |
| ClickHousePkgS3URI | none | - |
| ClickHouseTimezone | Asia/Kolkata | - |
| ClickHouseVersion | 23.3.8.21 | - |
| ClickHouseVolumeSize | 500 | - |
| ClickHouseVolumeType | gp2 | - |
| DemoDataSize | small | - |
| DistributedProductMode | global | - |
| GrafanaVersion | 8.0.1-1 | - |
| KeyPairName | clickhouse | - |
| LatestAmiId | /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2 | ami-09859101d317198f9 |
| LoadBalancing | nearest_hostname | - |
| MaxDataPartSize | 1073741824 | - |
| MaxInsertThreads | 4 | - |
| MaxMemoryUsage | 10737418240 | - |
| MaxThreads | 8 | - |
| MoveFactor | 0.3 | - |
| NumBastionHosts | 1 | - |
| Port | 8123 | - |
| PrivateSubnet1AID | subnet-0b828b7f76af6ee2d | - |
| PrivateSubnet2AID | subnet-0b828b7f76af6ee2d | - |
| PublicSubnet1ID | subnet-05b7a9f2b75c4eecc | - |
| PublicSubnet2ID | subnet-0ad76a414852fbbb9 | - |
| QSS3BucketName | aws-ia | - |
| QSS3BucketRegion | us-east-1 | - |
| QSS3KeyPrefix | cfn-ps-clickhouse-cluster/ | - |
| RemoteAccessCIDR | 0.0.0.0/0 | - |
| SingleAvailableZone | 1az | - |
| VPCCIDR | 10.0.0.0/16 | - |
| VPCID | vpc-0dd2a588c6e22a08e | - |
| ZookeeperDeviceName | /dev/xvdh | - |
| ZookeeperInstanceType | m5.large | - |
| ZookeeperIops | 1000 | - |
| ZookeeperNodeCount | 3 | - |
| ZookeeperVersion | 3.8.2 | - |
| ZookeeperVolumeSize | 500 | - |
| ZookeeperVolumeType | gp2 | - |

I'm also able to access the bastion host, all 3 ZooKeeper nodes, and both ClickHouse instances. clickhouse-server is running on the ClickHouse instances, and the ZooKeeper services are running as well.

Worth noting: this only happens when deploying into an existing VPC; a new VPC created from scratch worked fine.
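
For the existing-VPC case specifically, one thing worth verifying from outside the stack is whether the private subnets actually have a default route to a NAT gateway. A rough boto3 sketch under that assumption (subnet ID copied from the parameters above; AWS credentials and region assumed to be configured) could look like this:

```python
import boto3

ec2 = boto3.client("ec2")

# Private subnet ID(s) taken from the stack parameters above.
private_subnets = ["subnet-0b828b7f76af6ee2d"]

for subnet_id in private_subnets:
    # Only explicit route-table associations match this filter; an empty result
    # means the subnet falls back to the VPC's main route table.
    resp = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )
    routes = [r for rt in resp["RouteTables"] for r in rt["Routes"]]
    has_nat_default = any(
        r.get("DestinationCidrBlock") == "0.0.0.0/0" and "NatGatewayId" in r
        for r in routes
    )
    print(subnet_id,
          "explicit route table found:", bool(resp["RouteTables"]),
          "| 0.0.0.0/0 via NAT gateway:", has_nat_default)
```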

mohammadQBNL commented 2 months ago

Fixed. I had to set up the private/public subnets with their NAT gateways and route tables, and I also had the wrong VPC CIDR set (I was leaving it at the default, but my existing VPC was using a different CIDR block).
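
For anyone else hitting this: the CIDR mismatch part is easy to confirm by comparing the VPCCIDR parameter with what the VPC actually reports, for example with a small boto3 check like the illustrative sketch below (VPC ID and CIDR copied from the parameters above; credentials and region assumed to be configured):

```python
import boto3

VPC_ID = "vpc-0dd2a588c6e22a08e"   # VPCID parameter from above
PARAM_CIDR = "10.0.0.0/16"         # VPCCIDR parameter from above (the default value)

ec2 = boto3.client("ec2")
vpc = ec2.describe_vpcs(VpcIds=[VPC_ID])["Vpcs"][0]
actual_cidr = vpc["CidrBlock"]

if actual_cidr != PARAM_CIDR:
    print(f"Mismatch: VPCCIDR parameter is {PARAM_CIDR}, but the VPC reports {actual_cidr}")
else:
    print(f"OK: VPC CIDR matches the parameter ({actual_cidr})")
```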