berkeli opened this issue 1 year ago
Add logging to /mnt/consul/config/consul-config.json:
"log_file":"/var/log/consul/",
"log_json":true
After adding this, I tried to run consul via the following command:
consul agent -config-dir /mnt/consul/config/
But I received an insufficient permission error for /var/log/consul/.
To resolve this, I did the following:
sudo mkdir /var/log/consul
(couldn't create it without sudo)
Changed the ownership to consul:consul via sudo chown consul:consul /var/log/consul
sudo chmod u+w,g+w,o+w /var/log/consul
(I tried just the user, and user + group as well, but no luck; I was still getting the permission denied error.)
After these commands consul started and I could see logs being created.
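A quick sanity check we could have run at this point (not from the original log) to confirm the directory was actually writable by the consul user:

ls -ld /var/log/consul
sudo -u consul touch /var/log/consul/write-test.log && echo writable

If the agent really runs as the consul user, owner write permission should be enough; needing o+w usually hints that the process is running as some other user.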
After this, we decided to validate the consul configuration file:
consul validate /mnt/consul/config/
Config validation failed: performance.raft_multiplier cannot be 11. Must be between 1 and 10
After some research I set it to 5, which is the default. This is a timeout multiplier, so it doesn't matter that much (it doesn't relate to the number of peers in any way).
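In consul-config.json this ends up as (a sketch, other keys omitted):

{
  "performance": {
    "raft_multiplier": 5
  }
}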
We ran consul validate again:
consul validate /mnt/consul/config/
The 'acl.tokens.master' field is deprecated. Use the 'acl.tokens.initial_management' field instead.
bootstrap_expect = 2: A cluster with 2 servers will provide no failure tolerance. See https://www.consul.io/docs/internals/consensus.html#deployment-table
bootstrap_expect > 0: expecting 2 servers
Configuration is valid!
Here we updated the bootstrap_expect to 3 since we will have 3 consul servers.
There's also a warning for deprecation, but we decided to leave it for now
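For reference, the two related settings look roughly like this in consul-config.json; the initial_management key is the rename the deprecation warning asks for, with <management token> as a placeholder for the real value:

{
  "bootstrap_expect": 3,
  "acl": {
    "tokens": {
      "initial_management": "<management token>"
    }
  }
}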
When comparing consul-config.json files we noticed that one of them had default_policy = allow. Documentation explains the following:
Note: If bootstrapping ACLs on an existing datacenter, remember to update the default policy to default_policy = deny and initiate another rolling restart after applying the token.
We provide tokens in all the config files, therefore we changed it to "deny".
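After the change, the ACL block looks roughly like this (a sketch; the enabled flag is assumed here since tokens were already in use):

{
  "acl": {
    "enabled": true,
    "default_policy": "deny"
  }
}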
After making these changes, we ran consul agent on all instances
consul agent -config-dir=/mnt/consul/config/
Consul started successfully and we could monitor the logs via consul monitor.
At this point, all of the agents were started, but when we ran consul members
it gave us a list of nodes that kept flipping between alive
and failed.
We tried to troubleshoot this via different methods:
The IPs looked weird, and we (or I, to be correct) decided to try the public IPs for all instances. This led us into a zone where we were trying different combinations and researching ACLs and such, but it was all pointless. We even went as far as checking the security group in AWS in the hope of finding what was causing the issue.
After a few hours, we realised that we should switch back to the private IPs, which we did. This didn't resolve anything at first, but then we found out that consul reload
doesn't reload settings like IP bindings, so the agent needs a full restart for those to take effect.
After this, we updated the retry_join
addresses so that each node only lists the IPs of the other 2 nodes.
"retry_join": ["172.31.53.195", "172.31.53.179"],
We also added the following options to make sure bind_addr and advertise_addr are set correctly:
"advertise_addr": "172.31.53.195",
"bind_addr": "172.31.53.195",
"client_addr": "127.0.0.1 172.31.53.195",
After setting all of these and restarting the instances (sudo reboot) we were able to get a list of members, and leader election started working correctly.
consul members
Node Address Status Type Build Protocol DC Partition Segment
consul1 172.31.87.243:8301 alive server 1.14.3 2 cyf5 default <all>
consul2 172.31.53.195:8301 alive server 1.14.3 2 cyf5 default <all>
consul3 172.31.53.179:8301 alive server 1.14.3 2 cyf5 default <all>
curl localhost:8500/v1/status/leader
"172.31.87.243:8300"
Now that the servers were sorted, we started working on clients.
We added logging as before and then tried to run the clients. We also added "advertise_addr" for each client.
The service.json
file for one of the clients had a syntax error (extra comma) and was missing a token.
Added it: "token": "a15f4b82-4d0f-1a91-4a5b-8b27285dc13d"
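For reference, the corrected service.json would look roughly like this; the service name and port are made up for illustration, only the token is the real one from above:

{
  "service": {
    "name": "service1",
    "port": 8080,
    "token": "a15f4b82-4d0f-1a91-4a5b-8b27285dc13d"
  }
}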
We then ran sudo reboot to reload the system. That doesn't seem like the healthiest solution, but nothing else seemed to work.
On one of the servers, we then checked consul members to see if the clients were visible.
The problem we had is that consul members showed 2 clients. But Client 1 appeared with the status failed and Client 2 with the status alive.
If we ran systemctl status consul
on client 1, it showed as active, but client 2 was inactive.
That was a real mismatch.
@berkeli do you have anything else to add to what we tried before registering services?
@RitaGlushkova this was the weird part, even now systemctl status consul
is showing the services are dead.
systemctl status consul
● consul.service - Consul Agent
Loaded: loaded (/etc/systemd/system/consul.service; disabled; vendor preset: disabled)
Active: inactive (dead)
I think the main problem was that when we ran consul members
it was showing client 1 but wasn't showing client 2. To resolve this, we had to reboot and run consul agent -config-dir=/mnt/consul/config/
Unfortunately, due to rebooting and switching terminals, we lost some of the exact logs, but we had a message saying that the services were not registered with the clients.
Looking at the documentation we have the following:
The services register command registers a service with the local agent. This command returns after registration and must be paired with explicit service deregistration. This command simplifies service registration from scripts, in dev mode, etc.
We decided to run this command with our service.json file, since that is where our service is defined:
consul services register -token=<token> /mnt/consul/config/service.json
After multiple reboots, and finally adding sudo and the -token flag to consul catalog services, we managed to register all the services and see them printed out.
@berkeli please correct me if I am forgetting something.
Yes, this is correct. Once we had the clients listed as members and alive in the consul members output, we started looking into consul catalog services.
To register the services on the clients, we used the command consul services register /mnt/consul/config/service.json.
This gave us an ACL permission denied error. To resolve that, we supplied the ACL token provided in consul-config.json
for the agent: consul services register -token=<token> /mnt/consul/config/service.json.
This was successfully registering services, but when we ran consul catalog services
it was still showing us nothing (no errors even).
I thought it might be due to the ACL token again and tried running the following consul catalog services -token=<token>
which finally listed services:
consul catalog services -token=5e84e364-d64c-43d0-9f37-3a834587f29f
consul
service1
service2
@RitaGlushkova I have looked into the ACL tokens this morning and there's a better way of doing it so we don't need to provide it on every command:
export CONSUL_HTTP_TOKEN=5e84e364-d64c-43d0-9f37-3a834587f29f
consul acl set-agent-token agent "5e84e364-d64c-43d0-9f37-3a834587f29f"
(this command fails without the env variable). After running these 2 commands, we can run consul catalog services (and other commands) without supplying the token flag:
consul catalog services
consul
service1
service2
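One caveat worth noting (from the Consul docs, not something we hit here): the token set via consul acl set-agent-token only survives agent restarts if token persistence is enabled, so an alternative is to put the agent token straight into consul-config.json:

{
  "acl": {
    "enable_token_persistence": true,
    "tokens": {
      "agent": "5e84e364-d64c-43d0-9f37-3a834587f29f"
    }
  }
}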
To summarize the working setup:

Server config changes:
- retry_join option populated with the private IP addresses of the other consul servers in the cluster
- bind_addr, advertise_addr options populated with the node's own private IP address
- performance.raft_multiplier set to 5
- bootstrap_expect set to 3

Commands run on the servers:
- consul agent -config-dir=/mnt/consul/config/
- export CONSUL_HTTP_TOKEN=<token provided in consul-config.json file>
- consul acl set-agent-token agent "<token provided in consul-config.json file>"

Client config changes:
- advertise_addr, client_addr options set

Commands run on the clients:
- consul agent -config-dir=/mnt/consul/config/
- consul services register /mnt/consul/config/service.json

Logging (all servers):
- "log_file":"/var/log/consul/", "log_json":true options added to enable logging
- sudo mkdir /var/log/consul/
- sudo chown consul:consul /var/log/consul
- sudo chmod u+w,g+w,o+w /var/log/consul
Thank you @berkeli. I also read about tokens last night, but because my IP address changed, I couldn't test it. Thank you for trying it, I'm happy it works! The summary looks good.
Is there a reason you always started consul from commandline with consul agent -config-dir=/mnt/consul/config/
and not using systemd service that was already configured on the system?
Yes, when I ran it with systemctl it gave an error:
systemctl start consul
Failed to start consul.service: The name org.freedesktop.PolicyKit1 was not provided by any .service files
See system logs and 'systemctl status consul.service' for details.
A similar error also appeared when running systemctl restart or systemctl stop.
I think it might have been because we didn't run it with sudo; if it had worked, we wouldn't have had to reboot as often either.
You don't need sudo, but it should be consul.service instead of consul, as mentioned in the error message.
Edit: it works even without .service, ignore me.
[berkeli@consul-server3 ~]$ systemctl status consul.service
● consul.service - Consul Agent
Loaded: loaded (/etc/systemd/system/consul.service; disabled; vendor preset: disabled)
Active: inactive (dead)
[berkeli@consul-server3 ~]$
Ok, overall great work on getting the expected result :)
Couple of things to know and think about:
The consul config for these systems is in /etc/consul.d, and the consul process on each of them is managed by systemd. You can see the systemd config in /etc/systemd/system/consul.service; this file tells systemd how to manage the consul process. This way you didn't really need to start the consul agent by hand: you edit/change the config and just run systemctl restart/status/start consul.service.
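For reference, a consul.service unit typically looks something like the sketch below; this is illustrative only, the actual unit on these hosts may differ (for example it presumably points at /etc/consul.d rather than /mnt/consul/config):

[Unit]
Description=Consul Agent
Requires=network-online.target
After=network-online.target

[Service]
User=consul
Group=consul
ExecStart=/usr/bin/consul agent -config-dir=/etc/consul.d/
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target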
[ec2-user@consul-server3 ~]$ systemctl status consul.service
● consul.service - Consul Agent
Loaded: loaded (/etc/systemd/system/consul.service; disabled; vendor preset: disabled)
Active: inactive (dead)
[ec2-user@consul-server3 ~]$
The only reason this exercise worked is because you started the agent by hand. See the 2 processes?
[ec2-user@consul-server3 ~]$ ps aux|grep consul |grep agent
root 6557 0.0 0.7 239804 7228 ? S Jan17 0:00 sudo consul agent -config-dir /mnt/consul/config/
root 6558 0.7 9.8 818248 97744 ? Sl Jan17 20:10 consul agent -config-dir /mnt/consul/config/
[ec2-user@consul-server3 ~]$
I would call this a ~hack~ workaround :D
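To hand the process back to systemd, the hand-started agent would first need to be stopped, roughly like this (a suggestion, not something done in the thread; the PID comes from the ps output above):

# stop the manually started agent
sudo kill 6558
# then let systemd own the process from now on
sudo systemctl enable --now consul.service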
After some research I set it to 5, which is the default. This is a timeout multiplier, so it doesn't matter that much (it doesn't relate to the number of peers in any way).
This seems like a config option that shouldn't
matter much but in reality Slack had an outage because we had this field misconfigured across servers :D
https://docs.google.com/document/d/1V6HEu_OcJ3MHH-aHzUfANf06VJa1rPcGHcpBwql7QLA/edit#
Troubleshoot consul as per the doc