berkeli / immersive-go


Troubleshooting exercise - Consul #66

Open berkeli opened 1 year ago

berkeli commented 1 year ago

https://docs.google.com/document/d/1V6HEu_OcJ3MHH-aHzUfANf06VJa1rPcGHcpBwql7QLA/edit#

Troubleshoot Consul as per the doc.

berkeli commented 1 year ago

.

berkeli commented 1 year ago

Add logging to /mnt/consul/config/consul-config.json:

"log_file":"/var/log/consul/",
"log_json":true

After adding this, I tried to run Consul via the following command: consul agent -config-dir /mnt/consul/config/

But I received an insufficient-permission error for /var/log/consul/.

To resolve this, I did the following:

  1. sudo mkdir /var/log/consul (couldn't create it without sudo)
  2. Changed the folder's ownership to consul:consul via sudo chown consul:consul /var/log/consul
  3. Changed the permissions on the folder via sudo chmod u+w,g+w,o+w /var/log/consul (I tried just the user, then user + group as well, but no luck; I was still getting a permission-denied error)

After these commands, Consul started and I could see logs being created.

berkeli commented 1 year ago

After we decided to validate the consul configuration file:

consul validate /mnt/consul/config/
Config validation failed: performance.raft_multiplier cannot be 11. Must be between 1 and 10

After some research I set it to 5, which is the default. This is a timeout multiplier for Raft timing, so the exact value doesn't matter that much here (it doesn't relate to the number of peers in any way).
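
For reference, the corrected block in consul-config.json would simply be (5 being Consul's documented default for this multiplier):

"performance": {
  "raft_multiplier": 5
}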

berkeli commented 1 year ago

We ran consul validate again:

consul validate /mnt/consul/config/
The 'acl.tokens.master' field is deprecated. Use the 'acl.tokens.initial_management' field instead.
bootstrap_expect = 2: A cluster with 2 servers will provide no failure tolerance. See https://www.consul.io/docs/internals/consensus.html#deployment-table
bootstrap_expect > 0: expecting 2 servers
Configuration is valid!

Here we updated bootstrap_expect to 3, since we will have 3 Consul servers.

There's also a deprecation warning, but we decided to leave that for now.
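
A sketch of the relevant fields after this change; the second part only shows what fixing the deprecation warning would look like (the same token value, just under the non-deprecated acl.tokens.initial_management name), which we haven't actually applied yet:

"bootstrap_expect": 3,
"acl": {
  "tokens": {
    "initial_management": "<existing master token>"
  }
}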

berkeli commented 1 year ago

When comparing the consul-config.json files, we noticed that one of them had default_policy = allow. The documentation explains the following:

Note: If bootstrapping ACLs on an existing datacenter, remember to update the default policy to default_policy = deny and initiate another rolling restart after applying the token. Since we provide tokens in all the config files, we need to change it to "deny".
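
So the ACL section of each consul-config.json should end up roughly like this (the enabled flag and the exact token fields are assumptions based on the docs, not copied from the actual files):

"acl": {
  "enabled": true,
  "default_policy": "deny",
  "tokens": {
    "agent": "<token from consul-config.json>"
  }
}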

After making these changes, we ran the Consul agent on all instances:

consul agent -config-dir=/mnt/consul/config/

Consul started successfully, and we could monitor logs via consul monitor.

berkeli commented 1 year ago

At this point, all of the server agents were started, but when we ran consul members it gave us a list of members that were constantly changing state from alive to failed.

We tried to troubleshoot this via different methods:

Mistake 1: Changing to public IPs.

The IPs looked weird, and we (or I, to be correct) decided to try the public IPs for all instances. This led us down a rabbit hole of trying different things and combinations and researching ACLs, but it was all pointless. We even went as far as checking the security group in AWS in the hope of finding what was causing the issue.

After a few hours, we realised that we should switch back to private IPs, which we did. This didn't resolve anything at first, but then we found out that consul reload doesn't reload settings such as the IP bindings, so the new addresses only take effect after a full agent restart.
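
In other words, consul reload only re-reads a subset of options (service and check definitions, log level, and so on); bind/advertise address changes need a full agent restart. Roughly:

consul reload                             # picks up only the reloadable options
sudo systemctl restart consul.service     # or stop the agent and re-run consul agent -config-dir=... for address changes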

berkeli commented 1 year ago

After this, we updated the retry_join addresses so that each node lists only the IPs of the other 2 nodes:

"retry_join": ["172.31.53.195", "172.31.53.179"],

We also added the following options to make sure bind_addr, advertise_addr and client_addr are set correctly:

"advertise_addr": "172.31.53.195",
"bind_addr": "172.31.53.195",
"client_addr": "127.0.0.1 172.31.53.195",

After setting all of these and restarting the instances (sudo reboot), we were able to get a list of members, and leader election started working correctly.

consul members
Node     Address             Status  Type    Build   Protocol  DC    Partition  Segment
consul1  172.31.87.243:8301  alive   server  1.14.3  2         cyf5  default    <all>
consul2  172.31.53.195:8301  alive   server  1.14.3  2         cyf5  default    <all>
consul3  172.31.53.179:8301  alive   server  1.14.3  2         cyf5  default    <all>
curl localhost:8500/v1/status/leader
"172.31.87.243:8300"

Now that the servers were sorted, we started working on clients.

RitaGlushkova commented 1 year ago

We added logging as before and then tried to run the clients. We also added "advertise_addr" for each client.

The service.json file for one of the clients had a syntax error (extra comma) and was missing a token.

Added it: "token": "a15f4b82-4d0f-1a91-4a5b-8b27285dc13d"
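
For reference, a minimal service.json in this shape, with the comma fixed and the token added, would look something like the sketch below (the service name and port here are placeholders, not the real values from the exercise):

{
  "service": {
    "name": "service1",
    "port": 8080,
    "token": "a15f4b82-4d0f-1a91-4a5b-8b27285dc13d"
  }
}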

RitaGlushkova commented 1 year ago

We ran sudo reboot to reload the system. It doesn't seem like the healthiest solution, but nothing else seemed to work.

On one of the servers, we then checked consul members to see if the clients were visible.

The problem was that consul members showed 2 clients, but client 1 appeared with the status failed and client 2 with the status alive.

If we ran systemctl status consul on client 1, it showed as active, but on client 2 it showed as inactive. That was a real mismatch.

@berkeli do you have anything else to add to what we tried before registering services?

@RitaGlushkova this was the weird part: even now systemctl status consul is showing the service as dead.

systemctl status consul
● consul.service - Consul Agent
   Loaded: loaded (/etc/systemd/system/consul.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

I think the main problem was that when we ran consul members it was showing client 1 but not client 2. To resolve it, we had to reboot and run consul agent -config-dir=/mnt/consul/config/

RitaGlushkova commented 1 year ago

Unfortunately, due to rebooting and switching terminals, we lost some of the exact logs, but we had a message saying that the services were not registered with the clients.

Looking at the documentation we have the following:

The services register command registers a service with the local agent. This command returns after registration and must be paired with explicit service deregistration. This command simplifies service registration from scripts, in dev mode, etc.

We decided to run this command with the service.json file, as this is where our service is defined: consul services register -token=<token> /mnt/consul/config/service.json

After multiple reboots, and finally adding sudo and the -token flag to consul catalog services, we managed to register all the services and see them printed out.

@berkeli please correct me if I am forgetting something.

Yes, this is correct. Once we had the clients listed as members and alive in the consul members output, we started looking into consul catalog services.

To register the services with the client agents, we used the command consul services register /mnt/consul/config/service.json. This gave us an ACL permission denied error. To resolve it, we supplied the ACL token provided in consul-config.json for the agent: consul services register -token=<token> /mnt/consul/config/service.json.

This successfully registered the services, but when we ran consul catalog services it still showed us nothing (not even errors).

I thought it might be due to the ACL token again and tried running consul catalog services -token=<token>, which finally listed the services:

consul catalog services -token=5e84e364-d64c-43d0-9f37-3a834587f29f
consul
service1
service2

@RitaGlushkova I looked into the ACL tokens this morning and there's a better way of doing it, so we don't need to provide the token on every command:

  1. Export an environment variable with the token: export CONSUL_HTTP_TOKEN=5e84e364-d64c-43d0-9f37-3a834587f29f
  2. Set the default ACL token for the Consul agent: consul acl set-agent-token agent "5e84e364-d64c-43d0-9f37-3a834587f29f" (this command fails without the env variable)

After running these 2 commands, we can run consul catalog services (and other commands) without supplying the token flag:

consul catalog services
consul
service1
service2

berkeli commented 1 year ago

SUMMARY

Servers:

  1. Ensure the retry_join option is populated with the private IP addresses of the other Consul servers in the cluster
  2. Add the bind_addr and advertise_addr options populated with the node's own private IP address
  3. Set performance.raft_multiplier to 5
  4. Set bootstrap_expect to 3
  5. Start the agent via consul agent -config-dir=/mnt/consul/config/

Clients:

  1. Add the default ACL token to Consul:
    • export CONSUL_HTTP_TOKEN=<token provided in consul-config.json file>
    • consul acl set-agent-token agent "<token provided in consul-config.json file>"
  2. Ensure the correct private IP address is set for advertise_addr and client_addr
  3. Start the agent via consul agent -config-dir=/mnt/consul/config/
  4. Register services via consul services register /mnt/consul/config/service.json

Logging on all nodes:

  1. Add "log_file":"/var/log/consul/", "log_json":true options to all servers to enable logging
  2. Create log folder for consul via sudo mkdir /var/log/consul/
  3. Change owner of logging folder to consul sudo chown consul:consul /var/log/consul
  4. Assign permissions to write for logging folder sudo chmod u+w,g+w,o+w /var/log/consul
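
Putting the server-side points together, a consolidated consul-config.json for one of the servers (consul2 here, using the addresses and datacenter from the earlier output; fields not discussed in this issue are omitted) would look roughly like:

{
  "server": true,
  "datacenter": "cyf5",
  "bootstrap_expect": 3,
  "retry_join": ["172.31.87.243", "172.31.53.179"],
  "advertise_addr": "172.31.53.195",
  "bind_addr": "172.31.53.195",
  "client_addr": "127.0.0.1 172.31.53.195",
  "performance": { "raft_multiplier": 5 },
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "tokens": { "agent": "<token>" }
  },
  "log_file": "/var/log/consul/",
  "log_json": true
}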

RitaGlushkova commented 1 year ago

Thank you @berkeli. I also read about tokens last night, but because my IP address changed, I couldn't test it. Thank you for trying it, happy it works! The summary looks good.

Radha13 commented 1 year ago

Is there a reason you always started Consul from the command line with consul agent -config-dir=/mnt/consul/config/ rather than using the systemd service that was already configured on the system?

berkeli commented 1 year ago

Is there a reason you always started Consul from the command line with consul agent -config-dir=/mnt/consul/config/ rather than using the systemd service that was already configured on the system?

Yes, when I ran it with systemctl it was giving an error:

systemctl start consul
Failed to start consul.service: The name org.freedesktop.PolicyKit1 was not provided by any .service files
See system logs and 'systemctl status consul.service' for details.

I got a similar error when running systemctl restart or systemctl stop.

I think it might have been because we didn't run it with sudo; if that had worked, we wouldn't have had to reboot as often either.

Radha13 commented 1 year ago

You don't need sudo, but it should be consul.service instead of consul, as mentioned in the error message. Edit: it works even without .service, ignore me.

[berkeli@consul-server3 ~]$ systemctl status consul.service 
● consul.service - Consul Agent
   Loaded: loaded (/etc/systemd/system/consul.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
[berkeli@consul-server3 ~]$ 

Radha13 commented 1 year ago

Ok, overall great work on getting the expected result :)

A couple of things to know and think about:

This way you wouldn't really need to start the Consul agent by hand: you edit/change the config and just run systemctl restart/status/start consul.service.
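
For example (assuming the unit file already present at /etc/systemd/system/consul.service points at the same config directory):

sudo systemctl enable consul.service     # start on boot
sudo systemctl start consul.service      # sudo should avoid the PolicyKit error above
systemctl status consul.service          # check that it is running
sudo systemctl restart consul.service    # pick up config changes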

The only reason this exercise worked is because you started the agent by hand; see the 2 processes?

[ec2-user@consul-server3 ~]$ ps aux|grep consul |grep agent
root      6557  0.0  0.7 239804  7228 ?        S    Jan17   0:00 sudo consul agent -config-dir /mnt/consul/config/
root      6558  0.7  9.8 818248 97744 ?        Sl   Jan17  20:10 consul agent -config-dir /mnt/consul/config/
[ec2-user@consul-server3 ~]$

I would call this a ~hack~ workaround :D

This (the raft_multiplier) seems like a config option that shouldn't matter much, but in reality Slack had an outage because we had this field misconfigured across servers :D