earthgecko / skyline

Anomaly detection
http://earthgecko-skyline.readthedocs.io/en/latest/

Redis exception in horizon #470

Closed krishh-konar closed 2 years ago

krishh-konar commented 2 years ago

While trying to set up Skyline, I am getting a Redis exception in the horizon logs. The Redis server has restarted once due to a server reboot.

2021-12-21 16:28:27 :: 9752 :: horizon.worker :: total queue values known for the last 60 seconds - 4
2021-12-21 16:28:27 :: 9752 :: horizon.worker :: average queue size for the last 60 seconds - 0
2021-12-21 16:28:27 :: 9752 :: horizon.worker :: datapoints_sent_to_redis in last 60 seconds - 0
2021-12-21 16:28:27 :: 9752 :: Traceback (most recent call last):
  File "/opt/skyline/github/skyline/skyline/horizon/worker.py", line 505, in run
    self.redis_conn.sadd('horizon.metrics_received', *set(metrics_received))
  File "/opt/python_virtualenv/projects/skyline-py3810/lib/python3.8/site-packages/redis/client.py", line 2243, in sadd
    return self.execute_command('SADD', name, *values)
  File "/opt/python_virtualenv/projects/skyline-py3810/lib/python3.8/site-packages/redis/client.py", line 901, in execute_command
    return self.parse_response(conn, command_name, **options)
  File "/opt/python_virtualenv/projects/skyline-py3810/lib/python3.8/site-packages/redis/client.py", line 915, in parse_response
    response = connection.read_response()
  File "/opt/python_virtualenv/projects/skyline-py3810/lib/python3.8/site-packages/redis/connection.py", line 756, in read_response
    raise response
redis.exceptions.ResponseError: wrong number of arguments for 'sadd' command

2021-12-21 16:28:27 :: 9752 :: horizon.worker :: error adding horizon.metrics_received
2021-12-21 16:28:27 :: 9752 :: renaming key horizon.metrics_received to aet.horizon.metrics_received
2021-12-21 16:28:27 :: 9752 :: Traceback (most recent call last):
  File "/opt/skyline/github/skyline/skyline/horizon/worker.py", line 512, in run
    self.redis_conn.rename('horizon.metrics_received', 'aet.horizon.metrics_received')
  File "/opt/python_virtualenv/projects/skyline-py3810/lib/python3.8/site-packages/redis/client.py", line 1747, in rename
    return self.execute_command('RENAME', src, dst)
  File "/opt/python_virtualenv/projects/skyline-py3810/lib/python3.8/site-packages/redis/client.py", line 901, in execute_command
    return self.parse_response(conn, command_name, **options)
  File "/opt/python_virtualenv/projects/skyline-py3810/lib/python3.8/site-packages/redis/client.py", line 915, in parse_response
    response = connection.read_response()
  File "/opt/python_virtualenv/projects/skyline-py3810/lib/python3.8/site-packages/redis/connection.py", line 756, in read_response
    raise response
redis.exceptions.ResponseError: no such key

2021-12-21 16:28:27 :: 9752 :: error :: failed to rename Redis key horizon.metrics_received to aet.horizon.metrics_received - no such key
2021-12-21 16:28:42 :: 9754 :: horizon.worker :: worker queue is empty and timed out
2021-12-21 16:28:42 :: 9752 :: horizon.worker :: worker queue is empty and timed out
2021-12-21 16:28:42 :: 9752 :: horizon.worker :: total queue size for the last 10 seconds - 0
2021-12-21 16:28:42 :: 9752 :: horizon.worker :: total queue values known for the last 10 seconds - 1

I checked Redis to see whether those keys exist:

redis /var/run/redis/redis.sock> get aet.horizon.metrics_received
(nil)
redis /var/run/redis/redis.sock> get horizon.metrics_received
(nil)
redis /var/run/redis/redis.sock> type horizon.metrics_received
none
redis /var/run/redis/redis.sock> type aet.horizon.metrics_received
none
redis /var/run/redis/redis.sock>
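For reference, the equivalent check with redis-py (the client the horizon worker uses); horizon.metrics_received is a Redis set when populated, so TYPE and SMEMBERS are the relevant commands rather than GET:

import redis

# Same socket path as the redis-cli prompt above.
r = redis.Redis(unix_socket_path='/var/run/redis/redis.sock')
print(r.type('horizon.metrics_received'))      # b'none' while the key is missing, b'set' once populated
print(r.smembers('horizon.metrics_received'))  # set() while missing or empty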

I have configured Graphite to relay metrics to horizon, and I can see that I'm receiving data using tcpdump, but I can't get any data into Skyline.

Can someone help me with this Redis issue?

earthgecko commented 2 years ago

Hi @krishh-konar, a few questions to try and pinpoint where the issue lies.

It looks like horizon is NOT receiving metrics. I do not think this is a Redis issue; see my next comments below.

First, please confirm that your redis.conf has the unix socket enabled (and that settings.py in Skyline points at the same socket path):

unixsocket /var/run/redis/redis.sock
unixsocketperm 777
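A quick way to confirm the socket path and permissions from the Skyline virtualenv is a redis-py ping over the socket (a minimal sketch, using the socket path from the config above):

import redis

# Minimal connectivity check over the unix socket configured above.
r = redis.Redis(unix_socket_path='/var/run/redis/redis.sock')
print(r.ping())  # True if the socket path and permissions are correct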

If none of the above steers you in the right direction, could you please paste more of your horizon log? You can also jump on to https://gitter.im/earthgecko-skyline if that is quicker.

earthgecko commented 2 years ago

Hi @krishh-konar

The error suggests that the worker is not processing any metrics: the metrics_received list is empty, which is why Redis throws wrong number of arguments for 'sadd' command, as SADD needs at least one member to add. How many metrics are you sending from Graphite? The average queue size for the last 60 seconds - 0 line also suggests that none are being received.
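For what it's worth, the failure can be reproduced outside the worker; a minimal redis-py sketch over the same socket, showing that an empty metrics_received expands to an SADD with no members:

import redis

r = redis.Redis(unix_socket_path='/var/run/redis/redis.sock')
metrics_received = []  # what the worker is left with when no metrics arrive
try:
    # *set([]) expands to nothing, so the server sees a bare SADD with no members
    r.sadd('horizon.metrics_received', *set(metrics_received))
except redis.exceptions.ResponseError as e:
    print(e)  # wrong number of arguments for 'sadd' command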

Could you stop horizon, move the horizon log file, start horizon again, and then paste the first 1 or 2 full minutes of the log?

When you do that, have a look at your Graphite carbon-relay log as well: it should tell you that the connection was dropped, and then, when horizon is started again, it should report that the connection was re-established.

Also, if you look in your Graphite browser you should see metrics for the Skyline Horizon destination, e.g. carbon.relays.YOUR_GRAPHITE_HOST-a.destinations.SKYLINE_IP:2014:None.sent and fullQueueDrops. If Graphite is sending metrics to horizon you will see these metrics in Graphite, and they will confirm whether Graphite is sending or dropping.
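If that is easier than clicking through the Graphite browser, the same destination counters can be pulled from Graphite's render API; a small sketch, with GRAPHITE_HOST as a placeholder:

import requests

# Pull the carbon-relay destination counters for the last hour as JSON.
params = {'target': 'carbon.relays.*.destinations.*.sent', 'from': '-1h', 'format': 'json'}
for series in requests.get('http://GRAPHITE_HOST/render', params=params).json():
    print(series['target'], series['datapoints'][-5:])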

krishh-konar commented 2 years ago

Thanks for the reply. This is the Redis version I'm using:

redis-server:
  Installed: 5:5.0.14

I checked the Redis confs, and the params are set correctly in both the Redis conf and settings.py in Skyline. I'm also seeing Redis getting populated, so I think the connection/access is fine.

redis /var/run/redis/redis.sock> keys *
 1) "analyzer.metrics_manager.resolutions"
 2) "skyline.external_settings"
 3) "aet.analyzer.unique_base_names"
 4) "aet.analyzer.too_short"
 5) "analyzer.boring"
 6) "mirage"
 7) "thunder.rolling"
 8) "analyzer.unique_base_names"
 9) "analyzer.metrics_manager.metrics_sparsity.avg_sparsity"
10) "analyzer.metrics_manager.metrics_sparsity.metrics_sparsity_increasing"
11) "thunder.last.luminosity"
12) "analyzer.too_short"
13) "analyzer.metrics_manager.metrics_sparsity.metrics_sparsity_decreasing"
14) "redis"
15) "thunder.last.redis"
16) "thunder.last.panorama"
17) "new_non_derivative_metrics"
18) "thunder"
19) "panorama"
20) "aet.analyzer.metrics_manager.metrics_sparsity_increasing"
21) "analyzer.metrics_manager.hash_key.metrics_data_sparsity"
22) "luminosity.cloudbursts"
23) "aet.analyzer.stale"
24) "analyzer.metrics_manager.metrics_sparsity_decreasing"
.......

krishh-konar commented 2 years ago

Thanks, I'll try all these and hopefully it works 🤞 🤞

krishh-konar commented 2 years ago

I checked the relay logs, and the connection is indeed being made:

22/12/2021 09:04:57 :: [console] Stopping factory CarbonClientFactory(10.21.0.81:2024:None)
22/12/2021 09:05:03 :: [console] Starting factory CarbonClientFactory(10.21.0.81:2024:None)
22/12/2021 09:05:03 :: [clients] CarbonClientFactory(10.21.0.81:2024:None)::startedConnecting (10.21.0.81:2024)
22/12/2021 09:05:03 :: [clients] CarbonClientProtocol(10.21.0.81:2024:None)::connectionMade
22/12/2021 09:05:03 :: [clients] CarbonClientFactory(10.21.0.81:2024:None)::connectionMade (CarbonClientProtocol(10.21.0.81:2024:None))
22/12/2021 09:05:03 :: [clients] Destination is up: 10.21.0.81:2024:None

I checked the Graphite data as well to make sure there is no issue with the metric itself, and I can see the graphs created from these metrics in Grafana. I ran tcpdump to see if a connection is actually being made with horizon, and I did see data there:

krishanu@node1:~$ sudo tcpdump -i any port  2024 and inbound
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
09:23:51.620031 IP node2.internal.47350 > node1.internal.2024: Flags [P.], seq 3260479491:3260480358, ack 1518300410, win 511, options [nop,nop,TS val 1149844013 ecr 2375717253], length 867
09:24:21.228514 IP node2.internal.47350 > node1.internal.2024: Flags [.], ack 1, win 511, options [nop,nop,TS val 1149873618 ecr 2375718365], length 0
09:24:51.657930 IP node2.internal.47350 > node1.internal.2024: Flags [P.], seq 867:1734, ack 1, win 511, options [nop,nop,TS val 1149904046 ecr 2375747970], length 867
09:25:01.676292 IP node2.internal.47350 > node1.internal.2024: Flags [.], ack 1, win 511, options [nop,nop,TS val 1149914064 ecr 2375778397], length 0
09:25:32.908335 IP node2.internal.47350 > node1.internal.2024: Flags [.], ack 1, win 511, options [nop,nop,TS val 1149945294 ecr 2375788414], length 0

I made the API call to see what metrics horizon has, and I got the following response:

$ curl http://node1.internal/api?unique_metrics
{
    "data": {
        "metrics": [
            "metrics.carbon.relays.node1-a.destinations.10_21_0_81:2024:None.relayMaxQueueLength",
            "metrics.carbon.relays.node1-a.cpuUsage",
            "metrics.carbon.relays.node1-a.memUsage",
            "metrics.carbon.relays.node1-a.destinations.10_21_0_81:2024:None.attemptedRelays",
            "metrics.carbon.relays.node1-a.blacklistMatches",
            "metrics.carbon.relays.node1-a.metricsReceived",
            "metrics.carbon.relays.node1-a.activeConnections",
            "metrics.carbon.relays.node1-a.destinations.10_21_0_81:2024:None.queuedUntilConnected",
            "metrics.carbon.relays.node1-a.destinations.10_21_0_81:2024:None.sent",
            "metrics.carbon.relays.node1-a.destinations.10_21_0_81:2024:None.batchesSent",
            "metrics.carbon.relays.node1-a.whitelistRejects"
        ]
    },
    "status": {
        "cluster_data": false,
        "filter_namespace": null,
        "filtered": false,
        "remove_full_namespace_prefix": false,
        "request_time": 0.0006222724914550781,
        "response": 200
    }
}

I am probably making a basic mistake. Is there a prefix that our metric names must have, or some other config change that needs to be made? I am currently pushing data to Graphite once per minute; I also tried decreasing the interval at which I push the data. Do I need to make some other configuration changes?

I haven't made any changes to storage-schemas.conf in carbon. Thanks!

earthgecko commented 2 years ago

Hi @krishh-konar

That looks OK, so Horizon has received data for those metrics. How many metrics are you pushing to Graphite (not Horizon)? Meaning, how many metrics does Graphite have? Are you pushing metrics to Graphite from other sources/machines? The metrics reported above are Graphite's own metrics, which it is pushing to Horizon; if it had more metrics that it was getting data for, it would push those too. One exception is the Graphite carbon.cache metrics: carbon-relay does NOT push those, because carbon-cache records them directly to itself and they are not forwarded via carbon-relay.
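As a quick way to answer the "how many metrics does Graphite have" question, graphite-web exposes a metric index endpoint; a sketch, with GRAPHITE_HOST as a placeholder:

import requests

# graphite-web serves its full metric name index as JSON.
names = requests.get('http://GRAPHITE_HOST/metrics/index.json').json()
print(len(names))  # total metrics Graphite knows about
print([n for n in names if not n.startswith('carbon.')][:10])  # sample of non-carbon metrics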

Could you paste or attach your Graphite relay-rules.conf and carbon.conf (anonymised)?

Your relay-rules.conf should look something like:

[all]
pattern = .*
destinations = 0.0.0.0:2004, 10.21.0.81:2024

[default]
default = true
destinations = 0.0.0.0:2004:a, 10.21.0.81:2024:a
# WHERE 0.0.0.0:2004 is the local carbon-cache

Given your metrics I think that IP should be correct, and I am assuming that carbon-cache is running on the same host as carbon-relay. However, given the list of metrics you pasted above, it does not look like you are sending metrics to the local carbon-cache. We would expect to see metrics for the local carbon-cache destination too, something like carbon.relays.node1-a.destinations.0_0_0_0:2004:None.sent

carbon-relay should forward metrics to:

carbon-relay |-> carbon-cache
             |-> Horizon

krishh-konar commented 2 years ago

Figured out the issue in the discussion on Gitter: I was sending metrics directly to carbon-cache and not to carbon-relay. Sending metrics to :2013 instead of :2003 solved the issue.
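For anyone landing here with the same symptom, a minimal sketch of pushing a test datapoint at carbon-relay's plaintext line receiver (2013 is carbon-relay's default line port and 2003 is carbon-cache's; the hostname and metric name are placeholders):

import socket
import time

# Graphite plaintext protocol: "metric value timestamp\n"
sock = socket.create_connection(('GRAPHITE_HOST', 2013))
sock.sendall(f'test.skyline.datapoint 1 {int(time.time())}\n'.encode())
sock.close()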

Many thanks :)