Maybe the simplest workaround for the moment would be to change the way the exporter polls data: let it collect the node metrics at a configurable interval (e.g. 30 seconds) and store those values internally, then respond to scrape requests (and whatever netdata does) with that stored data. That would reduce the load we're all seeing and shouldn't require much programming to implement.
Ok, I just tested it and it has absolutely no effect. The problem seems to be that every request takes a lot of resources and time, not the polling of the storagenode: polling the storagenode API actually takes only 3-5ms, but calling `curl localhost:9651` against a script that only returns the stored values takes 1.5s!
Here are the modified parts of your script btw:
```python
import os
import time

from prometheus_client import REGISTRY, start_http_server


class StorjCollector(object):
    def __init__(self):
        self.storj_host_address = os.environ.get('STORJ_HOST_ADDRESS', '127.0.0.1')
        self.storj_api_port = os.environ.get('STORJ_API_PORT', '14002')
        self.data = {}

    def get_data(self):
        print("Requested data")
        return self.data

    def poll(self):
        # Refresh the cached storagenode API data and log how long it took.
        a = time.time()
        self.data = self.call_api("sno/")
        b = time.time()
        print("poll took", b - a)

    ...


if __name__ == "__main__":
    storj_exporter_port = os.environ.get('STORJ_EXPORTER_PORT', '9651')
    storj_polling_interval = int(os.environ.get('STORJ_POLLING_INTERVAL', 30))

    collector = StorjCollector()
    collector.poll()
    REGISTRY.register(collector)
    start_http_server(int(storj_exporter_port))

    # Refresh the cached values on a fixed interval instead of per scrape.
    while True:
        time.sleep(storj_polling_interval)
        collector.poll()
```
I thought it had something to do with Docker and ran the script without Docker, but that doesn't change anything. Every 5s there is a spike in the storagenode too while the script is running. So either I have no idea what's going on or I did something wrong. The behaviour is the same with both scripts, so maybe I just don't quite understand how the prometheus client works.
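For what it's worth, in prometheus_client a registered collector's `collect()` method is called on every scrape of the exporter port, so whatever happens inside `collect()` runs once per request. Below is a minimal sketch of keeping `collect()` cheap by only reading cached values; it assumes the stock prometheus_client API, and the metric name is just an example:

```python
import time

from prometheus_client.core import GaugeMetricFamily


class CachedCollector:
    """Keeps collect() cheap: scrapes only read values refreshed by poll()."""

    def __init__(self):
        self.cached_value = 0.0

    def poll(self):
        # The expensive work (e.g. the storagenode API call) belongs here,
        # on our own schedule rather than once per scrape.
        self.cached_value = time.time() % 100

    def collect(self):
        # prometheus_client calls this on every scrape; it only reads the cache.
        yield GaugeMetricFamily(
            'example_cached_value', 'Example value served from a cache',
            value=self.cached_value)
```

Registration (`REGISTRY.register(...)`) and the background polling loop would then look like the `__main__` block above.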
Thanks for the update, it works great! However, it makes the CPU spikes even worse, and with 3 nodes on the same host it's noticeable.
I posted this in the forum too: it turns out that netdata was discovering the exporter endpoint by default and polling it every 5 seconds. As a result, all the data got stored in Prometheus twice if you scrape both the exporter and netdata.
I got rid of netdata polling the exporter endpoint by using docker-compose to connect the exporter to prometheus without exposing any port to the host. This significantly reduced the CPU spikes.
Here's the docker-compose.yml I used:
```yaml
version: '3.7'

services:
  storagenode:
    image: storjlabs/storagenode:latest
    container_name: storagenode1
    user: "1000:1000"
    restart: unless-stopped
    ports:
      - 7777:7770
      - 14002:14002
      - 28967:28967
    environment:
      - WALLET=
      - EMAIL=
      - ADDRESS=
      - STORAGE=9TB
    volumes:
      - type: bind
        source: /media/STORJ/STORJ
        target: /app/config
      - type: bind
        source: /media/STORJ/identity
        target: /app/identity
    networks:
      - default
    stop_grace_period: 300s
    deploy:
      resources:
        limits:
          memory: 4096M

  storj-exporter:
    image: anclrii/storj-exporter:latest
    container_name: storj-exporter1
    user: "1000:1000"
    restart: unless-stopped
    environment:
      - STORJ_HOST_ADDRESS=storagenode1
      - STORJ_API_PORT=14002
      - STORJ_EXPORTER_PORT=9651
    networks:
      - default

  prometheus:
    image: prom/prometheus
    container_name: prometheus
    user: "1000:1000"
    ports:
      - 9090:9090
    volumes:
      - /sharedfolders/config/prometheus.yml:/etc/prometheus/prometheus.yml
      - type: bind
        source: /sharedfolders/prometheus
        target: /prometheus
    restart: unless-stopped
    command: --storage.tsdb.retention.time=720d --storage.tsdb.retention.size=30GB --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus
    networks:
      - default

networks:
  default:
```
and in prometheus.yml:
```yaml
  - job_name: storagenode
    scrape_interval: 30s
    scrape_timeout: 20s
    metrics_path: /
    static_configs:
      - targets: ["storj-exporter1:9651"]
        labels:
          instance: "node1"
```
Might be worth adding to the readme as an alternative approach?
I implemented an HTTPRequestHandler with /metrics and /status endpoints here: https://github.com/anclrii/Storj-Exporter/commit/e74d0001198415e54a46788a4b563ca8e0f1a8b1
It will only collect metrics when /metrics is called, which should stop the excessive CPU usage from anything else opening the exporter port. It requires `metrics_path: /metrics` in the Prometheus configuration. I'm not sure whether netdata also targets the /metrics endpoint, though.
I also added a /status endpoint that just returns `{"status": "alive"}` and can be used to check whether the exporter is running without hammering the CPU.
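Roughly, the endpoint split works like this; this is a hedged sketch of the idea rather than the exact code in the linked commit, using Python's http.server together with prometheus_client's `generate_latest`:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from prometheus_client import CONTENT_TYPE_LATEST, REGISTRY, generate_latest


class ExporterHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            # Expensive path: collecting and rendering the registered metrics
            # (and therefore calling the storagenode API) happens only here.
            output = generate_latest(REGISTRY)
            self.send_response(200)
            self.send_header('Content-Type', CONTENT_TYPE_LATEST)
            self.end_headers()
            self.wfile.write(output)
        elif self.path == '/status':
            # Cheap health check that never triggers a metrics collection.
            body = json.dumps({"status": "alive"}).encode()
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == '__main__':
    HTTPServer(('', 9651), ExporterHandler).serve_forever()
```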
@kevinkk525 can you try this with the anclrii/storj-exporter:dev image and confirm whether it helps with netdata?
I think this polling of exporters must be configurable in netdata; it's some kind of service discovery and can most likely be disabled. @kevinkk525 did you try to disable this in the netdata configs?
I like that implementation of metrics; however, it doesn't change anything for netdata. It still finds the metrics endpoint and scrapes all the data every 5 seconds. I'm sure it's some module for automatically detecting those endpoints: https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/prometheus It finds the storj exporter by default, so users would need to disable that module themselves. I haven't tried disabling it yet, I just verified that a job scraping the exporter gets configured automatically.
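If it is that go.d.plugin module, one likely way to turn the auto-discovery off (untested here, so worth double-checking against the netdata docs) is to disable the prometheus module in go.d.conf:

```yaml
# /etc/netdata/go.d.conf (usually edited via /etc/netdata/edit-config go.d.conf)
# Assumption: disabling the go.d.plugin prometheus module also removes the
# auto-configured job that scrapes the exporter every 5 seconds.
modules:
  prometheus: no
```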
Ok cool, thanks for confirming. It seems a bit irresponsible of netdata to do that, because some exporters are more expensive than others and it can easily monitor servers to death this way :)
If you find out how to disable it, please share the steps or PR them to the readme; we need a "Known issues" section there. It might also be worth raising this with netdata so they disable endpoint polling by default.
On our side I can try to implement some caching/throttling of the calls to the storj API. I think I can make it optional and make the refresh interval configurable via an env var. What do you think would be a sane default refresh interval? I'm thinking either 60s (the default Prometheus scrape interval) or disabled (the current exporter behaviour), depending on how common this issue is.
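A rough sketch of what that optional throttling could look like, wrapping the existing API call with a time-based cache; the env var name and wiring here are only illustrative, not the actual implementation:

```python
import os
import time


class ThrottledApi:
    """Illustrative wrapper: refresh a cached API response at most once per interval."""

    def __init__(self, call_api, refresh_interval=None):
        # refresh_interval=None (or 0) keeps the current behaviour:
        # every call goes straight to the storagenode API.
        self.call_api = call_api
        self.refresh_interval = refresh_interval
        self.cache = {}
        self.last_refresh = {}

    def get(self, path):
        now = time.time()
        if (not self.refresh_interval
                or path not in self.cache
                or now - self.last_refresh[path] >= self.refresh_interval):
            self.cache[path] = self.call_api(path)
            self.last_refresh[path] = now
        return self.cache[path]


# Hypothetical env var: 0 or unset disables caching (current behaviour).
refresh_interval = int(os.environ.get('STORJ_REFRESH_INTERVAL', 0)) or None
```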
I'll make a PR when I find out.
Throttling interval... hmm, I use a 30s scrape interval for my nodes, so that might be too much :D I guess I wouldn't invest too much time into programming that. I need to find out how to disable the netdata polling and then we're good. It shouldn't be too complicated.
I'll probably fall back to reporting metrics on any path to avoid breaking people's Prometheus configurations that point to other endpoints; it didn't help with the original issue anyway. I might keep the /status endpoint though.
I guess it would be good to have an endpoint for such health checks that doesn't trigger the exporter to collect metrics, as currently I think any connection to the exporter port triggers it.
Originally posted by @anclrii in https://github.com/anclrii/Storj-Exporter/issues/34#issuecomment-678236020