anclrii / Storj-Exporter

Prometheus exporter for monitoring Storj storage nodes
GNU General Public License v3.0

Add an endpoint for healthchecks that wouldn't trigger exporter to collect metrics #35

Closed: anclrii closed this issue 3 years ago

anclrii commented 4 years ago

I guess it would be good to add an endpoint for such healthchecks that wouldn't trigger the exporter to collect metrics, as currently I think any connection to the exporter port will trigger it.

Originally posted by @anclrii in https://github.com/anclrii/Storj-Exporter/issues/34#issuecomment-678236020

kevinkk525 commented 4 years ago

Maybe the simplest workaround for the moment would be to change the way the exporter polls data: let it collect the node metrics at a configurable interval (e.g. 30 seconds) and store those values internally, then only respond to scraping requests (and whatever netdata does) with that stored data. That would reduce the load we're all seeing but wouldn't require much programming to implement.

kevinkk525 commented 4 years ago

ok I just tested it and it has absolutely no effect. The problem seems to be that every request is taking a lot of resources and time, not the polling of the storagenode. Polling the storagenode API actually takes only 3-5ms, but calling "curl localhost:9651" against a script that only returns the stored values takes 1.5s!

Here are the modified parts of your script btw:

import os
import time

from prometheus_client import REGISTRY, start_http_server


class StorjCollector(object):
    def __init__(self):
        self.storj_host_address = os.environ.get('STORJ_HOST_ADDRESS', '127.0.0.1')
        self.storj_api_port = os.environ.get('STORJ_API_PORT', '14002')
        self.data = {}

    def get_data(self):
        # serve the cached values instead of hitting the storagenode API
        print("Requested data")
        return self.data

    def poll(self):
        # refresh the cache from the storagenode API and time the call
        a = time.time()
        self.data = self.call_api("sno/")
        b = time.time()
        print("poll took", b - a)

...

if __name__ == "__main__":
    storj_exporter_port = os.environ.get('STORJ_EXPORTER_PORT', '9651')
    storj_polling_interval = int(os.environ.get('STORJ_POLLING_INTERVAL', 30))
    collector = StorjCollector()
    collector.poll()  # prime the cache before the first scrape
    REGISTRY.register(collector)
    start_http_server(int(storj_exporter_port))
    while True:
        # refresh the cached values at a fixed interval, independent of scrapes
        time.sleep(storj_polling_interval)
        collector.poll()

kevinkk525 commented 4 years ago

I thought it had something to do with docker and used the script without docker, but that doesn't change anything. Every 5s there is a spike in the storagenode too when the script is running. So either I have no idea or I did something wrong... Behaviour is the same with both scripts, so maybe I just don't quite understand how the prometheus client works.

kevinkk525 commented 4 years ago

Thanks for the update, it works great! However, it makes the CPU spikes even worse (screenshot omitted).

And with 3 nodes on the same host it's noticeable.

kevinkk525 commented 4 years ago

I posted this in the forum too: it turns out that netdata was by default discovering the exporter endpoint and polling it every 5 seconds. As a result, all data got stored in prometheus twice if you scrape both the exporter and netdata.

I got rid of netdata polling the exporter endpoint by using docker-compose to connect the exporter to prometheus without exposing any port to the host. This significantly reduced the CPU spikes.

Here's the docker-compose.yml I used:

version: '3.7'

services:
  storagenode:
    image: storjlabs/storagenode:latest
    container_name: storagenode1
    user: "1000:1000"
    restart: unless-stopped
    ports:
      - 7777:7770
      - 14002:14002
      - 28967:28967
    environment:
      - WALLET=
      - EMAIL=
      - ADDRESS=
      - STORAGE=9TB
    volumes:
      - type: bind
        source: /media/STORJ/STORJ
        target: /app/config
      - type: bind
        source: /media/STORJ/identity
        target: /app/identity
    networks:
      - default
    stop_grace_period: 300s
    deploy:
      resources:
        limits:
          memory: 4096M

  storj-exporter:
    image: anclrii/storj-exporter:latest
    container_name: storj-exporter1
    user: "1000:1000"
    restart: unless-stopped
    # note: no "ports:" section here, so the exporter is only reachable from other
    # containers on this docker network (i.e. prometheus), not from the host or netdata
    environment:
      - STORJ_HOST_ADDRESS=storagenode1
      - STORJ_API_PORT=14002
      - STORJ_EXPORTER_PORT=9651
    networks:
      - default

  prometheus:
    image: prom/prometheus
    container_name: prometheus
    user: "1000:1000"
    ports:
      - 9090:9090
    volumes:
      - /sharedfolders/config/prometheus.yml:/etc/prometheus/prometheus.yml 
      - type: bind
        source: /sharedfolders/prometheus
        target: /prometheus
    restart: unless-stopped
    command: --storage.tsdb.retention.time=720d --storage.tsdb.retention.size=30GB --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus
    networks:
      - default

networks:
  default:

and in prometheus.yml:

  - job_name: storagenode
    scrape_interval: 30s
    scrape_timeout: 20s
    metrics_path: /
    static_configs:
      - targets: ["storj-exporter1:9651"]
        labels:
          instance: "node1"

Might be worth adding to the readme as an alternative approach?

anclrii commented 3 years ago

I implemented an HTTPRequestHandler with /metrics and /status endpoints here: https://github.com/anclrii/Storj-Exporter/commit/e74d0001198415e54a46788a4b563ca8e0f1a8b1

This only collects metrics when /metrics is called and should stop the excessive CPU usage for anything else opening the exporter port. It requires metrics_path: /metrics in the prometheus configuration. I'm not sure if netdata also targets the /metrics endpoint though.

I also added a /status endpoint that just returns {"status": "alive"} and can be used to check that the exporter is running without hammering the CPU.
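
For reference, this is roughly the shape of it (a minimal sketch, not the actual commit; the handler name, registry and the hard-coded port 9651 are just placeholders):

import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from prometheus_client import CollectorRegistry, generate_latest, CONTENT_TYPE_LATEST

registry = CollectorRegistry()  # collectors registered here only run when /metrics is hit

class ExporterHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            # generate_latest() is what triggers the registered collectors
            output = generate_latest(registry)
            self.send_response(200)
            self.send_header('Content-Type', CONTENT_TYPE_LATEST)
            self.end_headers()
            self.wfile.write(output)
        elif self.path == '/status':
            # cheap healthcheck: no storagenode API calls are made here
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(json.dumps({"status": "alive"}).encode())
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(('', 9651), ExporterHandler).serve_forever()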

@kevinkk525 can you try this with anclrii/storj-exporter:dev image and confirm if this helps with netdata?

anclrii commented 3 years ago

I think this polling of exporters must be configurable in netdata; it's some kind of service discovery and can most likely be disabled. @kevinkk525 did you try to disable this in the netdata configs?

kevinkk525 commented 3 years ago

I like that implementation of metrics; however, it doesn't change anything for netdata. It still finds the metrics endpoint and scrapes all the data every 5 seconds. I'm sure it's some module for automatically detecting those endpoints: https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/prometheus It finds the storj exporter by default, so users would need to disable that module themselves. I haven't tried yet how to disable it, I just verified that a job that scrapes the exporter is configured automatically.

anclrii commented 3 years ago

Ok cool, thanks for confirming. It seems a bit irresponsible of netdata to do that, because some exporters are more expensive than others and it can easily monitor servers to death this way :)

If you find out how to disable it, please share the steps or PR them to the readme; we need a "Known issues" section there. It might also be worth raising this with netdata, asking them to disable endpoint polling by default.

On our side of things, I can try to implement a sort of caching/throttling of calls to the storj api. I think I can make it optional and make the refresh interval configurable via an env var. What do you think would be a sane default refresh interval? I'm thinking either 60s (the default prometheus scrape interval) or disabled (the current exporter behaviour), depending on how common this issue is.
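
Something along these lines, maybe (just a rough sketch; the env var name STORJ_REFRESH_INTERVAL and the wrapper class are placeholders, not part of the exporter):

import os
import time

class CachedApi(object):
    """Wraps an api_call function and reuses its result for refresh_interval seconds.
    A refresh_interval of 0 disables caching, i.e. the current exporter behaviour."""

    def __init__(self, api_call, refresh_interval):
        self.api_call = api_call
        self.refresh_interval = refresh_interval
        self.cache = {}  # path -> (timestamp, data)

    def get(self, path):
        now = time.time()
        cached = self.cache.get(path)
        if self.refresh_interval and cached and now - cached[0] < self.refresh_interval:
            return cached[1]  # still fresh, skip the storagenode API call
        data = self.api_call(path)
        self.cache[path] = (now, data)
        return data

# the interval would come from an env var, defaulting to disabled:
refresh_interval = int(os.environ.get('STORJ_REFRESH_INTERVAL', 0))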

kevinkk525 commented 3 years ago

I'll make a PR when I find out.

Throttling interval... hmm, I use a 30s scrape interval for my nodes. Might be too much :D I guess I wouldn't invest too much time into programming that. I need to find out how to disable the netdata polling and then we're good. Shouldn't be too complicated.

anclrii commented 3 years ago

I'll probably fall back to reporting metrics on any endpoint, to avoid breaking people's prometheus configurations that point at other paths. It didn't help with the original issue anyway. I might keep the /status endpoint though.