Differing queries - Githubissues

jounaidr commented 4 months ago

Hi,

Thanks for this project, it is exactly what I was looking for for a similar HPC system. I have managed to setup and run the platform with Grafana, however it required changes to the queries (just syntactically) and looking at the other issues it appears there's some discrepancy with the exporter versions. I believe I was using the modified cgroups exporter provided, but to get it working I had to change the queries to what the metric label was displaying in Prometheus. I am attempting to consolidate the queries into the config file for our setup so its would be somewhat easier to change them in the future if we were to add new exporters or if the existing exporter queries change. Also i'm thinking the issues could be caused by my Prometheus config as what was in the docs was not working so I just made a basic one (i am new to Prometheus :p). Please feel free to close this if its not acceptable!

here is my prom config:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'node-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
            - targets: ['localhost:9100','localhost:9306','localhost:9821']

plazonic commented 4 months ago

Hello,

good to hear that you managed to get it working but it seems like you had some copy/paste problems? I do not see your configuration attached so not sure what you mean or what problem you had.

jounaidr commented 4 months ago

Ah yes sorry I updated the comment it should be there now! and thanks for the quick response!

jounaidr commented 4 months ago

and this is what I changed the queries to within jobstats:

    def get_job_stats(self):
        # query CPU and Memory utilization data
        self.get_data('total_memory', "max_over_time(cgroup_memory_total_bytes{cgroup='/slurm_localhost/uid_0/job_%s', instance='localhost:9306', job='prometheus'}[%ds])")
        self.get_data('used_memory', "max_over_time(cgroup_memory_rss_bytes{cgroup='/slurm_localhost/uid_0/job_%s', instance='localhost:9306', job='prometheus'}[%ds])")
        self.get_data('total_time', "max_over_time(cgroup_cpu_total_seconds{cgroup='/slurm_localhost/uid_0/job_%s', instance='localhost:9306', job='prometheus'}[%ds])")
        self.get_data('cpus', "max_over_time(cgroup_cpus{cgroup='/slurm_localhost/uid_0/job_%s', instance='localhost:9306', job='prometheus'}[%ds])")

plazonic commented 4 months ago

You need the cluster label in your scrape_config - it is mentioned in our main README and I also expanded on it in issue #8 - we expect that to be added at collection time.

jounaidr commented 4 months ago

Ok thanks, I was having issues with the labels previously but I will try and update the issue, tysm :)

jounaidr commented 4 months ago

Hi, so I updated the prom config, after some syntax issues it is happy and running:

global:
 scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
 external_labels:
  monitor: 'node-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
  scrape_interval: 5s

  file_sd_configs:
  - files:
    - "/root/prometheus-2.51.2.linux-amd64/localnodes.json"
  metric_relabel_configs:
  - source_labels:
    - __name__
    regex: "^go_.*"
    action: drop

and the nodes json:

 [
   {
     "labels": {
       "cluster": "localcluster",
       "service": "compute"
     },
     "targets": [
       "localhost:9100",
       "localhost:9821",
       "localhost:9306"
     ]
   }
 ]

However it still is only working with the changed queries :/, just to confirm I am using the following cgroups exporter: https://github.com/treydock/cgroup_exporter/releases/tag/v0.9.1

I just saw there's a new release so ill try that as well

plazonic commented 4 months ago

Hello,

no, it can't be treydock's exporter. It has to be one of the two modified versions on my github page (either master branch for cgroup v1 or cgroupv2 branch if you have something like rhel9 and are running cgroupv2). If it is working you will see jobid tags (if there are active jobs on the node), e.g.:

cgroup_memory_cache_bytes{jobid="57205080",step="",task=""} 1.16293632e+08

jounaidr commented 4 months ago

Ahhhh okay yeh makes sense, this is most likely the problem then I believe, I will try tomorrow and confirm, thanks :)

jounaidr commented 4 months ago

Hi, i've attempted to build the cgroup exporter from your repo a couple times on the V2 and master branches but it seems to always give me the same queries as before without the job id tag, possibly it is still pulling stuff from treydock's repo as it wont let me build straight with make, I have to run go get github.com/treydock/cgroup_exporter which pulls some things from the original repo I believe. Is there any chance you have the modified cgroup exporter binaries uploaded somewhere so I don't have to build them myself? Thanks!

plazonic commented 4 months ago

Hello,

first of all make sure you are using the correct branch - master is for cgroupv1 and cgroupv2 is, obviously for v2. The easiest way to recognize which one is which to just check mounts - there will be multiple cgroup mounts for v1 and only one cgroup2 mount for v2. E.g. for one of our systems:

cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot)

Next, check [https://github.com/PrincetonUniversity/jobstats/issues/8#issuecomment-2064599206](issue 8) where I answered a similar question and shown how to build as well as provided details on how to recognize that the build is correct.

jounaidr commented 4 months ago

Thanks, I believe I have the modified versions building now as the queries have changed! However they are still not correct I believe, for example cgroup_cpu_user_seconds{cluster="localcluster", instance="localhost:9306", job="prometheus", service="compute"} the jobID is omitted. This is for both v1 and the v2 branches, and as expected grafana no longer works when specifying job id with the new queries. Most likely made a simple mistake somewhere if you have any ideas :p

plazonic commented 3 weeks ago

Hullo, the issue 8 seemed to imply that you solved this? Or are you still having this problem?

jounaidr commented 3 weeks ago

Hi, this issue is still ongoing and have still had to use the modified jobstats script within our deployment, has there been any updates to the modified exporter repo or anyone with similar issues since we last had discussions? Thanks.

PrincetonUniversity / jobstats

Differing queries #12