Open sumitsaluja opened 1 month ago
Hi Sumit,
So the fact that the queries return no data means that either something is wrong with the data collection process (e.g. Prometheus is not scraping the nodes where job 874 ran) or there is a mismatch with what is in Prometheus (e.g. the job data has no cluster=ganesha label, or the label is wrong).
What might help narrow it down is to go to the web interface of the Prometheus server and, on the Graph tab, search for some of this data. Start with cgroup_memory_rss_bytes by itself: do you get anything back?
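That check can also be scripted against the Prometheus HTTP API. A minimal sketch, assuming the server URL (http://localhost:9090 here is a placeholder) and using the empty-response shape from the debug output below:

```python
from urllib.parse import quote

def build_query_url(prom_url: str, query: str) -> str:
    # Instant-query endpoint of the Prometheus HTTP API
    return f"{prom_url}/api/v1/query?query={quote(query)}"

def has_series(payload: dict) -> bool:
    # A useful response has status=success and a non-empty result list;
    # an empty result means Prometheus has no series matching the selector.
    return payload.get("status") == "success" and bool(payload["data"]["result"])

# The empty response seen in the jobstats debug output:
empty = {"status": "success", "data": {"resultType": "vector", "result": []}}
print(has_series(empty))  # -> False: no matching series in Prometheus
```

Fetch build_query_url(prom_url, "cgroup_memory_rss_bytes") with curl or urllib and feed the parsed JSON to has_series; if it is False even for the bare metric name, the exporter data is not reaching Prometheus at all.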
If not, check your Prometheus and node configs and fix them until you start getting data, especially for running jobs. Also make sure the series carry jobid/step/task labels. If there is data but those labels are missing, then you did not use the correct cgroup exporter: it has to be our modified version, not the original one.
Do the labels look good? The next steps depend on what you get back. For example, we have had a few issues where folks did not follow the instructions on adding a cluster label to the Prometheus config; our instructions include an example of how to do that.
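For illustration only, a cluster label is typically attached per scrape target in prometheus.yml along these lines (the job name and target host:port are placeholders; the jobstats documentation has the authoritative example):

```yaml
scrape_configs:
  - job_name: 'cgroup_exporter'        # placeholder job name
    static_configs:
      - targets: ['node001:9306']      # placeholder exporter host:port
        labels:
          cluster: 'ganesha'           # must match the cluster name jobstats queries
```

After reloading Prometheus, newly scraped series should show cluster="ganesha" in their label set on the Graph tab.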
Hi Josko,
I tried to install jobstats but I am getting an error:
```
$ ./jobstats -d 874
DEBUG: jobidraw=874, start=1728671046, end=1728671263, cluster=ganesha, tres=cpu=2,gres/gpu=1,mem=4000M,node=1, data=, user=ss6478, account=sysops, state=COMPLETED, timelimit=90, nodes=1, ncpus=2, reqmem=4000M, qos=normal, partition=gpu, jobname=test
DEBUG: jobid=874, jobidraw=874, start=1728671046, end=1728671263, gpus=1, diff=217, cluster=ganesha, data=, timelimitraw=90
DEBUG: query=max_over_time(cgroup_memory_total_bytes{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_memory_rss_bytes{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_cpu_total_seconds{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_cpus{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time((nvidia_gpu_memory_total_bytes{cluster='ganesha'} and nvidia_gpu_jobId == 874)[217s:]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time((nvidia_gpu_memory_used_bytes{cluster='ganesha'} and nvidia_gpu_jobId == 874)[217s:]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=avg_over_time((nvidia_gpu_duty_cycle{cluster='ganesha'} and nvidia_gpu_jobId == 874)[217s:]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
Traceback (most recent call last):
  File "./jobstats", line 58, in <module>
    stats.report_job()
  File "/tmp/jobstats/jobstats.py", line 582, in report_job
    +f'If the run time was very short then try running "seff {self.jobid}".')
  File "/tmp/jobstats/jobstats.py", line 115, in error
    raise Exception(msg)
Exception: No stats found for job 874, either because it is too old or because
it expired from jobstats database. If you are not running this command on the
cluster where the job was run then use the -c option to specify the cluster.
If the run time was very short then try running "seff 874".
```
Could you please help?