PrincetonUniversity / jobstats

Jobstats is a job monitoring platform for CPU and GPU clusters
https://princetonuniversity.github.io/jobstats/
GNU General Public License v2.0

Help with jobstat installation #16

Open sumitsaluja opened 1 month ago

sumitsaluja commented 1 month ago

Hi Josko,

I tried to install jobstats but I am getting this error:

./jobstats -d 874
DEBUG: jobidraw=874, start=1728671046, end=1728671263, cluster=ganesha, tres=cpu=2,gres/gpu=1,mem=4000M,node=1, data=, user=ss6478, account=sysops, state=COMPLETED, timelimit=90, nodes=1, ncpus=2, reqmem=4000M, qos=normal, partition=gpu, jobname=test
DEBUG: jobid=874, jobidraw=874, start=1728671046, end=1728671263, gpus=1, diff=217, cluster=ganesha, data=, timelimitraw=90
DEBUG: query=max_over_time(cgroup_memory_total_bytes{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_memory_rss_bytes{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_cpu_total_seconds{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_cpus{cluster='ganesha',jobid='874',step='',task=''}[217s]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time((nvidia_gpu_memory_total_bytes{cluster='ganesha'} and nvidia_gpu_jobId == 874)[217s:]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time((nvidia_gpu_memory_used_bytes{cluster='ganesha'} and nvidia_gpu_jobId == 874)[217s:]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=avg_over_time((nvidia_gpu_duty_cycle{cluster='ganesha'} and nvidia_gpu_jobId == 874)[217s:]), time=1728671263
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
Traceback (most recent call last):
  File "./jobstats", line 58, in <module>
    stats.report_job()
  File "/tmp/jobstats/jobstats.py", line 582, in report_job
    +f'If the run time was very short then try running "seff {self.jobid}".')
  File "/tmp/jobstats/jobstats.py", line 115, in error
    raise Exception(msg)
Exception: No stats found for job 874, either because it is too old or because it expired from jobstats database. If you are not running this command on the cluster where the job was run then use the -c option to specify the cluster. If the run time was very short then try running "seff 874".

Could you please help?

plazonic commented 1 month ago

Hi Sumit,

The empty query results mean that either something is wrong with data collection (e.g. Prometheus is not scraping the nodes where job 874 ran) or the stored data does not match what jobstats queries for (e.g. the job data has no cluster=ganesha label, or the label has a different value).
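One way to tell those two cases apart is to run the same metric with progressively fewer label filters and see at which point data appears. The following is only a sketch: the helper names `selector` and `narrowing_queries` are made up for illustration and are not part of jobstats.

```python
def selector(metric, **labels):
    """Build a PromQL selector such as metric{a='1',b='2'} from keyword labels."""
    if not labels:
        return metric
    body = ",".join(f"{name}='{value}'" for name, value in labels.items())
    return f"{metric}{{{body}}}"


def narrowing_queries(metric, cluster, jobid):
    """Return selectors from strictest to loosest (hypothetical helper).

    If only the last one returns data, scraping works but the labels
    do not match; if none of them do, the data is not being collected.
    """
    return [
        selector(metric, cluster=cluster, jobid=jobid),
        selector(metric, jobid=jobid),
        selector(metric),
    ]


for query in narrowing_queries("cgroup_memory_rss_bytes", "ganesha", "874"):
    print(query)
```

These are instant selectors, so they are most useful while a job is still running; for a finished job the same selectors would need a range and a function such as max_over_time, as in the debug output above.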

To narrow it down, go to the web interface of the Prometheus server and search for some of this data on the Graph tab. Take cgroup_memory_rss_bytes, say: start with the bare metric name and no label filters. Do you get anything back?
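The same check can be scripted against the Prometheus HTTP API (the /api/v1/query endpoint) instead of the web interface. This is a minimal sketch using only the standard library; the function names are made up for illustration, and the server address in the usage note is a placeholder, not taken from this thread.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url, promql, time=None):
    """Build the URL for an instant query against /api/v1/query."""
    params = {"query": promql}
    if time is not None:
        params["time"] = str(time)
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode(params)


def series_from_response(payload):
    """Return the list of series from a decoded API response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return payload["data"]["result"]


def run_query(base_url, promql, time=None):
    """Send the query and return the resulting series (needs network access)."""
    with urllib.request.urlopen(query_url(base_url, promql, time), timeout=10) as resp:
        return series_from_response(json.load(resp))
```

With a placeholder address, `run_query("http://prometheus.example:9090", "cgroup_memory_rss_bytes")` returns a list of series; an empty list corresponds to the `'result': []` lines in the debug output.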

If not, check your Prometheus and node configs and fix them until you start getting data, especially for running jobs. Also make sure the jobid/step/task labels are present. If there is data but those labels are missing, then you did not use the correct cgroup exporter: it has to be our modified version, not the original.

Do the labels look good? The next steps depend on what you get back. For example, we've had a few issues where folks did not follow the instructions on adding a cluster label to the Prometheus config; our instructions include an example of how to do that.
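Once a query returns series, the label check can be automated too. The sketch below inspects the `metric` dictionaries in a Prometheus API result and reports which expected label names never appear, plus the cluster values that are present. The helper names are made up for illustration. It looks at the union of labels across all series, on the assumption that some series (for example job-level ones) may carry no step or task label, since Prometheus treats an empty label as absent.

```python
EXPECTED_LABELS = ("cluster", "jobid", "step", "task")


def label_names(series):
    """Collect every label name that appears on at least one series."""
    names = set()
    for item in series:
        names.update(item.get("metric", {}))
    return names


def never_seen(series, expected=EXPECTED_LABELS):
    """Return the expected label names that appear on no series at all."""
    seen = label_names(series)
    return [name for name in expected if name not in seen]


def cluster_values(series):
    """Return the distinct values of the cluster label, sorted."""
    return sorted(
        {item["metric"]["cluster"] for item in series if "cluster" in item.get("metric", {})}
    )
```

If `never_seen` reports jobid, step and task, that points at the exporter; if it reports only cluster, or `cluster_values` does not contain the name jobstats queries for (ganesha in the debug output), that points at the Prometheus config.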