CCI-MOC / xdmod-cntr

A project to prototype the use of XDMOD with OpenStack and OpenShift on the MOC

Issue: summarize_jobs.py Command Not Processing Jobs #231

Open mauw10 opened 3 months ago

mauw10 commented 3 months ago

Hello Support Team,

I am encountering an issue with the summarize_jobs.py command on my CentOS 7 system. When I run the command:

[root@centos7 bin]# summarize_jobs.py -d

I receive the following output:

2024-06-26T14:14:13.600 [DEBUG] Using config file /usr/lib64/python2.7/site-packages/supremm-1.4.1-py2.7-linux-x86_64.egg/etc/supremm/config.json
2024-06-26T14:14:13.602 [DEBUG] Loaded 3 preprocessors
2024-06-26T14:14:13.605 [WARNING] Autoperiod library not found, TimeseriesPatterns plugins will not do period analysis
2024-06-26T14:14:13.606 [DEBUG] Loaded 35 plugins
2024-06-26T14:14:13.606 [INFO] Processing resource clusterbioproves
2024-06-26T14:14:13.606 [DEBUG] Using 3 preprocessors
2024-06-26T14:14:13.606 [DEBUG] Using 35 plugins
2024-06-26T14:14:13.612 [WARNING] /usr/lib64/python2.7/site-packages/pymongo/mongo_client.py:343: UserWarning: database name or authSource in URI is being ignored. If you wish to authenticate to supremm, you must provide a username and password.
  "must provide a username and password." % (db_name,))
2024-06-26T14:14:13.639 [INFO] Processing 0 jobs
[root@centos7 bin]#

As you can see, it is not processing any jobs. However, when I run the indexarchives.py command:

[root@centos7 bin]# indexarchives.py -a -d

It processes the archives correctly, as shown below:

2024-06-26T14:16:39.331 [DEBUG] Using config file /usr/lib64/python2.7/site-packages/supremm-1.4.1-py2.7-linux-x86_64.egg/etc/supremm/config.json
2024-06-26T14:16:39.332 [INFO] archive indexer starting
2024-06-26T14:16:39.338 [DEBUG] processed archive /data/clusterbioproves/pmlogger/2024/06/mdrvpremst01/2024-06-25/20240625.10.55.index (fileio 0.00240302085876, dbacins 4.29153442383e-05)
2024-06-26T14:16:39.343 [DEBUG] processed archive /data/clusterbioproves/pmlogger/2024/06/mdrvpremst01/2024-06-25/20240625.11.13.index (fileio 0.00458288192749, dbacins 1.50203704834e-05)
2024-06-26T14:16:39.344 [DEBUG] processed archive /data/clusterbioproves/pmlogger/2024/06/mdrvpremst01/2024-06-25/job--begin-20240625.13.56.39.index (fileio 0.000778913497925, dbacins 8.89301300049e-05)
2024-06-26T14:16:39.346 [DEBUG] processed archive /data/clusterbioproves/pmlogger/2024/06/mdrvpremst01/2024-06-25/job--end-20240625.13.56.37.index (fileio 0.00105690956116, dbacins 1.31130218506e-05)
2024-06-26T14:16:39.346 [DEBUG] processed archive /data/clusterbioproves/pmlogger/2024/06/mdrvpremst01/2024-06-26/20240626.00.10.index (fileio 0.000596046447754, dbacins 8.82148742676e-06)
2024-06-26T14:16:39.379 [INFO] archive indexer complete
[root@centos7 bin]#

The directory contains the start and end job files:

[root@centos7 2024-06-25]# ls -l
total 2696
-rw-rw-r--. 1 centos centos    4492 jun 25 11:13 20240625.10.55.0.xz
-rw-rw-r--. 1 centos centos     252 jun 25 11:13 20240625.10.55.index
-rw-rw-r--. 1 centos centos   13584 jun 25 11:11 20240625.10.55.meta.xz
-rw-rw-r--. 1 centos centos 2336104 jun 26 00:10 20240625.11.13.0
-rw-rw-r--. 1 centos centos     792 jun 26 00:10 20240625.11.13.index
-rw-rw-r--. 1 centos centos  116479 jun 25 18:53 20240625.11.13.meta
-rw-rw-r--. 1 centos centos   29200 jun 25 13:56 job--begin-20240625.13.56.39.0
-rw-rw-r--. 1 centos centos     252 jun 25 13:56 job--begin-20240625.13.56.39.index
-rw-rw-r--. 1 centos centos   76596 jun 25 13:56 job--begin-20240625.13.56.39.meta
-rw-rw-r--. 1 centos centos   23080 jun 25 13:56 job--end-20240625.13.56.37.0
-rw-rw-r--. 1 centos centos     232 jun 25 13:56 job--end-20240625.13.56.37.index
-rw-rw-r--. 1 centos centos   76596 jun 25 13:56 job--end-20240625.13.56.37.meta
-rw-rw-r--. 1 centos centos   29167 jun 26 00:10 pmlogger.log
-rw-rw-r--. 1 centos centos   15565 jun 25 11:13 pmlogger.log.prior
[root@centos7 2024-06-25]# pwd
/data/clusterbioproves/pmlogger/2024/06/mdrvpremst01/2024-06-25
[root@centos7 2024-06-25]#
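To make the job archive names easier to inspect, here is a minimal Python 3 sketch that splits them into fields. It assumes the names follow a `job-<jobid>-<begin|end>-<timestamp>.<suffix>` pattern. That pattern is my reading of the file names in the listing and is not confirmed anywhere in this thread. The `parse_job_archive` helper is hypothetical and not part of SUPReMM or PCP.

```python
import re
from datetime import datetime

# ASSUMED naming pattern: job-<jobid>-<begin|end>-<YYYYMMDD.HH.MM.SS>.<suffix>
# The job id part is allowed to be empty so that "job--begin-..." still parses.
JOB_ARCHIVE = re.compile(
    r"^job-(?P<jobid>[^-]*)-(?P<phase>begin|end)-"
    r"(?P<stamp>\d{8}\.\d{2}\.\d{2}\.\d{2})\.(?P<suffix>index|meta|0)$"
)


def parse_job_archive(name):
    """Return the fields of a job archive file name, or None if it does not match."""
    m = JOB_ARCHIVE.match(name)
    if m is None:
        return None
    return {
        "jobid": m.group("jobid"),
        "phase": m.group("phase"),
        "time": datetime.strptime(m.group("stamp"), "%Y%m%d.%H.%M.%S"),
        "suffix": m.group("suffix"),
    }


# File names copied from the directory listing in this post.
names = [
    "20240625.10.55.index",
    "job--begin-20240625.13.56.39.index",
    "job--end-20240625.13.56.37.index",
]

parsed = [p for p in map(parse_job_archive, names) if p is not None]
for p in parsed:
    print(p["phase"], repr(p["jobid"]), p["time"].isoformat())
```

Under that assumed pattern, both names parse with an empty job id field, and the begin timestamp (13:56:39) is two seconds later than the end timestamp (13:56:37). Whether either detail is related to the skipped jobs is not established here.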

When I performed the initial job ingestion and subsequently executed indexarchives.py -a -d and summarize_jobs.py -d, it added data to the supremm database in MongoDB. The output of the command was:

[root@centos7 shm]# summarize_jobs.py -d
2024-06-26T14:00:20.480 [DEBUG] Using config file /usr/lib64/python2.7/site-packages/supremm-1.4.1-py2.7-linux-x86_64.egg/etc/supremm/config.json
2024-06-26T14:00:20.482 [DEBUG] Loaded 3 preprocessors
2024-06-26T14:00:20.494 [WARNING] Autoperiod library not found, TimeseriesPatterns plugins will not do period analysis
2024-06-26T14:00:20.495 [DEBUG] Loaded 35 plugins
2024-06-26T14:00:20.496 [INFO] Processing resource clusterbioproves
2024-06-26T14:00:20.496 [DEBUG] Using 3 preprocessors
2024-06-26T14:00:20.496 [DEBUG] Using 35 plugins
2024-06-26T14:00:20.507 [WARNING] /usr/lib64/python2.7/site-packages/pymongo/mongo_client.py:343: UserWarning: database name or authSource in URI is being ignored. If you wish to authenticate to supremm, you must provide a username and password.
  "must provide a username and password." % (db_name,))
2024-06-26T14:00:20.544 [INFO] Processing 7 jobs
2024-06-26T14:00:20.549 [INFO] Skipping 1, skipped_noarchives
2024-06-26T14:00:20.623 [INFO] Skipping 2, skipped_noarchives
2024-06-26T14:00:20.644 [INFO] Skipping 3, skipped_noarchives
2024-06-26T14:00:20.650 [INFO] Skipping 4, skipped_noarchives
2024-06-26T14:00:20.655 [INFO] Skipping 5, skipped_noarchives
2024-06-26T14:00:20.660 [INFO] Skipping 6, skipped_noarchives
2024-06-26T14:00:20.664 [INFO] Skipping 7, skipped_noarchives
[root@centos7 shm]#
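For triage, the skip reasons in that run can be tallied straight from the debug output. This is a minimal Python 3 sketch that assumes only the two `[INFO]` line formats visible in the log above. The `tally` helper and its regular expressions are illustrative and not part of SUPReMM.

```python
import re
from collections import Counter

# Excerpt of the summarize_jobs.py output quoted in this post.
LOG = """\
2024-06-26T14:00:20.496 [INFO] Processing resource clusterbioproves
2024-06-26T14:00:20.544 [INFO] Processing 7 jobs
2024-06-26T14:00:20.549 [INFO] Skipping 1, skipped_noarchives
2024-06-26T14:00:20.623 [INFO] Skipping 2, skipped_noarchives
2024-06-26T14:00:20.644 [INFO] Skipping 3, skipped_noarchives
2024-06-26T14:00:20.650 [INFO] Skipping 4, skipped_noarchives
2024-06-26T14:00:20.655 [INFO] Skipping 5, skipped_noarchives
2024-06-26T14:00:20.660 [INFO] Skipping 6, skipped_noarchives
2024-06-26T14:00:20.664 [INFO] Skipping 7, skipped_noarchives
"""

TOTAL = re.compile(r"\[INFO\] Processing (\d+) jobs")
SKIP = re.compile(r"\[INFO\] Skipping (\S+), (\w+)")


def tally(log_text):
    """Return (jobs announced, Counter of skip reasons) for one run's log text."""
    total = None
    reasons = Counter()
    for line in log_text.splitlines():
        m = TOTAL.search(line)
        if m:
            total = int(m.group(1))
            continue
        m = SKIP.search(line)
        if m:
            reasons[m.group(2)] += 1
    return total, reasons


total, reasons = tally(LOG)
print("jobs announced:", total)
for reason, count in reasons.items():
    print(reason, count)
```

For the run quoted above, every one of the 7 announced jobs falls under `skipped_noarchives`, so none of them was summarized.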

However, the issue is that most metrics are not displayed. For example, metrics like Avg: Total Memory: Per Core weighted by core-hour, Avg CPU %: System: weighted by core-hour, etc., do not appear. Only a few metrics are shown. I have already performed an initial ingestion, and there are jobs that are displayed in the XDMoD interface.

Please advise on why summarize_jobs.py is not processing the jobs and how to resolve this issue.

mauw10 commented 3 months ago

Additionally, the PCP files are created and contain information. For instance, when I query the PCP log file for job--end-20240625.13.56.37.0:

sysadmin@mdrvpremst01:/data/clusterbioproves/pmlogger/2024/06/mdrvpremst01/2024-06-25$ pmdumplog -a job--end-20240625.13.56.37.0
:
60.1.15 (mem.util.inactive): value 1588336
60.1.14 (mem.util.active): value 928364
60.1.13 (mem.util.swapCached): value 0
60.1.12 (mem.util.other): value 1477032
60.1.11 (hinv.pagesize): value 4096
60.1.10 (mem.freemem): value 360708
60.1.9 (hinv.physmem): value 3906
60.1.8 (swap.free): value 4088393728
60.1.5 (mem.util.cached): value 1800244
60.1.4 (mem.util.bufmem): value 362284
60.1.3 (mem.util.shared): No values returned!
60.1.2 (mem.util.free): value 360708
60.1.1 (mem.util.used): value 3639560
60.1.0 (mem.physmem): value 4000268
60.0.75 (disk.all.write_rawactive): value 117145270
60.0.74 (disk.all.read_rawactive): value 45351
60.0.73 (disk.dev.write_rawactive): inst [0 or "sda"] value 117145270
60.0.72 (disk.dev.read_rawactive): inst [0 or "sda"] value 45351
60.0.57 (kernel.percpu.cpu.irq.hard): inst [0 or "cpu0"] value 0
60.0.56 (kernel.percpu.cpu.irq.soft): inst [0 or "cpu0"] value 5055140
60.0.54 (kernel.all.cpu.irq.hard): value 0
60.0.53 (kernel.all.cpu.irq.soft): value 5055140
60.0.52 (disk.all.write_merge): value 954208
60.0.51 (disk.all.read_merge): value 8792
60.0.50 (disk.dev.write_merge): inst [0 or "sda"] value 954208
60.0.49 (disk.dev.read_merge): inst [0 or "sda"] value 8792
60.0.47 (disk.dev.aveq): inst [0 or "sda"] value 119078932
60.0.46 (disk.dev.avactive): inst [0 or "sda"] value 7795644
60.0.45 (disk.all.aveq): value 119078932
60.0.44 (disk.all.avactive): value 7795644
60.0.42 (disk.all.write_bytes): value 17907401
60.0.41 (disk.all.read_bytes): value 812463
60.0.39 (disk.dev.write_bytes): inst [0 or "sda"] value 17907401
60.0.38 (disk.dev.read_bytes): inst [0 or "sda"] value 812463
60.0.35 (kernel.all.cpu.wait.total): value 5538350
60.0.34 (kernel.all.cpu.intr): value 5055140
60.0.33 (hinv.ndisk): value 1
60.0.32 (hinv.ncpu): value 1
60.0.31 (kernel.percpu.cpu.intr): inst [0 or "cpu0"] value 5055140
60.0.30 (kernel.percpu.cpu.wait.total): inst [0 or "cpu0"] value 5538350
60.0.29 (disk.all.total): value 2046929
60.0.28 (disk.dev.total): inst [0 or "sda"] value 2046929
60.0.25 (disk.all.write): value 2012993
60.0.24 (disk.all.read): value 33936
60.0.23 (kernel.all.cpu.idle): value 1032698290
60.0.22 (kernel.all.cpu.sys): value 2384580
60.0.21 (kernel.all.cpu.nice): value 71920
60.0.20 (kernel.all.cpu.user): value 3572700
60.0.9 (swap.pagesout): value 0
60.0.8 (swap.pagesin): value 0
60.0.5 (disk.dev.write): inst [0 or "sda"] value 2012993
60.0.4 (disk.dev.read): inst [0 or "sda"] value 33936
60.0.3 (kernel.percpu.cpu.idle): inst [0 or "cpu0"] value 1032698290
60.0.2 (kernel.percpu.cpu.sys): inst [0 or "cpu0"] value 2384580
60.0.1 (kernel.percpu.cpu.nice): inst [0 or "cpu0"] value 71920
60.0.0 (kernel.percpu.cpu.user): inst [0 or "cpu0"] value 3572700

[256 bytes]
13:56:37.156456 5 metrics
2.3.3 (pmcd.pmlogger.host): inst [483049 or "483049"] value "mdrvpremst01.vhio.org"
2.3.0 (pmcd.pmlogger.port): inst [483049 or "483049"] value 4331
2.3.2 (pmcd.pmlogger.archive): inst [483049 or "483049"] value "/data/clusterbioproves/pmlogger/2024/06/mdrvpremst01/2024-06-25/job--end-20240625.13.56.37"
2.0.23 (pmcd.pid): value 474114
2.0.24 (pmcd.seqnum): value 14

This PCP log provides detailed metrics related to system performance and resource utilization during the specified job end time.
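As a quick sanity check that the archive holds usable values, the single-valued metrics can be pulled out of the `pmdumplog` text and compared. This is a minimal Python 3 sketch that assumes the `PMID (name): value N` line layout shown in the dump above. The `singular_values` helper is hypothetical and does not use the PCP Python bindings. The sample is a hand-picked subset of the dump.

```python
import re

# Lines copied from the pmdumplog output in this post.
SAMPLE = """\
60.1.3 (mem.util.shared): No values returned!
60.1.2 (mem.util.free): value 360708
60.1.1 (mem.util.used): value 3639560
60.1.0 (mem.physmem): value 4000268
60.0.73 (disk.dev.write_rawactive): inst [0 or "sda"] value 117145270
60.0.22 (kernel.all.cpu.sys): value 2384580
60.0.20 (kernel.all.cpu.user): value 3572700
"""

# Matches only metrics with a single integer value; per-instance lines
# ("inst [...] value N") and "No values returned!" lines are ignored.
LINE = re.compile(r"^\s*\d+\.\d+\.\d+ \((?P<name>[\w.]+)\): value (?P<value>-?\d+)$")


def singular_values(text):
    """Map metric name to integer value for single-valued metric lines."""
    values = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            values[m.group("name")] = int(m.group("value"))
    return values


vals = singular_values(SAMPLE)
used_fraction = vals["mem.util.used"] / vals["mem.physmem"]
print("metrics parsed:", sorted(vals))
print("used + free == physmem:",
      vals["mem.util.used"] + vals["mem.util.free"] == vals["mem.physmem"])
print("memory used fraction: %.3f" % used_fraction)
```

In this sample, `mem.util.used` plus `mem.util.free` adds up exactly to `mem.physmem`, so the memory values in the archive are at least internally consistent.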