DIRACGrid / DIRAC


Gigantic scores of DB12 recorded by DIRAC Pilots (from 50 up to 2250) #7760

Open IgorPelevanyuk opened 1 month ago

IgorPelevanyuk commented 1 month ago

Hello,

Recently I noticed unrealistically large values of the DB12 benchmark. Normally it should be somewhere between 0 and 45, but I noticed that during this summer we had pilots whose jobs recorded values up to 2250.

So far it is observed only on one computing resource, running AlmaLinux with CPU model Intel Gold 6242R. Here is a zoomed view from our local monitoring: every dot is a job, the color is the CPU model, X is the DB12 value, Y is the job walltime (see the attached image). I have checked the databases:

mysql> use JobDB;
mysql> select JobID, CONVERT(Value USING utf8) from JobParameters where Name="CPUNormalizationFactor" and CONVERT(Value USING utf8)>45;
+---------+---------------------------+
| JobID   | CONVERT(Value USING utf8) |
+---------+---------------------------+
| 4298295 | 60.7                      |
| 4298419 | 45.8                      |
| 4298478 | 48.8                      |
| 4298494 | 48.8                      |
+---------+---------------------------+
4 rows in set (0.48 sec)
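
For reference, a minimal sketch of the same check done from Python rather than the mysql client (it assumes the mysql-connector-python package; the host and credentials below are placeholders, not the real JobDB settings):

# Hypothetical reproduction of the check above from Python; host, user and
# password are placeholders and must be replaced with the real JobDB settings.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="Dirac", password="...", database="JobDB")
cur = conn.cursor()
cur.execute(
    "SELECT JobID, CONVERT(Value USING utf8) FROM JobParameters "
    "WHERE Name = 'CPUNormalizationFactor' AND CONVERT(Value USING utf8) > 45"
)
for job_id, db12 in cur:
    print(job_id, db12)
conn.close()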

Unfortunately, the jobs with the craziest DB12 values have already been cleaned from DIRAC, so only recent jobs with a wrong DB12 benchmark are left. But I was able to grep the DIRAC Pilot logs on the computing resource to get this:

2024-06-28T08:38:56.134271Z INFO [ConfigureCPURequirements] Command ConfigureCPURequirements instantiated from pilotCommands
2024-06-28T08:38:56.135707Z INFO [ConfigureCPURequirements] Executing command dirac-wms-cpu-normalization -U  -R pilot.cfg --cfg pilot.cfg -d
Computing benchmark measurements on 1 processor(s)...
Applying a correction on the CPU power: 1.0
Estimated CPU power is 2272.7 HS06
Configuration file dumped 'pilot.cfg'

2024-06-28T08:39:17.640561Z INFO [ConfigureCPURequirements] Current normalized CPU as determined by 'dirac-wms-cpu-normalization' is 2272.700000

Has anybody seen this before?

IgorPelevanyuk commented 1 month ago

I also have to note that the worker nodes are virtualized with KVM. I did some small research on this problem. The benchmark runs from the CVMFS version of DIRAC, and sometimes, when CVMFS is doing something as a process on the worker node, the benchmark processes start to consume more than 100% of CPU in htop. I created a simple script that runs 40 dirac-wms-cpu-normalization processes in the background and executed it several times in a row. The first screenshot shows the previous normalization results (left) and the htop monitoring of the current script run (screenshot 2024-08-23 14-31-18).

The second screenshot shows the results of the 40 dirac-wms-cpu-normalization runs obtained after the test (screenshot 2024-08-23 14-31-24).

As you can see, the normal benchmark value is within the 35-37 range. When "something" happens, it may grow up to 55, and even higher in the longer tests done by real DIRAC pilots.
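
For reference, a rough sketch of such a load test in Python (my own approximation, not the exact script used above; it assumes dirac-wms-cpu-normalization is on PATH and prints the "Estimated CPU power is ... HS06" line seen in the pilot log):

# Start N dirac-wms-cpu-normalization processes in parallel and collect the
# reported CPU power from each. The pilot invokes the command with
# "-U -R pilot.cfg --cfg pilot.cfg -d"; here it is run bare, which may need
# adjusting for a given installation.
import re
import subprocess

N = 40
procs = [
    subprocess.Popen(
        ["dirac-wms-cpu-normalization"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    for _ in range(N)
]
for i, proc in enumerate(procs, start=1):
    out, _ = proc.communicate()
    match = re.search(r"Estimated CPU power is ([\d.]+) HS06", out)
    print(i, match.group(1) if match else "no value found")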

IgorPelevanyuk commented 1 month ago

I also created a test version of the DB12 benchmark to investigate this, and so far it looks like the os.times() function returns wrong values.

The norm is calculated as NORM = 250 / CPU, i.e. the calibration constant divided by the CPU time consumed by the benchmark workload.

To record the times, the os.times() function is used. Start values are recorded at the beginning of the benchmark workload iterations, end values after the workload. In the following listing, CPU represents exactly that: (end u+s+cu+cs) - (start u+s+cu+cs). I was able to record some really strange values:

7 CPU: 9.65     WALL: 12.94     NORM: 25.91     ST: u: 7.14 s: 0.01 cu:0.0 cs:0.0 E: u: 16.79 s: 0.01 cu:0.0 cs:0.0
10 CPU: 9.19    WALL: 13.21     NORM: 27.20     ST: u: 7.18 s: 0.01 cu:0.0 cs:0.0 E: u: 16.37 s: 0.01 cu:0.0 cs:0.0
20 CPU: 7.13    WALL: 13.63     NORM: 35.06     ST: u: 7.11 s: 0.01 cu:0.0 cs:0.0 E: u: 14.1 s: 0.15 cu:0.0 cs:0.0
 4 CPU: 6.66    WALL: 12.87     NORM: 37.54     ST: u: 7.26 s: 0.01 cu:0.0 cs:0.0 E: u: 13.92 s: 0.01 cu:0.0 cs:0.0
 3 CPU: 8.71    WALL: 13.29     NORM: 28.70     ST: u: 7.12 s: 0.0 cu:0.0 cs:0.0 E: u: 15.83 s: 0.0 cu:0.0 cs:0.0
 1 CPU: 6.79    WALL: 13.04     NORM: 36.82     ST: u: 7.22 s: 0.01 cu:0.0 cs:0.0 E: u: 14.01 s: 0.01 cu:0.0 cs:0.0
31 CPU: 4.75    WALL: 13.82     NORM: 52.63     ST: u: 7.15 s: 0.0 cu:0.0 cs:0.0 E: u: 11.9 s: 0.0 cu:0.0 cs:0.0
17 CPU: 8.74    WALL: 12.91     NORM: 28.60     ST: u: 6.63 s: 0.0 cu:0.0 cs:0.0 E: u: 15.37 s: 0.0 cu:0.0 cs:0.0
19 CPU: 6.92    WALL: 13.51     NORM: 36.13     ST: u: 7.51 s: 0.01 cu:0.0 cs:0.0 E: u: 14.42 s: 0.02 cu:0.0 cs:0.0
39 CPU: 3.92    WALL: 13.40     NORM: 63.78     ST: u: 7.24 s: 0.01 cu:0.0 cs:0.0 E: u: 11.16 s: 0.01 cu:0.0 cs:0.0
22 CPU: 4.09    WALL: 14.45     NORM: 61.12     ST: u: 7.14 s: 0.0 cu:0.0 cs:0.0 E: u: 11.23 s: 0.0 cu:0.0 cs:0.0
24 CPU: 7.44    WALL: 12.40     NORM: 33.60     ST: u: 7.08 s: 0.49 cu:0.0 cs:0.0 E: u: 14.51 s: 0.5 cu:0.0 cs:0.0
23 CPU: 2.45    WALL: 12.56     NORM: 102.04    ST: u: 7.31 s: 0.01 cu:0.0 cs:0.0 E: u: 9.76 s: 0.01 cu:0.0 cs:0.0
35 CPU: 2.93    WALL: 12.25     NORM: 85.32     ST: u: 7.56 s: 0.03 cu:0.0 cs:0.0 E: u: 10.49 s: 0.03 cu:0.0 cs:0.0
32 CPU: 12.03   WALL: 12.41     NORM: 20.78     ST: u: 7.38 s: 0.01 cu:0.0 cs:0.0 E: u: 19.41 s: 0.01 cu:0.0 cs:0.0
21 CPU: 6.72    WALL: 12.76     NORM: 37.20     ST: u: 6.89 s: 0.0 cu:0.0 cs:0.0 E: u: 13.61 s: 0.0 cu:0.0 cs:0.0
 5 CPU: 2.70    WALL: 13.76     NORM: 92.59     ST: u: 7.52 s: 0.01 cu:0.0 cs:0.0 E: u: 10.22 s: 0.01 cu:0.0 cs:0.0
26 CPU: 6.89    WALL: 12.13     NORM: 36.28     ST: u: 7.36 s: 0.01 cu:0.0 cs:0.0 E: u: 14.25 s: 0.01 cu:0.0 cs:0.0
12 CPU: 2.45    WALL: 15.09     NORM: 102.04    ST: u: 6.73 s: 0.01 cu:0.0 cs:0.0 E: u: 9.18 s: 0.01 cu:0.0 cs:0.0
38 CPU: 1.69    WALL: 12.61     NORM: 147.93    ST: u: 7.18 s: 0.0 cu:0.0 cs:0.0 E: u: 8.87 s: 0.0 cu:0.0 cs:0.0
11 CPU: 6.73    WALL: 12.72     NORM: 37.15     ST: u: 7.53 s: 0.0 cu:0.0 cs:0.0 E: u: 14.25 s: 0.01 cu:0.0 cs:0.0
33 CPU: 6.84    WALL: 12.10     NORM: 36.55     ST: u: 6.38 s: 0.0 cu:0.0 cs:0.0 E: u: 13.22 s: 0.0 cu:0.0 cs:0.0
36 CPU: 6.74    WALL: 12.25     NORM: 37.09     ST: u: 7.42 s: 0.01 cu:0.0 cs:0.0 E: u: 14.16 s: 0.01 cu:0.0 cs:0.0
37 CPU: 6.66    WALL: 12.16     NORM: 37.54     ST: u: 7.44 s: 0.01 cu:0.0 cs:0.0 E: u: 14.1 s: 0.01 cu:0.0 cs:0.0
 6 CPU: 6.53    WALL: 12.13     NORM: 38.28     ST: u: 7.37 s: 0.0 cu:0.0 cs:0.0 E: u: 13.9 s: 0.0 cu:0.0 cs:0.0
30 CPU: 6.62    WALL: 11.91     NORM: 37.76     ST: u: 6.87 s: 0.01 cu:0.0 cs:0.0 E: u: 13.49 s: 0.01 cu:0.0 cs:0.0
34 CPU: 12.15   WALL: 13.94     NORM: 20.58     ST: u: 6.9 s: 0.0 cu:0.0 cs:0.0 E: u: 19.04 s: 0.01 cu:0.0 cs:0.0
25 CPU: 6.76    WALL: 12.28     NORM: 36.98     ST: u: 6.85 s: 0.0 cu:0.0 cs:0.0 E: u: 13.61 s: 0.0 cu:0.0 cs:0.0
13 CPU: 3.49    WALL: 6.64      NORM: 71.63     ST: u: 12.65 s: 0.01 cu:0.0 cs:0.0 E: u: 16.14 s: 0.01 cu:0.0 cs:0.0
 9 CPU: 10.24   WALL: 17.11     NORM: 24.41     ST: u: 7.19 s: 0.01 cu:0.0 cs:0.0 E: u: 17.43 s: 0.01 cu:0.0 cs:0.0
40 CPU: 12.53   WALL: 16.63     NORM: 19.95     ST: u: 7.47 s: 0.0 cu:0.0 cs:0.0 E: u: 19.99 s: 0.01 cu:0.0 cs:0.0
28 CPU: 3.93    WALL: 6.94      NORM: 63.61     ST: u: 10.16 s: 0.38 cu:0.0 cs:0.0 E: u: 14.09 s: 0.38 cu:0.0 cs:0.0
18 CPU: 4.12    WALL: 6.76      NORM: 60.68     ST: u: 6.86 s: 0.0 cu:0.0 cs:0.0 E: u: 10.98 s: 0.0 cu:0.0 cs:0.0
15 CPU: 2.18    WALL: 7.31      NORM: 114.68    ST: u: 7.37 s: 0.0 cu:0.0 cs:0.0 E: u: 9.55 s: 0.0 cu:0.0 cs:0.0
16 CPU: 9.85    WALL: 17.35     NORM: 25.38     ST: u: 7.2 s: 0.01 cu:0.0 cs:0.0 E: u: 17.05 s: 0.01 cu:0.0 cs:0.0
 8 CPU: 5.07    WALL: 6.71      NORM: 49.31     ST: u: 9.48 s: 0.01 cu:0.0 cs:0.0 E: u: 14.55 s: 0.01 cu:0.0 cs:0.0
14 CPU: 5.66    WALL: 8.05      NORM: 44.17     ST: u: 6.68 s: 0.01 cu:0.0 cs:0.0 E: u: 12.34 s: 0.01 cu:0.0 cs:0.0
 2 CPU: 6.11    WALL: 6.82      NORM: 40.92     ST: u: 7.66 s: 0.01 cu:0.0 cs:0.0 E: u: 13.77 s: 0.01 cu:0.0 cs:0.0
29 CPU: 6.15    WALL: 6.37      NORM: 40.65     ST: u: 12.18 s: 0.01 cu:0.0 cs:0.0 E: u: 18.33 s: 0.01 cu:0.0 cs:0.0
27 CPU: 5.55    WALL: 6.31      NORM: 45.05     ST: u: 11.84 s: 0.01 cu:0.0 cs:0.0 E: u: 17.39 s: 0.01 cu:0.0 cs:0.0

This is the listing when the estimations are done sequentially:

  1 CPU: 6.29    WALL: 6.31      NORM: 39.75     ST: u: 6.24 s: 0.0 cu:0.0 cs:0.0 E: u: 12.53 s: 0.0 cu:0.0 cs:0.0
 2 CPU: 6.29    WALL: 6.31      NORM: 39.75     ST: u: 6.26 s: 0.0 cu:0.0 cs:0.0 E: u: 12.55 s: 0.0 cu:0.0 cs:0.0
 3 CPU: 6.42    WALL: 6.44      NORM: 38.94     ST: u: 6.21 s: 0.0 cu:0.0 cs:0.0 E: u: 12.63 s: 0.0 cu:0.0 cs:0.0
 4 CPU: 6.21    WALL: 6.24      NORM: 40.26     ST: u: 6.2 s: 0.0 cu:0.0 cs:0.0 E: u: 12.41 s: 0.0 cu:0.0 cs:0.0
 5 CPU: 6.21    WALL: 6.23      NORM: 40.26     ST: u: 6.18 s: 0.0 cu:0.0 cs:0.0 E: u: 12.39 s: 0.0 cu:0.0 cs:0.0
 6 CPU: 6.23    WALL: 6.25      NORM: 40.13     ST: u: 6.43 s: 0.0 cu:0.0 cs:0.0 E: u: 12.66 s: 0.0 cu:0.0 cs:0.0
 7 CPU: 6.24    WALL: 6.27      NORM: 40.06     ST: u: 6.25 s: 0.0 cu:0.0 cs:0.0 E: u: 12.49 s: 0.0 cu:0.0 cs:0.0
 8 CPU: 6.29    WALL: 6.31      NORM: 39.75     ST: u: 6.3 s: 0.0 cu:0.0 cs:0.0 E: u: 12.59 s: 0.0 cu:0.0 cs:0.0
 9 CPU: 6.24    WALL: 6.26      NORM: 40.06     ST: u: 6.2 s: 0.0 cu:0.0 cs:0.0 E: u: 12.44 s: 0.0 cu:0.0 cs:0.0
10 CPU: 6.23    WALL: 6.25      NORM: 40.13     ST: u: 6.2 s: 0.0 cu:0.0 cs:0.0 E: u: 12.43 s: 0.0 cu:0.0 cs:0.0
11 CPU: 6.24    WALL: 6.27      NORM: 40.06     ST: u: 6.26 s: 0.0 cu:0.0 cs:0.0 E: u: 12.5 s: 0.0 cu:0.0 cs:0.0
12 CPU: 6.25    WALL: 6.26      NORM: 40.00     ST: u: 6.22 s: 0.0 cu:0.0 cs:0.0 E: u: 12.47 s: 0.0 cu:0.0 cs:0.0

Normally CPU is around 6.25, but for the biggest benchmark values in the first listing the CPU value can be around 2. That would mean the benchmark workload was executed 3 times faster than it should have been, which is impossible. I believe this situation occurs only when there are more processes than cores; as shown in the screenshot in my previous post, a CVMFS process started to run alongside the benchmark tests.
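
For completeness, a simplified sketch of the measurement logic in my test version (the busy loop below is only a stand-in for the real DB12 workload; the 250 calibration constant is the one implied by the NORM values in the listings, e.g. 250 / 6.25 = 40):

# Record os.times() around a fixed CPU-bound workload and derive CPU, WALL and
# NORM as printed in the listings above. The workload is a placeholder, not the
# real DB12 kernel.
import os
import random

def timed_benchmark(calibration=250.0):
    start = os.times()
    total = 0.0
    for _ in range(5_000_000):           # placeholder CPU-bound workload
        total += random.normalvariate(10, 1)
    end = os.times()
    cpu = sum(end[:4]) - sum(start[:4])   # user + system + children user/system
    wall = end[4] - start[4]              # elapsed wall-clock time
    norm = calibration / cpu if cpu else 0.0
    print("CPU: %.2f\tWALL: %.2f\tNORM: %.2f\tST: %s E: %s"
          % (cpu, wall, norm, tuple(start[:4]), tuple(end[:4])))
    return norm

if __name__ == "__main__":
    timed_benchmark()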

It is the first time I have observed this within a DIRAC environment, and only on this one particular resource. It is counter-intuitive: benchmark results are used to calculate the normalized CPU time, but it looks like, if more than one job runs per core on some resource, the benchmark can come out higher!

I will try to make more tests on another server which is neither KVM-virtualized nor running Alma Linux.

BTW: DIRAC version v8.0.36

fstagni commented 1 month ago

Thank you @IgorPelevanyuk for the notification and the extensive tests. At a minimum, they show that there are some fixes to do. Please report back your findings and your suggestions.

EDIT: we do not have much development workforce at the moment, so you are of course welcome to take care of what will need to be done.

IgorPelevanyuk commented 1 month ago

Hello, @fstagni!

So far, I think the main culprit of the issue is ProxmoxVE on Debian. No similar issues were found on the same cluster with Libvirt on Fedora. We will try to make more tests (in about two weeks) to understand why Proxmox behaves so strangely.

It is hard to say anything about a solution yet. The issue appears only in a very specific virtual environment; I hope it can be fixed by some setting in Proxmox.

fstagni commented 2 weeks ago

Just for info, are you also using cgroups on this cluster?

IgorPelevanyuk commented 1 week ago

Slurm on this particular resource is configured to use cgroups, but the other Slurm resources use them as well.
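
In case it helps narrow this down, here is a small check (my own sketch, not a DIRAC tool) that a job could run to print the cgroup it was placed in and any CPU quota applied there; the paths follow the usual cgroup v1/v2 layout and may differ per node:

# Print the cgroup the current process belongs to and, for cgroup v2, the CPU
# quota of that cgroup ("max <period>" means no quota). On cgroup v1 the
# equivalent files are cpu.cfs_quota_us / cpu.cfs_period_us under the cpu
# controller hierarchy.
import os

with open("/proc/self/cgroup") as f:
    cgroup_info = f.read()
print(cgroup_info)

for line in cgroup_info.splitlines():
    if line.startswith("0::"):                        # cgroup v2 entry
        path = "/sys/fs/cgroup" + line.split("::", 1)[1]
        cpu_max = os.path.join(path, "cpu.max")
        if os.path.exists(cpu_max):
            with open(cpu_max) as f:
                print("cpu.max:", f.read().strip())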