Boavizta / boaviztapi

šŸ›  Giving access to BOAVIZTA reference data and methodologies through a RESTful API
GNU Affero General Public License v3.0

Provide a way to generate function based on punctual consumption profile #86

Closed da-ekchajzer closed 1 year ago

da-ekchajzer commented 2 years ago

Problem

A consumption profile is a function which associates a workload with an electrical consumption: consumption_profil(workload) = electrical_consumption

This continuous function will be generated from punctual measurements at different workloads for a given configuration. The punctual measurements could come from our own measurements or from secondary sources.

We want to provide a way to generate continuous consumption profiles (functions) from those punctual measurements. Such a process could be used to evaluate the usage impacts of devices or components.

Solution

We should set up a regression process. We call regression the process of defining a continuous relationship (a function) between workload and electrical_consumption based on punctual measurements.

The regression shouldn't be linear. From what we have seen, consumption profiles follow a logarithmic rule.

This might be a problem when only two points are given (min 0%, max 100% for instance), since we don't want a linear distribution. We could use existing consumption profiles in the regression process.

Input value

Format

We should have this type of input format:

"workload": {
  "10": 30.6,
  "20": 34,
  "56": 67,
  ...
  "<workload %>": <power_consumption in W>
}

Data example

Example for AWS server CPUs from Teads (Link):

PkgWatt (W) at each CPUStress level:

| Idle (0%) | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|
| 51 | 135 | 174 | 212 | 249 | 293 | 330 | 357 | 382 | 404 | 413 |
| 113 | 146 | 194 | 225 | 244 | 263 | 295 | 295 | 311 | 333 | 387 |
| 58 | 176 | 241 | 299 | 375 | 448 | 520 | 562 | 590 | 607 | 617 |
| 59 | 174 | 243 | 299 | 372 | 441 | 522 | 564 | 592 | 606 | 617 |
| 116 | 148 | 188 | 205 | 222 | 258 | 240 | 277 | 284 | 287 | 346 |
| 55 | 138 | 178 | 224 | 272 | 307 | 344 | 375 | 401 | 426 | 440 |
| 110 | 127 | 150 | 188 | 214 | 224 | 224 | 241 | 244 | 240 | 287 |
| 58 | 147 | 193 | 246 | 298 | 344 | 381 | 417 | 453 | 481 | 492 |
| 48 | 148 | 198 | 245 | 305 | 361 | 389 | 413 | 447 | 471 | 480 |
| 57 | 147 | 212 | 270 | 331 | 381 | 417 | 453 | 481 | 513 | 513 |
| 35 | 100 | 126 | 152 | 178 | 205 | 223 | 238 | 250 | 263 | 272 |
| 2 | 10 | 16 | 17 | 15 | 15 | 15 | 38 | 40 | 42 | 41 |
| 26 | 65 | 83 | 96 | 109 | 117 | 122 | 127 | 130 | 134 | 133 |
| 50 | 98 | 120 | 141 | 159 | 172 | 181 | 190 | 195 | 204 | 207 |
| 28 | 57 | 88 | 110 | 125 | 136 | 146 | 151 | 156 | 157 | 168 |
| 32 | 62 | 79 | 91 | 103 | 112 | 120 | 127 | 134 | 138 | 145 |
| 40 | 64 | 77 | 86 | 92 | 98 | 104 | 107 | 110 | 114 | 120 |
| 2 | 22 | 41 | 59 | 44 | 41 | 39 | 47 | 90 | 82 | 85 |
| 17 | 68 | 93 | 106 | 117 | 128 | 135 | 140 | 146 | 150 | 154 |
| 71 | 107 | 132 | 154 | 167 | 179 | 184 | 195 | 199 | 204 | 204 |
| 38 | 113 | 134 | 160 | 178 | 203 | 221 | 234 | 243 | 247 | 252 |

Example from SPECpower, aggregated by Cloud Carbon Footprint (Link):

| Architecture | Min Watts (0%) | Max Watts (100%) |
|---|---|---|
| Skylake | 0.6446044454253452 | 4.193436438541878 |
| Broadwell | 0.7128342245989304 | 3.6853275401069516 |
| Haswell | 1.9005681818181814 | 6.012910353535353 |
| EPYC 2nd Gen | 0.4742621527777778 | 1.6929615162037037 |
| Cascade Lake | 0.6389493581523519 | 3.9673047343937564 |
| EPYC 3rd Gen | 0.44538981119791665 | 2.0193277994791665 |
| Ivy Bridge | 3.0369270833333335 | 8.248611111111112 |
| Sandy Bridge | 2.1694411458333334 | 8.575357663690477 |

Output value

A function described by its coefficients
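For instance, with the logarithmic model proposed later in this thread (power_consumption(workload) = a * ln(b * (workload + c)) + d), the output could simply be the fitted coefficients. The field names and values below are purely illustrative, not a settled schema:

{
  "consumption_profile": {
    "a": 85.0,
    "b": 0.8,
    "c": 1.2,
    "d": 50.0
  }
}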

da-ekchajzer commented 2 years ago

@samuelrince I would be interested in your opinion.

samuelrince commented 2 years ago

I have dug a little deeper into the original spreadsheet. It looks like making one logarithm-like function model per CPU "model" or "family" (Platinum, Gold, Silver, etc.) could be a good idea?

Platinum

[plots: power consumption vs. workload for Xeon Platinum CPUs]

Here the red one is line 6 from the spreadsheet, corresponding to c5.metal*. I don't know what the star means here, nor why the red curve stands out that much. It looks like it should be on the other graph.

Gold

[plot: power consumption vs. workload for Xeon Gold CPUs]

Silver

[plot: power consumption vs. workload for Xeon Silver CPUs]

E, E3, E5

[plot: power consumption vs. workload for Xeon E, E3 and E5 CPUs]

@da-ekchajzer, let me know what you think of this approach. I guess if we can have the CPU model name, we will have a better estimate of what the power consumption profile should look like for new CPU models.

samuelrince commented 2 years ago

Also, I don't understand the second table; it's not the power consumption per CPU architecture, right?

| Architecture | Min Watts (0%) | Max Watts (100%) |
|---|---|---|
| Skylake | 0.6446044454253452 | 4.193436438541878 |
| Broadwell | 0.7128342245989304 | 3.6853275401069516 |
| Haswell | 1.9005681818181814 | 6.012910353535353 |
| EPYC 2nd Gen | 0.4742621527777778 | 1.6929615162037037 |
| Cascade Lake | 0.6389493581523519 | 3.9673047343937564 |
| EPYC 3rd Gen | 0.44538981119791665 | 2.0193277994791665 |
| Ivy Bridge | 3.0369270833333335 | 8.248611111111112 |
| Sandy Bridge | 2.1694411458333334 | 8.575357663690477 |
da-ekchajzer commented 2 years ago

The second table represents the average consumption of a server depending on the CPU family (also called architecture).

My idea is to generate server consumption profiles per CPU family at first, until we gather data on specific CPUs (and other components) to build consumption profiles based on more precise data (number of cores, CPU model, …).

But either way, we should come up with a generic way of generating a consumption profile from a workload object as mentioned above. As I saw in your graphs, you use a linear approach to connect each point with its successor. IMHO, this approach is limited:

When few points are given (as with the second table, for instance), the consumption profile will be an affine function.

Example for Skylake family

"workload":{
"0":0.6446044454253452,
"100": 4.193436438541878
}
consumption_profil(x) =  ((100 - 0) / (4.193436438541878 - 0.6446044454253452)) * x + 0.6446044454253452 

Yet, we know the consumption profile is not linear.

Besides, with this approach, we won't come up with a function defined by its coefficients but with a set of affine functions connecting one point to the next.

Cloud Carbon Footprint uses an affine equation (as seen above) to come up with an average watts consumption: Average Watts(x) = Min Watts + x * (Max Watts - Min Watts), where x is the average utilization between 0 and 1.
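For example, with the Skylake row of the second table: Average Watts(0.5) = 0.6446 + 0.5 * (4.1934 - 0.6446) ≈ 2.42 W.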

I think that with the AWS data from Teads and the future data we'll have, we can be more ambitious and generate more precise functions.

What do you think of using a logarithmic regression (which I am not familiar with) based on the workload object and previous consumption profiles?

Does this make sense to you?

samuelrince commented 2 years ago

I think a logarithmic function is good enough to model the CPU power consumption given the workload, and I understand that often we will only have the min (idle) and max (100%) power consumption.

The idea is to have a more precise model if we can have access to the CPU model. Specifically, in Intel's case, if we know the CPU is a Xeon Platinum, Xeon Gold, or Xeon Silver, we can use a different base model to compute the "Final model", based on the min and max power consumption.

Here is an example:

We receive from the API both the CPU model and the workload, as follows:

{
  "cpu": {
    "model": "Intel Xeon Platinum 8124M"
  },
  "workload": {
    "0": 51,     <<< Power consumption in idle state (in W)
    "100": 413   <<< Power consumption at 100% load (in W)
  }
}

(Maybe not the actual JSON fields here.)

Given that we know it is a Xeon Platinum CPU, we can use a more precise model previously fitted on Xeon Platinum CPU data only; see the following:

[plot: "Platinum model" (white curve) fitted over the Xeon Platinum power consumption curves]

Here the white curve called "Platinum model" is a power consumption model inferred from all power consumption curves for Xeon Platinum CPUs.

We can then build a second model called the "Final model" using the "Platinum model" and the min and max power consumption values. We obtain the following model, in pink:

[plot: "Final model" (pink curve) built from the Platinum model and the min/max points]

(The blue curve is the actual CPU power consumption model.)

In the case where we don't have the CPU model but only the min/max workload, we can use a default model (still a log function) built from the whole power consumption dataset. This method will give less precise values, but still better than an affine function.

The log function I use to fit the data is:

power_consumption(workload) = a * ln(b * (workload + c)) + d

I hope my idea is clearer now. Let me know if you think it can be useful or if it is totally overkill.
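A minimal sketch of this "Final model" step with scipy.optimize.curve_fit, assuming the base coefficients are already known. The coefficient values, and the choice to freeze b and c so that a two-point fit is well-posed, are illustrative assumptions, not necessarily what the notebook does:

import numpy as np
from scipy.optimize import curve_fit

def log_model(workload, a, b, c, d):
    # power_consumption(workload) = a * ln(b * (workload + c)) + d
    return a * np.log(b * (workload + c)) + d

# Base "Platinum model" coefficients, assumed already fitted on the
# Platinum curves; these values are purely illustrative.
base_a, base_b, base_c, base_d = 85.0, 0.8, 1.2, 50.0

# With only two points (idle and 100%), fitting all four parameters is
# under-determined, so one option is to keep the base shape (b, c)
# fixed and re-fit only the scale a and the offset d.
def shifted_model(workload, a, d):
    return a * np.log(base_b * (workload + base_c)) + d

x = np.array([0.0, 100.0])
y = np.array([51.0, 413.0])  # measured idle and max power (W)
(final_a, final_d), _ = curve_fit(shifted_model, x, y, p0=[base_a, base_d])
print(final_a, base_b, base_c, final_d)  # coefficients of the final model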

da-ekchajzer commented 2 years ago

It is exactly what I was thinking but couldn't explain it so clearly. Thank you.

I think we should work with CPU family (architecture) / core number rather than the commercial naming (Xeon, …) for several reasons:

Could you explain the process with the equations to ease the implementation part? For example, how do you define a, b, c, d in your equation? Also, could we apply this mechanism when more than 2 values are given (0%, 50%, 100% for instance)? Does it make sense?

Process summary

1 - Input data

{
  "cpu": {
    "family": "skylake",
    "nb_core": 8
  },
  "workload": {
    "0": 51,     <<< Power consumption in idle state (in W)
    "50": 293,   <<< Power consumption at 50% load (in W)
    "100": 413   <<< Power consumption at 100% load (in W)
  }
}

2 - Look for an equivalent consumption profile

If one exists: search for an equivalent consumption profile with the same family and core_number.
Else, if one exists: search for an equivalent consumption profile with the same family.
Else: use the default consumption profile and go to (4).
(A sketch of this lookup is given after the summary.)

3 - Infer the consumption profile for the current type of CPU

ā‡’ What magic are you doing here ?

4 - Generate the consumption profile equation from 1) the inferred curve and 2) the input data

⇒ What magic are you doing here?

power_consumption(workload) = a * ln(b * (workload + c)) + d
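As a minimal sketch of the lookup in step 2: the registries, key names, and coefficient values below are hypothetical placeholders for illustration only:

# Hypothetical registries of already-fitted (a, b, c, d) coefficients;
# all names and values here are illustrative placeholders.
PROFILES_BY_FAMILY_AND_CORES = {("skylake", 8): (85.0, 0.8, 1.2, 50.0)}
PROFILES_BY_FAMILY = {"skylake": (80.0, 1.0, 1.0, 55.0)}
DEFAULT_PROFILE = (75.0, 1.0, 1.0, 60.0)

def lookup_base_profile(family, nb_core):
    # Most specific match first (family + core count), then family
    # alone, then the default profile, as described in step 2.
    if (family, nb_core) in PROFILES_BY_FAMILY_AND_CORES:
        return PROFILES_BY_FAMILY_AND_CORES[(family, nb_core)]
    if family in PROFILES_BY_FAMILY:
        return PROFILES_BY_FAMILY[family]
    return DEFAULT_PROFILE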

samuelrince commented 2 years ago

Could you explain the process with the equations to ease the implementation part? For example, how do you define a, b, c, d in your equation?

The implementation is really easy: it just uses the scipy.optimize.curve_fit function to create all the previous models. Basically, it is an optimization problem where we try to fit a function (power_consumption(workload) = a * ln(b * (workload + c)) + d) to some data points. If we have multiple data points, we can just fit one model per CPU and then merge all the models into one by averaging the parameters (a, b, c, d) of the models. I have only set a few constraints on the parameters.

The optimization done in curve_fit to find a, b, c and d is a least-squares approximation.
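A minimal sketch of that per-CPU fit and merge. The two curves are rows from the Teads table above; averaging the coefficients follows the description, not necessarily the notebook's exact code:

import numpy as np
from scipy.optimize import curve_fit

def log_model(workload, a, b, c, d):
    # power_consumption(workload) = a * ln(b * (workload + c)) + d
    return a * np.log(b * (workload + c)) + d

workloads = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
# Two Xeon Platinum curves taken from the Teads table above.
curves = [
    np.array([51, 135, 174, 212, 249, 293, 330, 357, 382, 404, 413]),
    np.array([58, 176, 241, 299, 375, 448, 520, 562, 590, 607, 617]),
]

# Fit one model per CPU, then merge by averaging (a, b, c, d).
per_cpu = [curve_fit(log_model, workloads, y, p0=[100, 1, 1, 0], maxfev=10000)[0]
           for y in curves]
merged = np.mean(per_cpu, axis=0)  # coefficients of the merged model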

I can provide you with the POC in a notebook if you want (I have to clean it up a bit first).

Also, could we apply this mechanism when more than 2 values are given (0%, 50%, 100% for instance)? Does it make sense?

The optimization process described above can work with 2 or more values. With more values, we can expect higher precision. Depending on the number of data points we have, it can be useful to start the optimization process from a base model (like the Platinum model); that way, we start with parameters that are already defined and we can just "try to shift the curve" until it meets the min and max workload, for instance. But I think that if we have 3 or more data points as input, we don't need that first step, as the model we try to fit is very simple and regular.

I think we should work with CPU family (architecture) / core number rather than the commercial naming (Xeon, …) for several reasons: ...

I have tried to put the family (or architecture?) next to each CPU; tell me if you see an error, but I think it is OK. It gives me this:

| CPU model | CPU family |
|---|---|
| Intel Xeon E-2278G | Coffee Lake |
| Intel Xeon E3 1240v6 | Sandy Bridge |
| Intel Xeon E5-2660 | Sandy Bridge |
| Intel Xeon E5-2686 v4 | Broadwell |
| Intel Xeon Gold 5120 | Skylake |
| Intel Xeon Gold 5218 | Cascade Lake |
| Intel Xeon Gold 6230R | Cascade Lake |
| Intel Xeon Platinum 8124M | Skylake |
| Intel Xeon Platinum 8151 | Skylake |
| Intel Xeon Platinum 8175M | Skylake |
| Intel Xeon Platinum 8252C | Cascade Lake |
| Intel Xeon Platinum 8259CL | Cascade Lake |
| Intel Xeon Platinum 8275CL | Cascade Lake |
| Intel Xeon Silver 4110 | Skylake |
| Intel Xeon Silver 4114 | Skylake |
| Intel Xeon Silver 4210R | Cascade Lake |
| Intel Xeon Silver 4214 | Cascade Lake |

(I've removed the ones with * for now because I don't understand why they look so weird on the graphs...)

Given that classification I can plot all CPU power consumption curves per family.

[plots: power consumption curves per CPU family (Skylake, Cascade Lake, Sandy Bridge)]

There is only one CPU for each of Coffee Lake and Broadwell, so I haven't plotted them.

You can see that on each graph we clearly have different CPU profiles even though they are from the same family/architecture. And on the first two graphs, we can see that the Platinum ones are always close together at the top, then the Gold ones, and then the Silver ones at the bottom.

That is why I first grouped them by CPU "model" (Platinum, Gold, Silver, E3, E5, E): when you plot them together, their profiles look very similar even though they are not from the same family/architecture or launch year.

If we take into account the number of CPU cores in addition to the CPU architecture, it is still not really satisfying:

[plots: power consumption curves per CPU family and core count]

(The number after the full CPU name is the number of cores, e.g. "Intel Xeon Platinum 8275CL 24" means 24 cores.)

You have CPUs with fewer cores above CPUs with more cores, and vice versa.

Let me know what you think; maybe it is a subject to discuss in a meeting? But in the end, at this stage, I am only convinced by grouping CPUs by their "model". Of course, if we have more data, we can then consider architecture and number of cores, but only within the same CPU model group.

da-ekchajzer commented 2 years ago

Thank you for the explanations.

From your work, it seems very clear that the CPU model is the best strategy. As you mentioned, it would be nice to find data on other CPUs (AMD, for instance) to validate this.

@github-benjamin-davy, since this strategy is based on your data, I think your opinion would be valuable.

I think a Jupyter notebook is a good input for the implementation if you can provide it.

I thought we could begin to implement it as a route (POST /cpu/consumption_profil) which takes a CPU object and a workload object and returns the coefficients a, b, c of the function.

The usage of the consumption profile will be implemented in #87 and #88
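To illustrate, here is a minimal sketch of what such a route could look like, assuming a FastAPI-style router; the schema, names, and the direct fit (which assumes at least four workload points; with fewer, a base profile would seed the fit as discussed above) are illustrative, not the actual implementation:

from typing import Dict, Optional
import numpy as np
from fastapi import APIRouter
from pydantic import BaseModel
from scipy.optimize import curve_fit

router = APIRouter()

class CPU(BaseModel):
    family: Optional[str] = None
    nb_core: Optional[int] = None

class ConsumptionProfileRequest(BaseModel):
    cpu: CPU
    workload: Dict[str, float]  # {"workload %": power in W}, e.g. {"0": 51, "100": 413}

def log_model(workload, a, b, c, d):
    # power_consumption(workload) = a * ln(b * (workload + c)) + d
    return a * np.log(b * (workload + c)) + d

@router.post("/cpu/consumption_profil")
def consumption_profil(req: ConsumptionProfileRequest):
    items = sorted((float(k), v) for k, v in req.workload.items())
    x = np.array([k for k, _ in items])
    y = np.array([v for _, v in items])
    # Illustrative direct fit; the real process would first pick a base
    # profile from req.cpu, as described earlier in the thread.
    (a, b, c, d), _ = curve_fit(log_model, x, y, p0=[100, 1, 1, 0], maxfev=10000)
    return {"a": a, "b": b, "c": c, "d": d}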

da-ekchajzer commented 2 years ago

It makes me think that we should add a model attribute to the CPU.

I will modify #82 to make it possible to complete family and model from cpu name.

github-benjamin-davy commented 2 years ago

Hello here, I'll try to catch up on the discussion,

@samuelrince the * is simply used as a way to exclude some lines from the VLOOKUP in the spreadsheet (some lines refer to underclocked machines).

I would say that the most essential characteristic of a CPU is its TDP, which should most of the time be close to the max consumption, from what I've experienced (however, two CPUs with the same TDP might not behave exactly the same). As you have seen, CPUs from the same family can have very different power consumption for the same number of cores (it depends on voltage & frequency).

samuelrince commented 2 years ago

Hey @da-ekchajzer, you can take a look at this notebook for a working implementation.

POC_cpu_workload_power_consumption.zip

da-ekchajzer commented 1 year ago

Implemented as a router for CPU in https://github.com/Boavizta/boaviztapi/pull/113