Boavizta / boaviztapi

šŸ›  Giving access to BOAVIZTA reference data and methodologies through a RESTful API
GNU Affero General Public License v3.0

Provide a way to generate function based on punctual consumption profile #86

Closed da-ekchajzer closed 1 year ago

da-ekchajzer commented 2 years ago

Problem

A consumption profile is a function which associates a workload with an electrical consumption: consumption_profil(workload) = electrical_consumption

This continuous function will be generated from punctual measurements at different workloads for a given configuration. The punctual measurements could come from our own measurements or from secondary sources.

We want to provide a way to generate continuous consumption profiles (functions) from those punctual measurements. Such a process could be used to evaluate the usage impacts of devices or components.

Solution

We should set up a regression process. We call regression the process of defining a continuous relationship (a function) between workload and electrical_consumption based on punctual measurements.

The regression shouldn't be linear. From what we have seen, consumption profiles follow a logarithmic rule.

This might be a problem when only two points are given (min 0%, max 100% for instance), since we don't want a linear distribution. We could use existing consumption profiles in the regression process.

Input value

Format

We should have this type of input format:

"workload": {
  "10": 30.6,
  "20": 34,
  "56": 67,
  ...
  "<workload %>": <power_consumption in W>
}

Data example

Example for AWS server CPUs from Teads (Link):

PkgWatt (W) at each CPUStress level:

| Idle (0%) | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|
| 51 | 135 | 174 | 212 | 249 | 293 | 330 | 357 | 382 | 404 | 413 |
| 113 | 146 | 194 | 225 | 244 | 263 | 295 | 295 | 311 | 333 | 387 |
| 58 | 176 | 241 | 299 | 375 | 448 | 520 | 562 | 590 | 607 | 617 |
| 59 | 174 | 243 | 299 | 372 | 441 | 522 | 564 | 592 | 606 | 617 |
| 116 | 148 | 188 | 205 | 222 | 258 | 240 | 277 | 284 | 287 | 346 |
| 55 | 138 | 178 | 224 | 272 | 307 | 344 | 375 | 401 | 426 | 440 |
| 110 | 127 | 150 | 188 | 214 | 224 | 224 | 241 | 244 | 240 | 287 |
| 58 | 147 | 193 | 246 | 298 | 344 | 381 | 417 | 453 | 481 | 492 |
| 48 | 148 | 198 | 245 | 305 | 361 | 389 | 413 | 447 | 471 | 480 |
| 57 | 147 | 212 | 270 | 331 | 381 | 417 | 453 | 481 | 513 | 513 |
| 35 | 100 | 126 | 152 | 178 | 205 | 223 | 238 | 250 | 263 | 272 |
| 2 | 10 | 16 | 17 | 15 | 15 | 15 | 38 | 40 | 42 | 41 |
| 26 | 65 | 83 | 96 | 109 | 117 | 122 | 127 | 130 | 134 | 133 |
| 50 | 98 | 120 | 141 | 159 | 172 | 181 | 190 | 195 | 204 | 207 |
| 28 | 57 | 88 | 110 | 125 | 136 | 146 | 151 | 156 | 157 | 168 |
| 32 | 62 | 79 | 91 | 103 | 112 | 120 | 127 | 134 | 138 | 145 |
| 40 | 64 | 77 | 86 | 92 | 98 | 104 | 107 | 110 | 114 | 120 |
| 2 | 22 | 41 | 59 | 44 | 41 | 39 | 47 | 90 | 82 | 85 |
| 17 | 68 | 93 | 106 | 117 | 128 | 135 | 140 | 146 | 150 | 154 |
| 71 | 107 | 132 | 154 | 167 | 179 | 184 | 195 | 199 | 204 | 204 |
| 38 | 113 | 134 | 160 | 178 | 203 | 221 | 234 | 243 | 247 | 252 |

Example from SPECpower, aggregated by Cloud Carbon Footprint (Link):

| Architecture | Min Watts (0%) | Max Watts (100%) |
|---|---|---|
| Skylake | 0.6446044454253452 | 4.193436438541878 |
| Broadwell | 0.7128342245989304 | 3.6853275401069516 |
| Haswell | 1.9005681818181814 | 6.012910353535353 |
| EPYC 2nd Gen | 0.4742621527777778 | 1.6929615162037037 |
| Cascade Lake | 0.6389493581523519 | 3.9673047343937564 |
| EPYC 3rd Gen | 0.44538981119791665 | 2.0193277994791665 |
| Ivy Bridge | 3.0369270833333335 | 8.248611111111112 |
| Sandy Bridge | 2.1694411458333334 | 8.575357663690477 |

Output value

A function described by its coefficients
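For instance, with the logarithmic model proposed later in this thread (power_consumption(workload) = a * ln(b * (workload + c)) + d), the output could simply be the fitted coefficients. The field names and values below are purely illustrative, not a settled schema:

{
  "consumption_profile": {
    "a": 85.0,
    "b": 0.8,
    "c": 1.2,
    "d": 50.0
  }
}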

da-ekchajzer commented 2 years ago

@samuelrince I would be interested in your opinion.

samuelrince commented 2 years ago

I have dug a little deeper into the original spreadsheet. It looks like making one logarithm-like function model per CPU "model" or "family" (Platinum, Gold, Silver, etc.) could be a good idea?

Platinum

[plots: power consumption vs. workload for Xeon Platinum CPUs]

Here the red one is line 6 from the spreadsheet, corresponding to c5.metal*. I don't know what the star means here, nor why the red curve stands out that much. It looks like it should be on the other graph.

Gold

[plot: power consumption vs. workload for Xeon Gold CPUs]

Silver

[plot: power consumption vs. workload for Xeon Silver CPUs]

E, E3, E5

[plot: power consumption vs. workload for Xeon E, E3 and E5 CPUs]

@da-ekchajzer, let me know what you think of this approach. I guess if we can have the CPU model name, we will have a better estimate of what the power consumption profile should look like for new CPU models.

samuelrince commented 2 years ago

Also, I don't understand the second table; it's not the power consumption per CPU architecture, right?

| Architecture | Min Watts (0%) | Max Watts (100%) |
|---|---|---|
| Skylake | 0.6446044454253452 | 4.193436438541878 |
| Broadwell | 0.7128342245989304 | 3.6853275401069516 |
| Haswell | 1.9005681818181814 | 6.012910353535353 |
| EPYC 2nd Gen | 0.4742621527777778 | 1.6929615162037037 |
| Cascade Lake | 0.6389493581523519 | 3.9673047343937564 |
| EPYC 3rd Gen | 0.44538981119791665 | 2.0193277994791665 |
| Ivy Bridge | 3.0369270833333335 | 8.248611111111112 |
| Sandy Bridge | 2.1694411458333334 | 8.575357663690477 |
da-ekchajzer commented 2 years ago

The second table represents the average consumption of a server depending on the CPU family (also called architecture).

My idea is to generate server consumption profiles per CPU family at first, until we gather data on specific CPUs (and other components) to build consumption profiles based on more precise data (number of cores, CPU model, …).

But either way, we should come up with a generic way of generating a consumption profile from a workload object as mentioned above. As I saw in your graphs, you use a linear approach to connect each point with its successor. IMHO, this approach is limited:

When few points are given (as with the second table, for instance), the consumption profile will be an affine function.

Example for Skylake family

"workload":{
"0":0.6446044454253452,
"100": 4.193436438541878
}
consumption_profil(x) =  ((100 - 0) / (4.193436438541878 - 0.6446044454253452)) * x + 0.6446044454253452 

Yet, we know the consumption profile is not linear.

Besides, with this approach, we won't come up with a function defined by its coefficients but with a set of affine functions connecting one point to the next.

Cloud Carbon Footprint uses an affine equation (as seen above) to come up with an average watts consumption: Average Watts(x) = Min Watts + x * (Max Watts - Min Watts), where x is the average utilization between 0 and 1.
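For example, with the Skylake row of the second table: Average Watts(0.5) = 0.6446 + 0.5 * (4.1934 - 0.6446) ≈ 2.42 W.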

I think that with the AWS data from Teads and the future data we'll have, we can be more ambitious and generate more precise functions.

What do you think of using a logarithmic regression (which I am not familiar with) based on the workload object and previous consumption profiles?

Does this make sense to you?

samuelrince commented 2 years ago

I think a logarithmic function is good enough to model the CPU power consumption given the workload, and I understand that often we will only have the min (idle) and max (100%) power consumption.

The idea is to have a more precise model if we can have access to the CPU model. Specifically, in Intel's case, if we know the CPU is a Xeon Platinum, Xeon Gold, or Xeon Silver, we can use a different base model to compute the "Final model", based on the min and max power consumption.

Here is an example:

We receive from the API both the CPU model and the workload, as follows:

{
  "cpu": {
    "model": "Intel Xeon Platinum 8124M"
  },
  "workload": {
    "0": 51,     <<< Power consumption in idle state (in W)
    "100": 413   <<< Power consumption at 100% load (in W)
  }
}

(Maybe not the actual JSON fields here.)

Given that we know it is a Xeon Platinum CPU, we can use a more precise model previously fitted on Xeon Platinum CPU data only; see the following:

[plot: "Platinum model" (white curve) fitted over the Xeon Platinum power consumption curves]

Here the white curve called "Platinum model" is a power consumption model inferred from all power consumption curves for Xeon Platinum CPUs.

We can then build a second model called the "Final model" using the "Platinum model" and the min and max power consumption values. We obtain the following model, in pink:

[plot: "Final model" (pink curve) built from the Platinum model and the min/max points]

(The blue curve is the actual CPU power consumption model.)

In the case where we don't have the CPU model but only the min/max workload, we can use a default model (still a log function) built from the whole power consumption dataset. This method will give less precise values, but still better than an affine function.

The log function I use to fit the data is:

power_consumption(workload) = a * ln(b * (workload + c)) + d

I hope my idea is clearer now. Let me know if you think it can be useful or if it is totally overkill.
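A minimal sketch of this "Final model" step with scipy.optimize.curve_fit, assuming the base coefficients are already known. The coefficient values, and the choice to freeze b and c so that a two-point fit is well-posed, are illustrative assumptions, not necessarily what the notebook does:

import numpy as np
from scipy.optimize import curve_fit

def log_model(workload, a, b, c, d):
    # power_consumption(workload) = a * ln(b * (workload + c)) + d
    return a * np.log(b * (workload + c)) + d

# Base "Platinum model" coefficients, assumed already fitted on the
# Platinum curves; these values are purely illustrative.
base_a, base_b, base_c, base_d = 85.0, 0.8, 1.2, 50.0

# With only two points (idle and 100%), fitting all four parameters is
# under-determined, so one option is to keep the base shape (b, c)
# fixed and re-fit only the scale a and the offset d.
def shifted_model(workload, a, d):
    return a * np.log(base_b * (workload + base_c)) + d

x = np.array([0.0, 100.0])
y = np.array([51.0, 413.0])  # measured idle and max power (W)
(final_a, final_d), _ = curve_fit(shifted_model, x, y, p0=[base_a, base_d])
print(final_a, base_b, base_c, final_d)  # coefficients of the final model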

da-ekchajzer commented 2 years ago

It is exactly what I was thinking but couldn't explain it so clearly. Thank you.

I think we should work with CPU family (architecture) / core number rather than the commercial naming (Xeon, …) for several reasons:

Could you explain the process with the equations to ease the implementation part? For example, how do you define a, b, c, d in your equation? Also, could we apply this mechanism when more than 2 values are given (0%, 50%, 100% for instance)? Does it make sense?

Process summary

1 - Input data

{
  "cpu": {
    "family": "skylake",
    "nb_core": 8
  },
  "workload": {
    "0": 51,     <<< Power consumption in idle state (in W)
    "50": 293,   <<< Power consumption at 50% load (in W)
    "100": 413   <<< Power consumption at 100% load (in W)
  }
}

2 - Look for an equivalent consumption profile

If one exists: search for an equivalent consumption profile with the same family and core_number.
Else, if one exists: search for an equivalent consumption profile with the same family.
Else: use the default consumption profile and go to (4).
(A sketch of this lookup is given after the summary.)

3 - Infer the consumption profile for the current type of CPU

ā‡’ What magic are you doing here ?

4 - Generate the consumption profile equation from 1) the inferred curve and 2) the input data

⇒ What magic are you doing here?

power_consumption(workload) = a * ln(b * (workload + c)) + d
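As a minimal sketch of the lookup in step 2: the registries, key names, and coefficient values below are hypothetical placeholders for illustration only:

# Hypothetical registries of already-fitted (a, b, c, d) coefficients;
# all names and values here are illustrative placeholders.
PROFILES_BY_FAMILY_AND_CORES = {("skylake", 8): (85.0, 0.8, 1.2, 50.0)}
PROFILES_BY_FAMILY = {"skylake": (80.0, 1.0, 1.0, 55.0)}
DEFAULT_PROFILE = (75.0, 1.0, 1.0, 60.0)

def lookup_base_profile(family, nb_core):
    # Most specific match first (family + core count), then family
    # alone, then the default profile, as described in step 2.
    if (family, nb_core) in PROFILES_BY_FAMILY_AND_CORES:
        return PROFILES_BY_FAMILY_AND_CORES[(family, nb_core)]
    if family in PROFILES_BY_FAMILY:
        return PROFILES_BY_FAMILY[family]
    return DEFAULT_PROFILE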

samuelrince commented 2 years ago

Could you explain the process with the equations to ease the implementation part? For example, how do you define a, b, c, d in your equation?

The implementation is really easy: it just uses the scipy.optimize.curve_fit function to create all the previous models. Basically, it is an optimization problem where we try to fit a function (power_consumption(workload) = a * ln(b * (workload + c)) + d) to some data points. If we have multiple data points, we can just fit one model per CPU and then merge all the models into one by averaging the parameters (a, b, c, d) of the models. I have only set a few constraints on the parameters.

The optimization done in curve_fit to find a, b, c and d is a least-squares approximation.
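A minimal sketch of that per-CPU fit and merge. The two curves are rows from the Teads table above; averaging the coefficients follows the description, not necessarily the notebook's exact code:

import numpy as np
from scipy.optimize import curve_fit

def log_model(workload, a, b, c, d):
    # power_consumption(workload) = a * ln(b * (workload + c)) + d
    return a * np.log(b * (workload + c)) + d

workloads = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
# Two Xeon Platinum curves taken from the Teads table above.
curves = [
    np.array([51, 135, 174, 212, 249, 293, 330, 357, 382, 404, 413]),
    np.array([58, 176, 241, 299, 375, 448, 520, 562, 590, 607, 617]),
]

# Fit one model per CPU, then merge by averaging (a, b, c, d).
per_cpu = [curve_fit(log_model, workloads, y, p0=[100, 1, 1, 0], maxfev=10000)[0]
           for y in curves]
merged = np.mean(per_cpu, axis=0)  # coefficients of the merged model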

I can provide you with the POC in a notebook if you want (I have to clean it up a bit first).

Also, could we apply this mechanism when more than 2 values are given (0%, 50%, 100% for instance)? Does it make sense?

The optimization process described above can work with 2 or more values. With more values, we can expect higher precision. Depending on the number of data points we have, it can be useful to start the optimization process from a base model (like the Platinum model); that way, we start with parameters that are already defined and we can just "try to shift the curve" until it meets the min and max workload, for instance. But I think that if we have 3 or more data points as input, we don't need that first step, as the model we try to fit is very simple and regular.

I think we should work with CPU family (architecture) / core number rather than the commercial naming (Xeon, …) for several reasons: ...

I have tried to put the family (or architecture?) next to each CPU; tell me if you see an error, but I think it is OK. It gives me this:

| CPU model | CPU family |
|---|---|
| Intel Xeon E-2278G | Coffee Lake |
| Intel Xeon E3 1240v6 | Sandy Bridge |
| Intel Xeon E5-2660 | Sandy Bridge |
| Intel Xeon E5-2686 v4 | Broadwell |
| Intel Xeon Gold 5120 | Skylake |
| Intel Xeon Gold 5218 | Cascade Lake |
| Intel Xeon Gold 6230R | Cascade Lake |
| Intel Xeon Platinum 8124M | Skylake |
| Intel Xeon Platinum 8151 | Skylake |
| Intel Xeon Platinum 8175M | Skylake |
| Intel Xeon Platinum 8252C | Cascade Lake |
| Intel Xeon Platinum 8259CL | Cascade Lake |
| Intel Xeon Platinum 8275CL | Cascade Lake |
| Intel Xeon Silver 4110 | Skylake |
| Intel Xeon Silver 4114 | Skylake |
| Intel Xeon Silver 4210R | Cascade Lake |
| Intel Xeon Silver 4214 | Cascade Lake |

(I've removed the ones with * for now because I don't understand why they look so weird on the graphs...)

Given that classification I can plot all CPU power consumption curves per family.

[plots: power consumption curves per CPU family (Skylake, Cascade Lake, Sandy Bridge)]

There is only one CPU for each of Coffee Lake and Broadwell, so I haven't plotted them.

You can see that on each graph we clearly have different CPU profiles even though they are from the same family/architecture. And on the first two graphs, we can see that the Platinum ones are always close together at the top, then the Gold ones, and then the Silver ones at the bottom.

That is why I first grouped them by CPU "model" (Platinum, Gold, Silver, E3, E5, E): when you plot them together, their profiles look very similar even though they are not from the same family/architecture or launch year.

If we take into account the number of CPU cores in addition to the CPU architecture, it is still not really satisfying:

[plots: power consumption curves per CPU family and core count]

(The number after the full CPU name is the number of cores, e.g. "Intel Xeon Platinum 8275CL 24" means 24 cores.)

You have CPUs with fewer cores above CPUs with more cores, and vice versa.

Let me know what you think; maybe it is a subject to discuss in a meeting? But in the end, at this stage, I am only convinced by grouping CPUs by their "model". Of course, if we have more data, we can then consider architecture and number of cores, but only within the same CPU model group.

da-ekchajzer commented 2 years ago

Thank you for the explanations.

From your work, it seems very clear that the CPU model is the best strategy. As you mentioned, it would be nice to find data on other CPUs (AMD, for instance) to validate this.

@github-benjamin-davy, since this strategy is based on your data, I think your opinion would be valuable.

I think a Jupyter notebook is a good input for the implementation if you can provide it.

I thought we could begin to implement it as a route (POST /cpu/consumption_profil) which takes a CPU object and a workload object and returns the coefficients a, b, c of the function.

The usage of the consumption profile will be implemented in #87 and #88
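To illustrate, here is a minimal sketch of what such a route could look like, assuming a FastAPI-style router; the schema, names, and the direct fit (which assumes at least four workload points; with fewer, a base profile would seed the fit as discussed above) are illustrative, not the actual implementation:

from typing import Dict, Optional
import numpy as np
from fastapi import APIRouter
from pydantic import BaseModel
from scipy.optimize import curve_fit

router = APIRouter()

class CPU(BaseModel):
    family: Optional[str] = None
    nb_core: Optional[int] = None

class ConsumptionProfileRequest(BaseModel):
    cpu: CPU
    workload: Dict[str, float]  # {"workload %": power in W}, e.g. {"0": 51, "100": 413}

def log_model(workload, a, b, c, d):
    # power_consumption(workload) = a * ln(b * (workload + c)) + d
    return a * np.log(b * (workload + c)) + d

@router.post("/cpu/consumption_profil")
def consumption_profil(req: ConsumptionProfileRequest):
    items = sorted((float(k), v) for k, v in req.workload.items())
    x = np.array([k for k, _ in items])
    y = np.array([v for _, v in items])
    # Illustrative direct fit; the real process would first pick a base
    # profile from req.cpu, as described earlier in the thread.
    (a, b, c, d), _ = curve_fit(log_model, x, y, p0=[100, 1, 1, 0], maxfev=10000)
    return {"a": a, "b": b, "c": c, "d": d}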

da-ekchajzer commented 2 years ago

It makes me think that we should add a model attribute to the CPU.

I will modify #82 to make it possible to complete family and model from cpu name.

github-benjamin-davy commented 2 years ago

Hello here, I'll try to catch up on the discussion,

@samuelrince the * is simply used as a way to exclude some lines from the VLOOKUP in the spreadsheet (some lines refer to underclocked machines).

I would say that the most essential characteristic of a CPU is its TDP, which should most of the time be close to the max consumption, from what I've experienced (however, two CPUs with the same TDP might not behave exactly the same). As you have seen, CPUs from the same family can have very different power consumption for the same number of cores (it depends on voltage & frequency).

samuelrince commented 2 years ago

Hey @da-ekchajzer, you can take a look at this notebook for a working implementation.

POC_cpu_workload_power_consumption.zip

da-ekchajzer commented 1 year ago

Implemented as a router for CPU in https://github.com/Boavizta/boaviztapi/pull/113