Closed. samuelrince closed this issue 10 months ago.
p4d.24xlarge (GPU):

Instance:
- id: p4d.24xlarge
- vcpu: 96
- memory: 1152
- storage_type: ssd
- storage_units: 8
- storage_capacity: 1000
- gpu: 8
- platform: p4d.24xlarge
Platform (archetype):
- id: p4d.24xlarge
- manufacturer: AWS
- CASE.case_type: rack
- CPU.units: 2
- CPU.name: Intel Xeon Platinum 8275CL
- RAM.units: 36
- RAM.capacity: 32
- SSD.units: 8
- SSD.capacity: 1000
- HDD.units: 0
- HDD.capacity: 0
- GPU.units: 8
- GPU.name: NVIDIA A100
- GPU.memory: 40
- GPU.connector_type: sxm
- POWER_SUPPLY.units: 2
- POWER_SUPPLY: 2.99;1;5
- USAGE.time_workload: 50;0;100
- USAGE.use_time_ratio: 1
- USAGE.hours_life_time: 35040
- USAGE.other_consumption_ratio: 0.33;0.2;0.6
- USAGE.overcommitted: ??? (true/false or ratio)
- warnings: ...
d3.2xlarge (storage optimized):

Instance:
- id: d3.2xlarge
- vcpu: 8
- memory: 64
- storage_type: hdd
- storage_units: 6
- storage_capacity: 2000
- gpu: 0
- platform: ??? (see different alternatives)
We set platform: d3.8xlarge (biggest instance). So instance d3.8xlarge is:
- id: d3.8xlarge
- vcpu: 32
- memory: 256
- storage_type: hdd
- storage_units: 24
- storage_capacity: 2000
- gpu: 0
- platform: d3.8xlarge
Platform:
- id: d3.8xlarge
- manufacturer: AWS
- CASE.case_type: rack
- CPU.units: 2
- CPU.name: Intel Xeon Platinum 8259CL
- RAM.units: 8
- RAM.capacity: 32
- SSD.units: 0
- SSD.capacity: 0
- HDD.units: 24
- HDD.capacity: 2000
- GPU.units: 0
- GPU.name: null
- GPU.memory: 0
- GPU.connector_type: null
- POWER_SUPPLY.units: 2
- POWER_SUPPLY: 2.99;1;5
- USAGE.time_workload: 50;0;100
- USAGE.use_time_ratio: 1
- USAGE.hours_life_time: 35040
- USAGE.other_consumption_ratio: 0.33;0.2;0.6
- USAGE.overcommitted: ??? (true/false or ratio)
- warnings: ...
The platform described above has 2 units * 48 threads = 96 vcpu (source), but the instance itself only has 32 vcpu. Meaning that in terms of vcpu allocation we could fit 3 d3.8xlarge onto the underlying platform.
The platform described has 24 disks whereas d3.2xlarge has only 6 of them. So we can fit 4 d3.2xlarge onto the underlying platform, but this is not coherent with the vcpu allocation (96 vcpu on the platform vs 8 vcpu * 4 instances = 32 vcpu).
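The mismatch above can be sketched in a few lines of Python. This is a hypothetical helper, not BoaviztAPI code; field names such as `hdd_units` are illustrative assumptions:

```python
# Hypothetical sketch: how many instances fit on a platform along each
# resource dimension. Values come from the listings above.
def fit_counts(platform: dict, instance: dict) -> dict:
    return {
        "vcpu": platform["vcpu"] // instance["vcpu"],
        "disks": platform["hdd_units"] // instance["storage_units"],
    }

d3_platform = {"vcpu": 96, "hdd_units": 24}   # 2 x 48-thread CPUs, 24 disks
d3_2xlarge = {"vcpu": 8, "storage_units": 6}

print(fit_counts(d3_platform, d3_2xlarge))  # vcpu allows 12 instances, disks only 4
```

The two dimensions disagree (12 vs 4), which is exactly the incoherence described above.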
We set platform: d3.platform. Notice that d3.platform does not exist in AWS (only in the BoaviztAPI archetype referential).
Platform (archetype):
- id: d3.platform
- manufacturer: AWS
- CASE.case_type: rack
- CPU.units: 2
- CPU.name: Intel Xeon Platinum 8259CL
- RAM.units: 12
- RAM.capacity: 64
- SSD.units: 0
- SSD.capacity: 0
- HDD.units: 72
- HDD.capacity: 2000
- GPU.units: 0
- GPU.name: null
- GPU.memory: 0
- GPU.connector_type: null
- POWER_SUPPLY.units: 2
- POWER_SUPPLY: 2.99;1;5
- USAGE.time_workload: 50;0;100
- USAGE.use_time_ratio: 1
- USAGE.hours_life_time: 35040
- USAGE.other_consumption_ratio: 0.33;0.2;0.6
- USAGE.overcommitted: ??? (true/false or ratio)
- warnings: ...
Platform has 96 vcpus.
Can fit 3 x d3.8xlarge in terms of vcpu.
So the amount of memory should be 3 x d3.8xlarge memory capacity = 3 instances × 256 GB = 768 GB.
We can assume that 768 GB = 12 banks × 64 GB.
So the number of disks should be 3 x d3.8xlarge disks = 3 instances × 24 units = 72 units.
And it also works for the original instance:
Can fit 12 x d3.2xlarge in terms of vcpu.
So the amount of memory should be 12 x d3.2xlarge memory capacity = 12 instances × 64 GB = 768 GB (still 12 banks × 64 GB).
Matches the first assumption ✓
So the number of disks should be 12 x d3.2xlarge disks = 12 instances × 6 units = 72 units.
Matches the first assumption ✓
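The consistency check above can be written as a quick script. This is an illustrative sketch; the dict keys are assumptions, not actual archetype fields:

```python
# Virtual platform d3.platform built from 3 x d3.8xlarge, as derived above.
platform = {"vcpu": 96, "memory": 768, "hdd_units": 72}

d3_8xlarge = {"vcpu": 32, "memory": 256, "hdd_units": 24}
d3_2xlarge = {"vcpu": 8, "memory": 64, "hdd_units": 6}

# Filling the platform by vcpu must also exactly fill memory and disks.
for inst in (d3_8xlarge, d3_2xlarge):
    n = platform["vcpu"] // inst["vcpu"]
    assert n * inst["memory"] == platform["memory"]
    assert n * inst["hdd_units"] == platform["hdd_units"]
print("platform is coherent for both instance sizes")
```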
Any opinions @demeringo @github-benjamin-davy @JacobValdemar?
Thank you for submitting this proposal. Here are my initial thoughts, dumped in random order:
I agree, the CSV file can be confusing to interact with.
I agree, one of the most frustrating things is data duplication between aws.csv and cpu_specs.csv.
I agree, there is a steep learning curve before being able to contribute with data. Some complexity is inherent (e.g. CPU properties are inherently complex to understand without prior knowledge), and some other complexity is unnecessary and should be mitigated (as you describe).
The proposed solution makes it more explicit that a cloud instance is a fraction of a platform. I like that.
The proposed solution de-duplicates data. I like that.
I don't know if it is intentional, but it seems you have removed some of the fields from the platform which are currently in aws.csv. I like that. To me it seems better to push for adding CPU details in cpu_specs.csv instead of into aws.csv.
Do you propose creating a separate file (e.g. platforms.csv) for platforms, or should they continue to reside in aws.csv?
Regarding the tricky example d3.2xlarge, I prefer the virtual platform because it provides a more consistent and comparable result.
Something else is that I think it feels "bad" to work inside a large CSV file like aws.csv. I struggle with distinguishing columns from each other and identifying which "header" a value belongs to. However, I don't have any suggestion for a better solution. I have thought about JSON/YAML, but I think they would make the files too long, since each value uses a row/line. However, it could be something to consider. Just a thought.
Thank you, @samuelrince, for detailing our discussion so well, and thank you, @JacobValdemar, for your feedback, which confirms the importance of this reflection.
Do you propose creating a separate file (e.g. platforms.csv) for platforms, or should they continue to reside in aws.csv?
In my opinion, we should use the server archetype CSV, which already has the necessary columns. This could allow contributors to add instances by identifying a nearby generic platform already in the file without having to add it.
Regarding the tricky example d3.2xlarge, could the problem be that we assume that one platform hosts only one type of instance? I think this issue will also occur for RAM and GPU.
Could the problem be solved by allocating the impacts component per component?
- id: d3.8xlarge
- vcpu: 32
- memory: 256
- storage_type: hdd
- storage_units: 24
- storage_capacity: 2000
- gpu: 0
- platform: d3.8xlarge
Platform:
- id: d3.8xlarge
- manufacturer: AWS
- CASE.case_type: rack
- CPU.units: 2
- CPU.name: Intel Xeon Platinum 8259CL
- RAM.units: 8
- RAM.capacity: 32
- SSD.units: 0
- SSD.capacity: 0
- HDD.units: 24
- HDD.capacity: 2000
- GPU.units: 0
- GPU.name: null
- GPU.memory: 0
- GPU.connector_type: null
- POWER_SUPPLY.units: 2
- POWER_SUPPLY: 2.99;1;5
- USAGE.time_workload: 50;0;100
- USAGE.use_time_ratio: 1
- USAGE.hours_life_time: 35040
- USAGE.other_consumption_ratio: 0.33;0.2;0.6
- USAGE.overcommitted: ??? (true/false or ratio)
- warnings: ...
$$ CPU_{embodied}^{instance} = \frac{instance.vcpu}{platform.vcpu} \times CPU_{embodied}^{platform} $$
$$ RAM_{embodied}^{instance} = \frac{instance.memory}{platform.RAM.units \times platform.RAM.capacity} \times RAM_{embodied}^{platform} $$
$$ SSD_{embodied}^{instance} = \frac{instance.storage\_units \times instance.storage\_capacity}{platform.SSD.units \times platform.SSD.capacity} \times SSD_{embodied}^{platform} $$
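A sketch of these per-component ratios in Python. The field names are illustrative, and since d3 instances carry hard drives, the storage term is applied to HDD here instead of SSD:

```python
# Per-component allocation ratios, following the formulas above.
def component_ratios(instance: dict, platform: dict) -> dict:
    return {
        "CPU": instance["vcpu"] / platform["vcpu"],
        "RAM": instance["memory"] / (platform["RAM.units"] * platform["RAM.capacity"]),
        "HDD": (instance["storage_units"] * instance["storage_capacity"])
               / (platform["HDD.units"] * platform["HDD.capacity"]),
    }

# d3.8xlarge on the virtual d3.platform: every ratio comes out to 1/3.
instance = {"vcpu": 32, "memory": 256, "storage_units": 24, "storage_capacity": 2000}
platform = {"vcpu": 96, "RAM.units": 12, "RAM.capacity": 64,
            "HDD.units": 72, "HDD.capacity": 2000}
print(component_ratios(instance, platform))
```

Each platform component's embodied impact would then be multiplied by its own ratio before aggregation.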
If we find out that this solution is too complicated or not relevant, I would also prefer the virtual platform solution.
Thank you both, @JacobValdemar @da-ekchajzer, for your feedback on such short notice!
Do you propose creating a separate file (e.g. platforms.csv) for platforms, or should they continue to reside in aws.csv?
In my opinion, we should use the server archetype CSV, which already has the necessary columns. This could allow contributors to add instances by identifying a nearby generic platform already in the file without having to add it.
I agree with @da-ekchajzer about the CSV for platforms, we should use the already existing one with server archetypes. If the file gets too big that it becomes an issue, we can still split it by cloud provider in the future.
Regarding the tricky example d3.2xlarge, could the problem be that we assume that one platform hosts only one type of instance? I think this issue will also occur for RAM and GPU.
Could the problem be solved by allocating the impacts component per component?
On the subject of allocation by components, I like the idea, but I think we will struggle to make it happen in the v1, given the architecture (cf our previous ~2Β min~ 55Β min conversation). We probably need to do some ugly computation in the code to extract the impacts of given components from the server object.
Also, for this approach to work, I think we need to specify the "purpose" of an instance to decide on which component we are going to base the allocation (compute, storage, general purpose, etc.). For instance, if we take g5 instances, they have SSDs, but the impact is clearly due to the compute part (GPU, CPU, RAM), so we should say it is a compute instance. But it is a compute instance with a GPU, so should we allocate by GPU and/or CPU and/or RAM? I think it adds complexity, and we need to think this through thoroughly.
Plus, I really don't know if it makes sense for servers to host different types of instances? Does that really exist? If I select the CPU of the d3 instance, I see that m5, r5, vt1, g4 also share the same CPU, so it could be possible... And given that, I think the virtual platform makes sense in that use case as well.
I am not entirely convinced by the virtual platform, but I find it easier to deal with, even though this is something we will probably have a hard time to fully automate (in terms of platform creation in the CSV file). And later (in v2?), we can maybe address the issue of component-wise allocation strategies.
What do you think?
Something else is that I think it feels "bad" to work inside a large CSV file like aws.csv. I struggle with distinguishing columns from each other and identifying what "header" a value belongs to. However, I don't have any suggestion for a better solution. I have thought about JSON/YAML, but I think they would make too long files since each value use a row/line. However, it could be something to consider. Just a thought.
On that subject, I feel that it can also be an obstacle to new contributions. On my side, for the research part, I open the CSV on GitHub and filter the rows, but when appending data at the end of the file I can see myself struggling with that as well. Maybe we can look for an open-source project that can expose a CSV file with a nice UI in the browser, for instance?
I am thinking of projects like instances.vantage.sh for example.
Plus, I really don't know if it makes sense for servers to host different types of instances? Does that really exist? If I select the CPU of the d3 instance, I see that m5, r5, vt1, g4 also share the same CPU, so it could be possible... And given that, I think the virtual platform makes sense in that use case as well.
I was thinking more of the same "type" of instance but different levels of resources. I think that in some cases the different resources (RAM, vCPU, SSD, GPU) don't scale linearly. Is that the problem for d3.8xlarge?
We probably need to do some ugly computation in the code to extract the impacts of given components from the server object.
I think it would be easy to implement, but may be more complicated to explain/document. We would need to apply a ratio on each component during the impacts aggregation. The ratio would be computed for each component from the platform and instance data.
I just cannot figure out if that will solve our problem.
Plus, I really don't know if it makes sense for servers to host different types of instances? Does that really exist? If I select the CPU of the d3 instance, I see that m5, r5, vt1, g4 also share the same CPU, so it could be possible... And given that, I think the virtual platform makes sense in that use case as well.
I was thinking more of the same "type" of instance but different levels of resources. I think that in some cases the different resources (RAM, vCPU, SSD, GPU) don't scale linearly. Is that the problem for d3.8xlarge?
Well, it's not only about the allocation, but also about how to choose the platform instance in that case.
The premise here is to guess the total number of vcpus of the platform. One CPU (Intel Xeon Platinum 8259CL) has 48 vcpu, so a single CPU is enough to fit 1 x d3.8x + 1 x d3.4x. But it is possible (and highly probable, in my opinion) that we have 2 CPUs. If that's the case, how do we guess the scaling of the other components (RAM and HDDs here) for the platform? That is what I proposed in Alternative 2 of the previous comment: use the most probable config in terms of platform vcpus, then infer the rest by trying to fit N times the same instance (ideally the biggest one, checking that it also works with the other variants).
I remember from our discussion that you indeed mentioned that it was not that difficult to add allocation based on components. I think the best way to answer this is to test.
The platform can fit 1x d3.8x + 1x d3.4x, meaning we can deduce the following minimal configuration:
Platform:
If we compute the embodied impacts of the platform, we have:
Platform impacts:
- gwp: 2200 kgCO2eq
- adp: 0.13 kgSbeq
- pe: 24000 MJ
Meaning that we can now compute the impacts of d3.8x and d3.4x instances, by vcpu only or by all components.
Instance has 32 vcpu, so 32/48 of total embodied impacts.
d3.8x instance impacts:
- gwp: 1467 kgCO2eq
- adp: 0.087 kgSbeq
- pe: 16000 MJ
Instance has 16 vcpu, so 16/48 of total embodied impacts.
d3.4x instance impacts:
- gwp: 733 kgCO2eq
- adp: 0.043 kgSbeq
- pe: 8000 MJ
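The by-vcpu figures above can be checked with a short snippet (platform impact values copied from this comment; this is an illustrative sketch, not API code):

```python
# Embodied impacts of the single-CPU platform (48 vcpu), from the figures above.
platform_impacts = {"gwp": 2200, "adp": 0.13, "pe": 24000}
PLATFORM_VCPU = 48

def allocate_by_vcpu(instance_vcpu: int) -> dict:
    """Scale every platform impact by the instance's share of vcpus."""
    ratio = instance_vcpu / PLATFORM_VCPU
    return {k: round(v * ratio, 3) for k, v in platform_impacts.items()}

print(allocate_by_vcpu(32))  # d3.8x: gwp ~1467 kgCO2eq, pe 16000 MJ
print(allocate_by_vcpu(16))  # d3.4x: gwp ~733 kgCO2eq, pe 8000 MJ
```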
Instance has 32 vcpu, 256 GB of RAM and 24 disks.
d3.8x instance impacts:
- gwp: 18 + 473 + 747 + 100 + 100 + 4.45 = 1442 kgCO2eq
- adp: 0.0136 + 0.02 + 0.006 + 0.033 + 0.0135 = 0.0861 kgSbeq
- pe: 260 + 6000 + 6624 + 1400 + 1467 + 46 = 15797 MJ
We are very close to the impacts with the previous calculation.
Not doing this one, sorry.
The platform can fit 3x d3.8x, meaning we can deduce the following minimal configuration:
Platform:
If we compute the embodied impacts of the platform, we have:
Platform impacts:
- gwp: 4100 kgCO2eq
- adp: 0.19 kgSbeq
- pe: 44000 MJ
Meaning that we can now compute the impacts of d3.8x and d3.4x instances by vcpu only or by all components.
Instance has 32 vcpu, so 32/96 of total embodied impacts.
d3.8x instance impacts:
- gwp: 1366 kgCO2eq
- adp: 0.063 kgSbeq
- pe: 14666 MJ
Instance has 16 vcpu, so 16/96 of total embodied impacts.
d3.4x instance impacts:
- gwp: 683 kgCO2eq
- adp: 0.0317 kgSbeq
- pe: 7333 MJ
d3.8x instance impacts:
- gwp: 18 + 467 + 746 + 100 + 100 + 44 + 4.45 = 1479 kgCO2eq
- adp: 0.0136 + 0.0197 + 0.006 + 0.033 + 0.0135 + 0.00246 = 0.088 kgSbeq
- pe: 263 + 6000 + 6623 + 1400 + 1467 + 557 + 46 = 16356 MJ
Not doing this one, again.
TL;DR: I think we are overengineering this.
Of course, scenario one here is kind of scaled based on the vcpu again, so I am not surprised by that result. If you want to test another configuration, feel free to try. But given the margins, I think it's overkill.
Thank you very much for doing this exercise.
So from what you say, using an allocation on vcpu or on each component is not an important question, as long as the platforms are built accordingly?
I'm sorry, but I've just managed to identify what's bothering me. The problem with doing so is that the following data would never be used in the impact calculation (but would be used by contributors to construct the platform):
- memory: 256
- storage_type: hdd
- storage_units: 24
- storage_capacity: 2000
- gpu: 0
This puts the complexity into the platform's construction, and it reduces the importance of the instance data, which is the most important to consider.
I would have liked contributors to be able to associate an instance with a generic platform when they do not know how to build platforms. If the allocation is made per component, the API would only allocate the RAM/storage/CPU/GPU impacts for the instance based on its information. By doing so, we ensure that all reserved resources are accounted for, even if a generic platform is used.
If we put great effort into building platforms based on instance information, it doesn't change anything (as you have shown); if a generic platform is used, it avoids totally incoherent evaluations.
TL;DR: Our families are missing us
So from what you say, using an allocation on vcpu or on each component is not an important question, as long as the platforms are built accordingly?
Well, yes, but only if you make an "educated" guess based on vcpu. In other scenarios, it's not the case.
I agree with you on the fact that we don't use the instance's specs, and in that case, we shouldn't even bother asking the user to input them!
So I have made a notebook to quickly test different platforms and instances.
Here is an example:
Platform:
- CPU.units: 2
- CPU.cores: 24
- RAM.units: 8
- RAM.capacity: 64 GB
- HDD.units: 24
- HDD.capacity: 14000 GB
- vcpu: 96
- memory: 512 GB
- hdd_storage: 336000 GB
Instance:
- vcpu: 32
- memory: 256 GB
- hdd_units: 24
- hdd_capacity: 2000 GB
- hdd_storage: 48000 GB
PLATFORM IMPACTS

|     | CPU    | RAM      | HDD     | Others  | Total    |
| --- | ------ | -------- | ------- | ------- | -------- |
| gwp | 54.00  | 900.00   | 746.40  | 372.78  | 2073.18  |
| adp | 0.04   | 0.04     | 0.01    | 0.07    | 0.16     |
| pe  | 790.00 | 12000.00 | 6624.00 | 5204.60 | 24618.60 |

INSTANCE IMPACTS

* Allocation by vcpu

|     | CPU    | RAM     | HDD     | Others  | Total   |
| --- | ------ | ------- | ------- | ------- | ------- |
| gwp | 18.00  | 300.00  | 248.80  | 124.26  | 691.06  |
| adp | 0.01   | 0.01    | 0.00    | 0.02    | 0.05    |
| pe  | 263.33 | 4000.00 | 2208.00 | 1734.87 | 8206.20 |

* Allocation by components

|     | CPU    | RAM     | HDD    | Others  | Total   |
| --- | ------ | ------- | ------ | ------- | ------- |
| gwp | 18.00  | 450.00  | 106.63 | 124.26  | 698.89  |
| adp | 0.01   | 0.02    | 0.00   | 0.02    | 0.06    |
| pe  | 263.33 | 6000.00 | 946.29 | 1734.87 | 8944.49 |

* Equivalent server

|     | CPU    | RAM     | HDD    | Others  | Total    |
| --- | ------ | ------- | ------ | ------- | -------- |
| gwp | 22.00  | 470.00  | 93.30  | 372.78  | 958.08   |
| adp | 0.02   | 0.02    | 0.00   | 0.07    | 0.12     |
| pe  | 330.00 | 5900.00 | 828.00 | 5204.60 | 12262.60 |
The additional "equivalent server" is here to compare what would be the impact of a probable server that has the same characteristics of the instance.
I invite you to test the notebook (it is on Google Colab and editable by anyone, I have a local copy).
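The gwp rows of the comparison can be reproduced roughly like this (values copied from the tables above; "Others" is scaled by vcpu share in both schemes, which is an assumption that matches the notebook's totals):

```python
# Platform gwp impacts per component (kgCO2eq), from the notebook tables above.
platform_gwp = {"CPU": 54.00, "RAM": 900.00, "HDD": 746.40, "Others": 372.78}

# Allocation by vcpu: a single 32/96 ratio applied to every component.
by_vcpu = {k: v * 32 / 96 for k, v in platform_gwp.items()}

# Allocation by components: each component gets its own ratio
# (RAM 256/512 GB, HDD 48/336 TB, CPU and Others by vcpu share).
ratios = {"CPU": 32 / 96, "RAM": 256 / 512, "HDD": 48000 / 336000, "Others": 32 / 96}
by_component = {k: platform_gwp[k] * ratios[k] for k in platform_gwp}

print(round(sum(by_vcpu.values()), 2))       # 691.06
print(round(sum(by_component.values()), 2))  # 698.89
```

The totals are close here, but the split between RAM and HDD differs a lot, which is what the discussion below is about.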
This made me change my mind: I think we need to do the allocation by components. It makes more sense, and usually it's closer to "my expected reality" (whatever that means).
Also, I think we will probably need to add more archetypes based on what exists in the wild and with smaller min/max ranges so that it makes sense.
I think we can make the following archetypes:
With the following variants:
For instance, a d3en.12xlarge is a monster of 24 x 14 TB = 336 TB; we currently don't have archetypes with that kind of storage.
TL;DR: You were right from the beginning.
Perfect! I will work on the implementation over the next few days. Do you think you could handle the addition of AWS platforms and existing instances in the right format? @JacobValdemar, since you made the file in the first place, you might be of help on that as well.
Sure, just reach out if there is anything
I started to reference all the instances with the new format and link them with platforms (and "virtual platforms" when we don't know).
https://docs.google.com/spreadsheets/d/1EmXYTUx0Nmmubj96_-fTThu7UK16Og-gcSqSOl7qB3c/edit?usp=sharing
I still need to run some checks on this file and then create the virtual platforms.
Problem
We want to make the process of adding cloud instances as simple as possible, while:
Resulting in describing both instance characteristics and platform (or bare metal) characteristics.
As of today, both are stored in the same CSV file (cloud archetypes), which can be confusing to interact with. First, it leads to duplicated data about some components (especially CPU, with cpu_specs.csv). Second, the contributor needs to understand complex concepts to be able to make a new submission (e.g. the difference between vcpu, platform_vcpu, CPU.core_units * CPU.units, and USAGE.instance_per_server based on vcpu counts).
Solution
We propose a new way to add cloud instances that should clarify this process. We will separate the concept of a cloud instance and platform (or bare-metal server).
A cloud instance will be described with very few fields that are close to the description provided by cloud providers.
Example of a c5.2xlarge (in new aws.csv):

The platform defined here is c5.18xlarge, which is another cloud instance AND a server archetype defined as follows.

Cloud instance (also in new aws.csv):

Platform (server archetype):
In this description, the embodied impacts of the cloud instance can be derived from this operation:

$$ Instance_{embodied} = \frac{instance.vcpu}{platform.vcpu} \times Platform_{embodied} $$
We will need to introduce the notion of "vcpu" in CPU modeling so that we can take into account the number of "threads" or "virtual cores" in hyper-threading scenarios. Like the following:

$$ \text{platform.vcpu} = \text{Platform.CPU.units} \times \text{Platform.CPU.vcpu} $$
OR:
$$ \text{platform.vcpu} = \text{Platform.CPU.units} \times \text{Platform.CPU.core\_units} \times \text{(\# virtual cores per core)} $$
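As a tiny numeric example of the second formula (numbers taken from the notebook platform above; 2 threads per core assumes hyper-threading):

```python
# platform.vcpu from CPU units, cores per CPU, and threads per core.
cpu_units, core_units, threads_per_core = 2, 24, 2
platform_vcpu = cpu_units * core_units * threads_per_core
print(platform_vcpu)  # 96
```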
Happy to hear about your feedback @da-ekchajzer. I will detail other examples below.