Boavizta / boaviztapi

🛠 Giving access to BOAVIZTA reference data and methodologies through a RESTful API
GNU Affero General Public License v3.0

Missing instance types in aws.csv #232

Closed JacobValdemar closed 7 months ago

JacobValdemar commented 9 months ago

Bug description

aws.csv is missing around 300 instance types. See this gist.

I would like to help add the missing instance types, but I can't figure out where the data come from, or whether a script is used for adding new types or updating the file. A description of how you normally find the relevant data for an instance type would be incredibly helpful.

To Reproduce

Expected behavior

That all instance types were in the file.

JSON OUTPUT

Additional context

da-ekchajzer commented 9 months ago

Thanks for wanting to contribute!

Process for adding instances

id manufacturer CASE.case_type year vcpu platforme_vcpu CPU.units CPU.core_units CPU.name CPU.manufacturer CPU.model_range CPU.family CPU.tdp CPU.manufacture_date instance.ram_capacity RAM.capacity RAM.units SSD.units SSD.capacity HDD.units GPU.name GPU.units GPU.TDP GPU.memory_capacity POWER_SUPPLY.units POWER_SUPPLY.unit_weight USAGE.instance_per_server USAGE.time_workload USAGE.use_time_ratio USAGE.hours_life_time USAGE.other_consumption_ratio USAGE.overcommited Warnings configuration.disk.units configuration.disk.type configuration.disk.capacity
a1.2xlarge AWS rack 2018 8 16 1 16 Graviton Annapurna Labs Graviton Graviton 40 2018 16 16 2 0 0 0 2;2;2 2.99;1;5 2 50;0;100 1 35040 0.33;0.2;0.6 0 0

Datasource

Most configuration information should be available in AWS documentation or third-party services. See, for example, c5.12xlarge:

Automation

We don't have an automated process to populate our database. If you wish to automate it, feel free to provide a Jupyter notebook.
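As a possible starting point for such a notebook, here is a minimal sketch that lists which instance types have no row in an existing file. It assumes a comma-separated file with an `id` column, as in the header shown above. The function name `missing_instance_ids`, the sample excerpt and the list of known types are hypothetical; the list of known types would have to come from a source of your choice.

```python
import csv
import io


def missing_instance_ids(csv_text, known_ids):
    """Return the ids in known_ids that have no row in the CSV text, sorted."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = {row["id"].strip() for row in reader}
    return sorted(set(known_ids) - present)


# Hypothetical excerpt: the real aws.csv has many more columns and rows.
sample_csv = "id,manufacturer,vcpu\na1.2xlarge,AWS,8\nc5.12xlarge,AWS,48\n"
known = ["a1.2xlarge", "c5.12xlarge", "c7g.medium"]
print(missing_instance_ids(sample_csv, known))
```

In a notebook, `sample_csv` would be replaced by the content of the real file and `known` by an export of current instance types.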

Hope this answers all your questions. Feel free to ask more.

JacobValdemar commented 9 months ago

@da-ekchajzer thanks for the description! So basically I should just leave a field empty if I can't find the data?

Existing data

I have looked at the existing data to see if I can understand how it is extrapolated from external data sources. However, it seems that there are some errors in the existing data. Can that be true? If there are errors in the existing data, should I fix them? Also, it seems that for many instances defaults are hardcoded into the datasheet; shouldn't they be omitted, as you said? For example, POWER_SUPPLY.units is specified as 2;2;2 for all rows.

Also, should Previous Generation Instances, such as c1.medium and c1.xlarge, be removed?

New data I add

I have tried to create a table describing where to get the data for each column. Does it seem right? Wherever I have put just ?, I am currently unable to figure out where/how to get the data.

| column | constant | instances.vantage.sh data export | other source |
| --- | --- | --- | --- |
| id | | "API Name" | |
| manufacturer | AWS | | |
| CASE.case_type | rack | | |
| year | | | https://instancetyp.es/ |
| vcpu | | "vCPUs" | |
| platforme_vcpu | | ? | |
| CPU.units | | ? | |
| CPU.core_units | | ? | |
| CPU.name | | "Physical Processor", extrapolated | |
| CPU.manufacturer | | "Physical Processor", extrapolated | |
| CPU.model_range | | "Physical Processor", extrapolated | |
| CPU.family | | | aws webpage instance type description |
| CPU.tdp | | | lookup crowdsourcing/cpu_specs |
| CPU.manufacture_date | | | same as year? or cpu_specs.release_date? |
| instance.ram_capacity | | "Instance Memory" | |
| RAM.capacity | | ? | "Instance Memory" !=/= RAM.capacity * RAM.units / usage.instance_per_server ? existing data doesn't seem valid if |
| RAM.units | | ? | |
| SSD.units | | "Instance Storage", extrapolated | aws webpage instance type table |
| SSD.capacity | | "Instance Storage", extrapolated | aws webpage instance type table |
| HDD.units | | ? | |
| HDD.units | | ? | |
| GPU.name | | "GPU model" | |
| GPU.units | | "GPUs" | |
| GPU.TDP | | ? | |
| GPU.memory_capacity | | BAD or "GPU memory" / "GPUs" or "GPU memory" | |
| POWER_SUPPLY.units | 2;2;2 | | why? |
| POWER_SUPPLY.unit_weight | 2.99;1;5 | | why? |
| USAGE.instance_per_server | | metal.vCPU / "vCPUs", apparently does not hold for all types | |
| USAGE.time_workload | 50;0;100 | ? | why?? is it default;min;max? |
| USAGE.use_time_ratio | 1 | | |
| USAGE.hours_life_time | 35040 | | (4 years, but many have manufacture_date < 2019?) |
| USAGE.other_consumption_ratio | 0.33;0.2;0.6 | | (PUE?) source? what is x;y;z? |
| USAGE.overcommited | | | 0 or 1, which and why? |
| Warnings | | | |
| configuration.disk.units | | ? | |
| configuration.disk.type | | ? | |
| configuration.disk.capacity | | ? | |

da-ekchajzer commented 9 months ago

> So basically I should just leave empty if I can't find the data?

Yes and no :). I was a little too quick in my response. I would say yes as a first step, and then we will discuss how to account for unknown data (either a default value or a range).

Range (value;min;max)

To address uncertainty in situations where a value is not known or cannot be determined, we employ a default value accompanied by a minimum and maximum range. These parameters are utilized in the impact calculation procedure to evaluate a spectrum of potential impacts, including average, minimum, and maximum values. When specifying a range for a value within a CSV file, we format it as follows: value;min;max.
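To make the format concrete, here is a minimal sketch of how a `value;min;max` cell could be read. It is not taken from the API code: the names `RangedValue` and `parse_ranged` are hypothetical, and treating a single number as a range with identical bounds is an assumption made for the example.

```python
from typing import NamedTuple, Optional


class RangedValue(NamedTuple):
    value: float
    minimum: float
    maximum: float


def parse_ranged(cell: str) -> Optional[RangedValue]:
    """Parse 'value;min;max' or a single number; return None for a blank cell."""
    cell = cell.strip()
    if not cell:
        return None  # data left blank, to be completed later
    parts = [float(p) for p in cell.split(";")]
    if len(parts) == 1:
        # Assumption: a lone value is its own minimum and maximum.
        return RangedValue(parts[0], parts[0], parts[0])
    if len(parts) != 3:
        raise ValueError(f"expected 'value' or 'value;min;max', got {cell!r}")
    value, low, high = parts
    if not low <= value <= high:
        raise ValueError(f"value outside its range in {cell!r}")
    return RangedValue(value, low, high)


print(parse_ranged("2.99;1;5"))
print(parse_ranged("35040"))
```

Rejecting a value that falls outside its own range is a design choice of this sketch; it catches swapped fields such as `1;2.99;5` early.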

Existing data

> I have looked at the existing data to see if I can understand how existing data is extrapolated from external data sources. However, it seems that there are some errors in existing data. Can that be true?

It is definitely possible that some data are wrong, either because of inadvertent errors in the manual process or because the data have changed. Feel free to generate another file that follows the same format.

> If there are errors in existing data, should I then fix them?

Yes, feel free! You can also (if possible) add a source row to track where the data come from.

> Also it seems that for many instances, defaults are hardcoded into the datasheet, shouldn't they be omitted as you said? For example, POWER_SUPPLY.units is specified as 2;2;2 for all rows.

Yes, some mandatory data are hard-coded. You can leave them blank when you are not sure, and I will complete them on your PR. Don't hesitate to ask on a case-by-case basis if you want to understand our assumptions.

Regarding power supply, we assume that there are 2 units. I invite you to keep this assumption if you have no information to the contrary.

New data I add

What is important to understand is that the components described in the file correspond to the whole machine/platform (which is equivalent to the metal version of the EC2 type).

| column | Comments | constant | instances.vantage.sh data export | other source |
| --- | --- | --- | --- | --- |
| id | Yes | | "API Name" | |
| manufacturer | Yes | AWS | | |
| CASE.case_type | Yes | rack | | |
| year | Yes, but not used in the calculation, so I wouldn't make this data mandatory | | | https://instancetyp.es/ |
| vcpu | Yes | | "vCPUs" | |
| platforme_vcpu | platform == metal: metal.vCPU | | ? | |
| CPU.units | Corresponds to the number of CPUs for the platform. You can derive it from the number of vCPUs of the platform and the number of vCPUs for the given CPU name: platform_vcpu / nb_vcpu(cpu_name) | | ? | |
| CPU.core_units | Number of cores per CPU (usually 1 core == 2 vCPUs). If not provided, will be completed from crowdsourcing/cpu_specs | | ? | |
| CPU.name | Yes, should match a CPU in crowdsourcing/cpu_specs | | "Physical Processor", extrapolated | |
| CPU.manufacturer | Yes. If not provided, will be completed from crowdsourcing/cpu_specs | | "Physical Processor", extrapolated | |
| CPU.model_range | Yes. If not provided, will be completed from crowdsourcing/cpu_specs | | "Physical Processor", extrapolated | |
| CPU.family | Yes. If not provided, will be completed from crowdsourcing/cpu_specs (in cpu_specs, family == code_name) | | | aws webpage instance type description |
| CPU.tdp | Yes. If not provided, will be completed from crowdsourcing/cpu_specs | | | lookup crowdsourcing/cpu_specs |
| CPU.manufacture_date | Yes, but not used in the calculation, so I wouldn't make this data mandatory | | | same as year? or cpu_specs.release_date? |
| instance.ram_capacity | Yes | | "Instance Memory" | |
| RAM.capacity | See the RAM section below | | ? | "Instance Memory" !=/= RAM.capacity * RAM.units / usage.instance_per_server ? existing data doesn't seem valid if |
| RAM.units | See the RAM section below | | ? | |
| SSD.units | Yes, only if the instance hosts an SSD (no EBS) | | "Instance Storage", extrapolated | aws webpage instance type table |
| SSD.capacity | Yes, only if the instance hosts an SSD (no EBS) | | "Instance Storage", extrapolated | aws webpage instance type table |
| HDD.units | I think no instance has an HDD | | ? | |
| HDD.units | I think no instance has an HDD | | ? | |
| GPU.name | Yes. GPUs are not taken into account for now but will be soon, so feel free to collect this data | | "GPU model" | |
| GPU.units | Yes. GPUs are not taken into account for now but will be soon, so feel free to collect this data | | "GPUs" | |
| GPU.TDP | Will be completed from the name in future versions | | ? | |
| GPU.memory_capacity | Memory for 1 GPU | | BAD or "GPU memory" / "GPUs" or "GPU memory" | |
| POWER_SUPPLY.units | 1 + 1 backup | 2;2;2 | | why? |
| POWER_SUPPLY.unit_weight | Since we don't know this data, we use a range between 1 and 5 with an average of 2.99 | 2.99;1;5 | | why? |
| USAGE.instance_per_server | Yes. I would be interested to see the errors | | metal.vCPU / "vCPUs", does apparently not go for all types | |
| USAGE.time_workload | Yes. If not specified by users when the instance is requested, we will compute it for 0%, 50% and 100% | 50;0;100 | ? | why?? is it default;min;max? |
| USAGE.use_time_ratio | Yes, it means that it is up 100% of the time | 1 | | |
| USAGE.hours_life_time | It is a theoretical lifetime, used to allocate the embedded impacts | 35040 | | (4 years, but many has manufacture_date < 2019?) |
| USAGE.other_consumption_ratio | We only model the consumption of CPU and RAM. This ratio is used to account for the consumption of other components | 0.33;0.2;0.6 | | (PUE?) source? what is x;y;z ? |
| USAGE.overcommited | Boolean. Not used for now. If overcommitted, a vCPU might be shared, and its impacts should also be shared | | | 0 or 1, why which? |
| Warnings | | | | |
| configuration.disk.units | Legacy. It has been removed | | ? | |
| configuration.disk.type | Legacy. It has been removed | | ? | |
| configuration.disk.capacity | Legacy. It has been removed | | ? | |
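
The two platform-level ratios mentioned in the comments can be sketched as follows. This is only an illustration, not code from the API: the function names are hypothetical, and `nb_vcpu(cpu_name)` is approximated here as cores per CPU times vCPUs per core, with the "usually 1 core == 2 vCPUs" rule as the default.

```python
def instances_per_server(platform_vcpu, instance_vcpu):
    """USAGE.instance_per_server, read as metal.vCPU / vCPUs."""
    if instance_vcpu <= 0:
        raise ValueError("instance_vcpu must be positive")
    return platform_vcpu / instance_vcpu


def cpu_units(platform_vcpu, cores_per_cpu, vcpus_per_core=2):
    """CPU.units, read as platform_vcpu / nb_vcpu(cpu_name).

    Assumption: nb_vcpu(cpu_name) == cores_per_cpu * vcpus_per_core.
    """
    vcpus_per_cpu = cores_per_cpu * vcpus_per_core
    if vcpus_per_cpu <= 0:
        raise ValueError("cores_per_cpu and vcpus_per_core must be positive")
    return platform_vcpu / vcpus_per_cpu


# Values from the a1.2xlarge row in this thread:
# vcpu=8, platforme_vcpu=16, CPU.core_units=16.
print(instances_per_server(16, 8))          # 2.0
print(cpu_units(16, 16, vcpus_per_core=1))  # 1.0
```

One vCPU per core is passed for the a1.2xlarge example because that is what reproduces the row's CPU.units of 1; with the default of 2 the result would be 0.5, which is a hint that the default does not apply to that CPU.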

RAM

For the RAM, we need to know both the number of strips and the capacity of each strip for the platform. What we usually have is the total quantity of RAM for the EC2 metal instance. In this case, we look for a logical distribution of RAM strips (the largest possible capacity among 8 GB, 16 GB, 32 GB and 128 GB). Example:

r6g.metal : 512.0 GB ==> 4*128 GB

It's very unscientific, but it's the best we can do. Hence the warning "RAM.capacity not verified".
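
One possible reading of this heuristic, as a sketch only: pick the largest listed strip size that divides the platform total evenly. The function name `guess_ram_layout` and the `min_units` parameter are hypothetical; `min_units` is an assumption added because some existing rows appear to use more than one strip even when a single larger one would fit.

```python
STRIP_SIZES_GB = (8, 16, 32, 128)  # candidate capacities listed above


def guess_ram_layout(total_gb, sizes=STRIP_SIZES_GB, min_units=1):
    """Return (RAM.units, RAM.capacity) for a platform with total_gb of RAM.

    Tries the largest strip size first and keeps the first one that divides
    the total evenly while giving at least min_units strips.
    """
    for capacity in sorted(sizes, reverse=True):
        if total_gb % capacity == 0 and total_gb // capacity >= min_units:
            return int(total_gb // capacity), capacity
    raise ValueError(f"no listed strip size fits {total_gb} GB")


print(guess_ram_layout(256))
print(guess_ram_layout(32, min_units=2))
```

With `min_units=2`, a 32 GB platform comes out as 2 strips of 16 GB, which matches the a1.2xlarge row shown earlier in this thread. Any row filled this way should still carry the "RAM.capacity not verified" warning.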

Feel free to join our public chat if you want to discuss or set up a synchronous call: https://chat.boavizta.org/signup_user_complete/?id=97a1cpe35by49jdc66ej7ktrjc

da-ekchajzer commented 7 months ago

Missing instances have been added in PR https://github.com/Boavizta/boaviztapi/pull/237