Missing instance types in aws.csv

JacobValdemar commented 9 months ago

Bug description

aws.csv is missing around 300 instance types. See this gist.

I would like to help add the missing instance types, but I can't figure out where the data comes from, or whether there is a script for adding new types or updating the file. A description of how you normally find the relevant data for the instance types would be incredibly helpful.

Expected behavior

That all instance types were in the file.

da-ekchajzer commented 9 months ago

Thanks for wanting to contribute!

Process for adding instances

Add a new line to https://github.com/Boavizta/boaviztapi/blob/dev/boaviztapi/data/archetypes/cloud/aws.csv. The id corresponds to the name of the instance. The format is as follows:

id	manufacturer	CASE.case_type	year	vcpu	platforme_vcpu	CPU.units	CPU.core_units	CPU.name	CPU.manufacturer	CPU.model_range	CPU.family	CPU.tdp	CPU.manufacture_date	instance.ram_capacity	RAM.capacity	RAM.units	SSD.units	SSD.capacity	HDD.units	GPU.name	GPU.units	GPU.TDP	GPU.memory_capacity	POWER_SUPPLY.units	POWER_SUPPLY.unit_weight	USAGE.instance_per_server	USAGE.time_workload	USAGE.use_time_ratio	USAGE.hours_life_time	USAGE.other_consumption_ratio	USAGE.overcommited	Warnings	configuration.disk.units	configuration.disk.type	configuration.disk.capacity
a1.2xlarge	AWS	rack	2018	8	16	1	16	Graviton	Annapurna Labs	Graviton	Graviton	40	2018	16	16	2	0	0	0					2;2;2	2.99;1;5	2	50;0;100	1	35040	0.33;0.2;0.6	0		0

If the CPU of the instance is not listed in https://github.com/Boavizta/boaviztapi/blob/dev/boaviztapi/data/crowdsourcing/cpu_specs.csv, we recommend adding it there if you have some information on the CPU. Otherwise, the completion process will be run each time the instance is requested.
When a piece of information is not known, leave the field empty and the API will complete it.
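As an illustration of "leave it empty", here is a minimal Python sketch that builds a new line from only the fields you know and leaves every other column blank. The column list is copied from the header above; the helper names (`make_row`, `to_csv_line`) and the comma delimiter are assumptions for this example, not part of the project, so check the delimiter against the actual file.

```python
import csv
import io

# Column names copied from the header row shown above.
COLUMNS = [
    "id", "manufacturer", "CASE.case_type", "year", "vcpu", "platforme_vcpu",
    "CPU.units", "CPU.core_units", "CPU.name", "CPU.manufacturer",
    "CPU.model_range", "CPU.family", "CPU.tdp", "CPU.manufacture_date",
    "instance.ram_capacity", "RAM.capacity", "RAM.units", "SSD.units",
    "SSD.capacity", "HDD.units", "GPU.name", "GPU.units", "GPU.TDP",
    "GPU.memory_capacity", "POWER_SUPPLY.units", "POWER_SUPPLY.unit_weight",
    "USAGE.instance_per_server", "USAGE.time_workload", "USAGE.use_time_ratio",
    "USAGE.hours_life_time", "USAGE.other_consumption_ratio",
    "USAGE.overcommited", "Warnings", "configuration.disk.units",
    "configuration.disk.type", "configuration.disk.capacity",
]


def make_row(known):
    """Return a full row; columns missing from `known` stay empty."""
    unexpected = set(known) - set(COLUMNS)
    if unexpected:
        raise KeyError(f"unknown columns: {sorted(unexpected)}")
    return {col: known.get(col, "") for col in COLUMNS}


def to_csv_line(row, delimiter=","):
    """Serialize one row; the delimiter is an assumption, adjust as needed."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter=delimiter,
                            lineterminator="\n")
    writer.writerow(row)
    return buf.getvalue()


# Only the fields we are sure about are filled in.
row = make_row({
    "id": "a1.2xlarge",
    "manufacturer": "AWS",
    "CASE.case_type": "rack",
    "vcpu": "8",
    "CPU.name": "Graviton",
})
line = to_csv_line(row)
```

Rejecting unknown column names catches typos such as `CPU.TDP` before they silently produce an empty field.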

Datasource

Most configuration information can be found in the AWS documentation or on third-party services. For example, for c5.12xlarge:

AWS doc: https://aws.amazon.com/ec2/instance-types/c5/#Product_Details
Third party: https://instances.vantage.sh/aws/ec2/c5.12xlarge

Automation

We don't have an automated process for populating our database. If you wish to automate it, feel free to provide a Jupyter notebook.
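A first step for such a notebook could be to list which instance types are missing. The sketch below is only an illustration: it assumes both files are comma-separated, that the archetype file has an `id` column, and that the third-party export has an `API Name` column (as in the instances.vantage.sh export). The sample data is made up for the example.

```python
import csv
import io


def read_column(csv_text, column, delimiter=","):
    """Collect the non-empty values of one column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return {row[column].strip() for row in reader if row.get(column)}


def missing_instance_types(archetype_csv, export_csv):
    """Instance names present in the export but absent from the archetype file."""
    known = read_column(archetype_csv, "id")
    available = read_column(export_csv, "API Name")
    return sorted(available - known)


# Tiny made-up samples standing in for aws.csv and a third-party export.
archetype_sample = "id,manufacturer\na1.2xlarge,AWS\nc5.12xlarge,AWS\n"
export_sample = "API Name,vCPUs\na1.2xlarge,8\nc5.12xlarge,48\nc6g.large,2\n"

missing = missing_instance_types(archetype_sample, export_sample)
print(missing)
```

With real files, the two strings would be replaced by the contents of aws.csv and of the export.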

I hope this answers all your questions. Feel free to ask more.

JacobValdemar commented 9 months ago

@da-ekchajzer thanks for the description! So basically I should just leave empty if I can't find the data?

Existing data

I have looked at the existing data to see if I can understand how it is extrapolated from external data sources. However, it seems that there are some errors in the existing data. Can that be true? If there are errors in existing data, should I then fix them? Also, it seems that for many instances, defaults are hardcoded into the datasheet. Shouldn't they be omitted, as you said? For example, POWER_SUPPLY.units is specified as 2;2;2 for all rows.

Also, should Previous Generation Instances be removed? Such as c1.medium and c1.xlarge?

New data I add

I have tried to create a table describing where to get data for the columns. Does it seem right? All the places where I have put just `?` mean I am currently unable to figure out where/how to get the data.

column	constant	instances.vantage.sh data export
id		"API Name"
manufacturer	AWS
CASE.case_type	rack
year			https://instancetyp.es/
vcpu		"vCPUs"
platforme_vcpu		?
CPU.units		?
CPU.core_units		?
CPU.name		"Physical Processor", extrapolated
CPU.manufacturer		"Physical Processor", extrapolated
CPU.model_range		"Physical Processor", extrapolated
CPU.family			aws webpage instance type description
CPU.tdp			lookup crowdsourcing/cpu_specs
CPU.manufacture_date			same as year? or cpu_specs.release_date?
instance.ram_capacity		"Instance Memory"
RAM.capacity		? "Instance Memory" !=/= RAM.capacity * RAM.units / usage.instance_per_server ?	existing data doesn't seem valid if
RAM.units		?
SSD.units		"Instance Storage", extrapolated	aws webpage instance type table
SSD.capacity		"Instance Storage", extrapolated	aws webpage instance type table
HDD.units		?
GPU.name		"GPU model"
GPU.units		"GPUs"
GPU.TDP		?
GPU.memory_capacity		BAD or "GPU memory" / "GPUs" or "GPU memory"
POWER_SUPPLY.units	2;2;2 why?
POWER_SUPPLY.unit_weight	2.99;1;5 why?
USAGE.instance_per_server		metal.vCPU / "vCPUs", does apparently not go for all types
USAGE.time_workload	50;0;100 ? why?? is it default;min;max?
USAGE.use_time_ratio	1
USAGE.hours_life_time	35040 (4 years, but many have manufacture_date < 2019?)
USAGE.other_consumption_ratio	0.33;0.2;0.6 (PUE?) source? what is x;y;z ?
USAGE.overcommited	0 or 1, why which?
Warnings
configuration.disk.units		?
configuration.disk.type		?
configuration.disk.capacity		?

da-ekchajzer commented 9 months ago

So basically I should just leave empty if I can't find the data?

Yes and no :). I was a little too quick in my response. I would say yes as a first step, and then we will discuss how to account for unknown data (either a default value or a range).

Range (value;min;max)

To address uncertainty in situations where a value is not known or cannot be determined, we employ a default value accompanied by a minimum and maximum range. These parameters are utilized in the impact calculation procedure to evaluate a spectrum of potential impacts, including average, minimum, and maximum values. When specifying a range for a value within a CSV file, we format it as follows: value;min;max.
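To make the format concrete, here is a minimal Python sketch of how a `value;min;max` cell could be parsed. The names `RangedValue` and `parse_range` are hypothetical and not taken from the API code. Treating a single number as a range with identical bounds, and an empty cell as "unknown", are assumptions made for this example.

```python
from typing import NamedTuple, Optional


class RangedValue(NamedTuple):
    value: float
    minimum: float
    maximum: float


def parse_range(cell: str) -> Optional[RangedValue]:
    """Parse 'value;min;max' (or a single number) from a CSV cell."""
    cell = cell.strip()
    if not cell:
        return None  # assumption: empty means "unknown, to be completed"
    parts = [float(p) for p in cell.split(";")]
    if len(parts) == 1:
        # Assumption: a lone number is a range with identical bounds.
        return RangedValue(parts[0], parts[0], parts[0])
    if len(parts) == 3:
        value, minimum, maximum = parts
        if not minimum <= value <= maximum:
            raise ValueError(f"value outside its range: {cell!r}")
        return RangedValue(value, minimum, maximum)
    raise ValueError(f"expected 1 or 3 fields, got {len(parts)}: {cell!r}")


print(parse_range("2.99;1;5"))
print(parse_range("50;0;100"))
```

The range check catches cells where the three numbers were entered in the wrong order.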

Existing data

I have looked at the existing data to see if I can understand how it is extrapolated from external data sources. However, it seems that there are some errors in the existing data. Can that be true?

It is definitely possible that some data is wrong, either because of inadvertent errors in the manual process or because the data has changed. Feel free to generate another file that follows the same format.

If there are errors in existing data, should I then fix them?

Yes, feel free! You can also (if possible) add a source row to track where the data comes from.

Also it seems that for many instances, defaults are hardcoded into the datasheet, shouldn't they be omitted as you said? For example, POWER_SUPPLY.units is specified as 2;2;2 for all rows.

Yes, some mandatory data are hard-coded. You can leave them blank when you are not sure, and I will complete them on your PR. Don't hesitate to ask on a case-by-case basis if you want to understand our assumptions.

Regarding power supply, we assume that there are 2 units. I invite you to keep this assumption if you have no information to the contrary.

New data I add

The important thing to understand is that the components described in the file correspond to the whole machine/platform (which is equivalent to the metal version of the EC2 type).

column	Comments	constant	instances.vantage.sh data export	other source
id	yes		"API Name"
manufacturer	yes	AWS
CASE.case_type	yes	rack
year	yes but not used in the calculation, so I wouldn't make this data mandatory			https://instancetyp.es/
vcpu	yes		"vCPUs"
platforme_vcpu	platform == metal: metal.vCPU		?
CPU.units	Corresponds to the number of CPUs in the platform. You can derive it from the number of vCPUs of the platform and the number of vCPUs provided by one CPU of the given name: platform_vcpu / nb_vcpu(cpu_name)		?
CPU.core_units	Number of cores per CPU (usually 1 core == 2 vCPUs), if not provided will be completed from crowdsourcing/cpu_specs		?
CPU.name	Yes, should match a CPU in crowdsourcing/cpu_specs		"Physical Processor", extrapolated
CPU.manufacturer	Yes, if not provided will be completed from crowdsourcing/cpu_specs		"Physical Processor", extrapolated
CPU.model_range	Yes, if not provided will be completed from crowdsourcing/cpu_specs		"Physical Processor", extrapolated
CPU.family	Yes, if not provided will be completed from crowdsourcing/cpu_specs (in cpu_specs family == code_name)			aws webpage instance type description
CPU.tdp	Yes, if not provided will be completed from crowdsourcing/cpu_specs			lookup crowdsourcing/cpu_specs
CPU.manufacture_date	yes but not used in the calculation, so I wouldn't make this data mandatory			same as year? or cpu_specs.release_date?
instance.ram_capacity	yes		"Instance Memory"
RAM.capacity	See ### RAM		? "Instance Memory" !=/= RAM.capacity * RAM.units / usage.instance_per_server ?	existing data doesn't seem valid if
RAM.units	See ### RAM		?
SSD.units	Yes, only if the instance hosts an SSD (no EBS)		"Instance Storage", extrapolated	aws webpage instance type table
SSD.capacity	Yes, only if the instance hosts an SSD (no EBS)		"Instance Storage", extrapolated	aws webpage instance type table
HDD.units	I think no instance has an HDD		?
GPU.name	Yes, GPUs are not taken into account for now but will be soon, so feel free to collect this data		"GPU model"
GPU.units	Yes, GPUs are not taken into account for now but will be soon, so feel free to collect this data		"GPUs"
GPU.TDP	Will be completed from name in future versions		?
GPU.memory_capacity	Memory for 1 GPU		BAD or "GPU memory" / "GPUs" or "GPU memory"
POWER_SUPPLY.units	1 + 1 backup	2;2;2 why?
POWER_SUPPLY.unit_weight	Since we don't know this data we have a range between 1 and 5 with an average of 2.99	2.99;1;5 why?
USAGE.instance_per_server	Yes. I would be interested to see the errors		metal.vCPU / "vCPUs", does apparently not go for all types
USAGE.time_workload	Yes. If not specified by users when the instance is requested, we will compute it for 0%, 50% and 100%	50;0;100 ? why?? is it default;min;max?
USAGE.use_time_ratio	Yes, it means that it's up 100% of the time.	1
USAGE.hours_life_time	It is a theoretical life duration. Used to allocate the embedded impacts.	35040 (4 years, but many have manufacture_date < 2019?)
USAGE.other_consumption_ratio	We only model the consumption of CPU and RAM. This ratio is used to account for the consumption of other components.	0.33;0.2;0.6 (PUE?) source? what is x;y;z ?
USAGE.overcommited	Boolean. Not used for now. If overcommitted a vCPU might be shared. Its impacts should also be shared.	0 or 1, why which?
Warnings
configuration.disk.units	Legacy. It has been removed		?
configuration.disk.type	Legacy. It has been removed		?
configuration.disk.capacity	Legacy. It has been removed		?
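The two derived columns in the table above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, and `threads_per_core` stands in for the "usually 1 core == 2 vCPUs" rule (it is 1 for CPUs without simultaneous multithreading, such as Graviton).

```python
def instances_per_server(platform_vcpu, instance_vcpu):
    """USAGE.instance_per_server = metal.vCPU / instance vCPUs."""
    if instance_vcpu <= 0:
        raise ValueError("instance_vcpu must be positive")
    return platform_vcpu / instance_vcpu


def cpu_units(platform_vcpu, cores_per_cpu, threads_per_core=2):
    """CPU.units = platform_vcpu / nb_vcpu(cpu_name).

    nb_vcpu(cpu_name) is approximated here as cores * threads per core,
    which is an assumption of this sketch.
    """
    vcpu_per_cpu = cores_per_cpu * threads_per_core
    if vcpu_per_cpu <= 0:
        raise ValueError("cores_per_cpu and threads_per_core must be positive")
    return platform_vcpu / vcpu_per_cpu


# Values from the a1.2xlarge example row: 8 vCPUs on a 16-vCPU platform
# with one 16-core CPU and no multithreading.
print(instances_per_server(platform_vcpu=16, instance_vcpu=8))
print(cpu_units(platform_vcpu=16, cores_per_cpu=16, threads_per_core=1))
```

A non-integer result is a hint that one of the inputs (platform vCPUs, core count, or threads per core) is wrong for that instance family.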

RAM

For the RAM we need to know both the number of strips (modules) and the capacity of each strip for the platform. What we usually have is the total quantity of RAM for the EC2 metal instance. In this case, we look for a logical distribution of RAM strips (the largest capacity possible among 8 GB, 16 GB, 32 GB and 128 GB). Example:

r6g.metal : 512.0 GB ==> 4*128 GB

It's very unscientific, but it's the best we can do. Hence the warning "RAM.capacity not verified".
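One possible reading of this heuristic, sketched in Python: pick the largest strip size from the list that evenly divides the platform's total RAM. The function name and the "must divide evenly" rule are assumptions made for this example. The existing rows may have been filled in with additional judgment, so treat the output as a starting guess.

```python
# Candidate strip capacities in GB, as listed in the comment above.
STRIP_SIZES_GB = (8, 16, 32, 128)


def guess_ram_layout(total_gb):
    """Return (RAM.units, RAM.capacity) for a platform, or None if no fit.

    Assumption: the largest listed strip size that divides the total evenly.
    """
    for size in sorted(STRIP_SIZES_GB, reverse=True):
        if total_gb >= size and total_gb % size == 0:
            return total_gb // size, size
    return None


print(guess_ram_layout(256))  # 2 strips of 128 GB
print(guess_ram_layout(64))   # 2 strips of 32 GB
```

Totals that none of the listed sizes divide return `None` and need to be handled by hand.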

Feel free to join our public chat if you want to discuss this or set up a call: https://chat.boavizta.org/signup_user_complete/?id=97a1cpe35by49jdc66ej7ktrjc

da-ekchajzer commented 7 months ago

The missing instances have been added in PR https://github.com/Boavizta/boaviztapi/pull/237
