akash-network / support

Akash Support and Issue Tracking

Increase max limits (per deployment) for CPU, memory and storage #140

Closed anilmurty closed 8 months ago

anilmurty commented 8 months ago

Akash deployments are currently capped at the following limits (ref: https://github.com/akash-network/akash-api/blob/ea71fbd0bee740198034bf1b0261c90baea88be0/go/node/deployment/v1beta3/validation_config.go#L45):

MaxUnitCPU:     256 * 1000, // 256 CPUs
MaxUnitGPU:     100,
MaxUnitMemory:  512 * unit.Gi, // 512 Gi
MaxUnitStorage: 32 * unit.Ti,  // 32 Ti
MaxUnitCount:   50,
MaxUnitPrice:   10000000, // 10akt

MinUnitCPU:     10,
MinUnitGPU:     0,
MinUnitMemory:  unit.Mi,
MinUnitStorage: 5 * unit.Mi,
MinUnitCount:   1,

MaxGroupCount: 20,
MaxGroupUnits: 20,

MaxGroupCPU:     512 * 1000,
MaxGroupGPU:     512,
MaxGroupMemory:  1024 * unit.Gi,
MaxGroupStorage: 32 * unit.Ti,

This means the most vCPUs a single deployment can request is 512. This is a fairly severe limitation for AI training workloads, which sometimes need more CPUs. Memory has a similar issue: we limit it to 1024 Gi, and AI workloads need to hold large amounts of data in memory.

These limits are Akash-specific; base Kubernetes (k8s) supports higher limits. They were introduced when we launched the initial mainnet and were put in place as a safeguard against misuse. Now that we have the ability to whitelist deployment wallets per provider (to protect against misuse), I think it is safe to increase these limits.

@troian is currently researching good new limits to set and will update this issue when he has a recommendation, but the immediate need is for a customer to be able to request 1024 vCPUs and 4096 GB of memory.
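To make the gap concrete, here is a minimal Go sketch that compares the customer request above against the group limits quoted earlier. The helper `checkGroup` is hypothetical and is not the akash-api validation code; it also assumes the "4096GB" in the request means 4096 Gi.

```go
package main

import "fmt"

// Group limits copied from the config quoted in this issue.
// CPU is expressed in millicores (1000 = 1 CPU), memory in bytes.
const (
	gi = uint64(1) << 30

	maxGroupCPU    = uint64(512 * 1000)
	maxGroupMemory = 1024 * gi
)

// checkGroup is a hypothetical helper (not part of akash-api).
// It returns one message per quoted group limit that the request exceeds.
func checkGroup(cpuMilli, memBytes uint64) []string {
	var violations []string
	if cpuMilli > maxGroupCPU {
		violations = append(violations,
			fmt.Sprintf("cpu %d exceeds MaxGroupCPU %d", cpuMilli, maxGroupCPU))
	}
	if memBytes > maxGroupMemory {
		violations = append(violations,
			fmt.Sprintf("memory %d exceeds MaxGroupMemory %d", memBytes, maxGroupMemory))
	}
	return violations
}

func main() {
	// Customer request: 1024 vCPUs and 4096 GB of memory
	// (assumed here to mean 4096 Gi).
	for _, v := range checkGroup(1024*1000, 4096*gi) {
		fmt.Println(v)
	}
}
```

Under these assumptions the request exceeds both limits: twice the CPU cap and four times the memory cap.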

anilmurty commented 8 months ago

Nov 7:

Plan to set MaxUnitCPU to 384.

Plan to set MaxGroupCPU to MaxUnitCPU*MaxGroupCount (we were not doing this in the past).
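As a rough illustration of what that derivation would yield, the Go sketch below multiplies the planned unit limit by the existing MaxGroupCount of 20. It assumes 384 is expressed in CPUs and scaled to millicores the same way as the current 256 * 1000 value; `derivedMaxGroupCPU` is a hypothetical name, and the values actually released may differ.

```go
package main

import "fmt"

const (
	plannedMaxUnitCPU = 384 * 1000 // planned MaxUnitCPU, assumed to be in millicores
	maxGroupCount     = 20         // MaxGroupCount from the quoted config
)

// derivedMaxGroupCPU is a hypothetical helper that derives the group CPU
// limit as the per-unit limit multiplied by the group count.
func derivedMaxGroupCPU(unitCPUMilli, groupCount uint64) uint64 {
	return unitCPUMilli * groupCount
}

func main() {
	limit := derivedMaxGroupCPU(plannedMaxUnitCPU, maxGroupCount)
	fmt.Printf("MaxGroupCPU = %d millicores (%d CPUs)\n", limit, limit/1000)
}
```

With these inputs the derived limit is 7,680,000 millicores, or 7680 CPUs, which would cover the 1024 vCPU request.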

Separately: we should also consider increasing the number of volumes that can be mounted per node (we currently support one persistent and one ephemeral). Can take this up as a separate issue.

The challenge with this is that while it doesn't require a network upgrade

@brewsterdrinkwater - we will need validators to upgrade ahead of the mainnet upgrade

troian commented 8 months ago

released in node v0.26.2