Closed anilmurty closed 3 months ago
Per the Feb 20 call: we are leaning towards implementing full support for SHM (not just the workaround with bid attributes). @boz is planning to take this on (thanks, Adam!).
In the interim, @troian and @chainzero are going to look into the workaround: bid attributes plus a daemon running on the provider that checks the attributes and applies SHM using kubectl commands.
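A minimal sketch of what the "apply SHM using kubectl" step of such a daemon could look like, assuming the SHM size has already been read from a bid attribute. The function names and the default volume name `shm` are hypothetical; the only Kubernetes facts relied on are that an `emptyDir` volume with `medium: Memory` is RAM-backed and that `kubectl patch` accepts `--type=strategic`. The sketch only builds the patch and the command line; it does not run anything.

```python
import json


def build_shm_patch(container: str, size: str, volume_name: str = "shm") -> dict:
    """Build a strategic-merge patch that mounts a RAM-backed emptyDir at /dev/shm.

    `container` must match the name of the container inside the Deployment.
    `size` is a Kubernetes quantity such as "2Gi".
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "volumes": [
                        {
                            "name": volume_name,
                            "emptyDir": {"medium": "Memory", "sizeLimit": size},
                        }
                    ],
                    "containers": [
                        {
                            "name": container,
                            "volumeMounts": [
                                {"name": volume_name, "mountPath": "/dev/shm"}
                            ],
                        }
                    ],
                }
            }
        }
    }


def kubectl_patch_argv(namespace: str, deployment: str, patch: dict) -> list:
    """Return the argv a daemon could hand to subprocess.run (not executed here)."""
    return [
        "kubectl", "-n", namespace,
        "patch", "deployment", deployment,
        "--type=strategic",
        "-p", json.dumps(patch),
    ]


if __name__ == "__main__":
    # "lease-namespace" is a placeholder; the real namespace comes from the lease.
    patch = build_shm_patch(container="ssh", size="2Gi")
    print(" ".join(kubectl_patch_argv("lease-namespace", "ssh", patch)))
```

A strategic-merge patch is used here rather than a JSON patch because it merges `volumes` and `containers` by name, so it does not fail when the Deployment has no `volumes` list yet.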
March 4th, 2024:
March 12th, 2024:
Does not need a network upgrade. No SDL changes.
akash network 0.32.2 provider-services 0.5.9
shm doesn't seem to be working yet.
$ provider-services query provider get akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk -o text
attributes:
- key: host
  value: akash
- key: organization
  value: overclock
- key: datacenter
  value: hurricane
- key: capabilities/gpu/vendor/nvidia/model/t4
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/t4/ram/16Gi
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/t4/ram/16Gi/interface/pcie
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/t4/interface/pcie
  value: "true"
- key: capabilities/storage/1/class
  value: default
- key: capabilities/storage/1/persistent
  value: "true"
- key: capabilities/storage/2/class
  value: beta3
- key: capabilities/storage/2/persistent
  value: "true"
- key: capabilities/storage/3/class
  value: ram
- key: capabilities/storage/3/persistent
  value: "false"
- key: ip-lease
  value: "true"
host_uri: https://provider.hurricane.akash.pub:8443
info:
  email: hosting@ovrclk.com
  website: https://akash.network
owner: akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
---
version: "2.0"
services:
  ssh:
    image: ubuntu:22.04
    env:
      - 'SSH_PUBKEY=ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINNFxqDbY0BlEjJ2y9B2IKUUoimOq6oAC7WcsQT8qmII andrey@akash.network'
    command:
      - "sh"
      - "-c"
    args:
      - 'apt-get update;
        apt-get install -y --no-install-recommends -- ssh;
        mkdir -p -m0755 /run/sshd;
        mkdir -m700 ~/.ssh;
        echo "$SSH_PUBKEY" | tee ~/.ssh/authorized_keys;
        chmod 0600 ~/.ssh/authorized_keys;
        ls -lad ~ ~/.ssh ~/.ssh/authorized_keys;
        md5sum ~/.ssh/authorized_keys;
        exec /usr/sbin/sshd -D'
    params:
      storage:
        shm:
          mount: /dev/shm
    expose:
      - port: 8080
        as: 80
        to:
          - global: true
      # SSH
      - port: 22
        as: 22
        to:
          - global: true
profiles:
  compute:
    ssh:
      resources:
        cpu:
          units: 1
        memory:
          size: 4Gi
        storage:
          - size: 10Gi
          - name: shm
            size: 2Gi
            attributes:
              class: ram
  placement:
    akash:
      attributes:
        host: akash
        #organization: someorg
      #signedBy:
      #  anyOf:
      #    - "akash1365yvmc4s7awdyj3n2sav7xfx76adc6dnmlx63"
      pricing:
        ssh:
          denom: uakt
          amount: 1000000
deployment:
  ssh:
    akash:
      profile: ssh
      count: 1
E[2024-03-30|21:36:59.567] applying deployment cmp=provider client=kube err="Deployment.apps "ssh" is invalid: spec.template.spec.containers[0].volumeMounts[0].name: Not found: "ssh-shm"" lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15662958/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk service=ssh
E[2024-03-30|21:36:59.567] unable to deploy lid=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15662958/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk. last known state:
cmp=provider client=kube
E[2024-03-30|21:36:59.567] deploying workload module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15662958/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=akash err="Deployment.apps "ssh" is invalid: spec.template.spec.containers[0].volumeMounts[0].name: Not found: "ssh-shm""
E[2024-03-30|21:36:59.567] execution error module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15662958/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=akash state=deploy-active err="Deployment.apps "ssh" is invalid: spec.template.spec.containers[0].volumeMounts[0].name: Not found: "ssh-shm""
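The error above is the Kubernetes API server rejecting a Deployment whose container mounts a volume named `ssh-shm` that is not declared under the pod's `volumes`. As a hedged illustration (a simplified re-creation of that one validation rule, not Kubernetes' or the provider's actual code), the check can be sketched as follows; the function name is hypothetical.

```python
from typing import List


def missing_volume_mounts(pod_spec: dict) -> List[str]:
    """Return one message per volumeMount whose name has no matching volume."""
    declared = {v["name"] for v in pod_spec.get("volumes", [])}
    errors = []
    for ci, container in enumerate(pod_spec.get("containers", [])):
        for mi, mount in enumerate(container.get("volumeMounts", [])):
            if mount["name"] not in declared:
                errors.append(
                    f"spec.template.spec.containers[{ci}].volumeMounts[{mi}].name: "
                    f'Not found: "{mount["name"]}"'
                )
    return errors


if __name__ == "__main__":
    # The mount is rendered, but no volume named "ssh-shm" is declared.
    broken = {
        "containers": [
            {"name": "ssh", "volumeMounts": [{"name": "ssh-shm", "mountPath": "/dev/shm"}]}
        ],
    }
    for line in missing_volume_mounts(broken):
        print(line)
```

In other words, the provider rendered the mount for the `shm` volume but not the volume itself, which is consistent with SHM support not yet being complete in this provider version.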
@troian
With two volumes, a persistent volume plus shm (class ram), I'm getting "manifest version validation failed" from the provider.
SDL:
---
version: "2.0"
services:
  ssh:
    image: ubuntu:22.04
    env:
      - 'SSH_PUBKEY=ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINNFxqDbY0BlEjJ2y9B2IKUUoimOq6oAC7WcsQT8qmII andrey@akash.network'
    command:
      - "sh"
      - "-c"
    args:
      - 'apt-get update;
        apt-get install -y --no-install-recommends -- ssh;
        mkdir -p -m0755 /run/sshd;
        mkdir -m700 ~/.ssh;
        echo "$SSH_PUBKEY" | tee ~/.ssh/authorized_keys;
        chmod 0600 ~/.ssh/authorized_keys;
        ls -lad ~ ~/.ssh ~/.ssh/authorized_keys;
        md5sum ~/.ssh/authorized_keys;
        exec /usr/sbin/sshd -D'
    params:
      storage:
        data:
          mount: /root
        shm:
          mount: /dev/shm
    expose:
      - port: 8080
        as: 80
        to:
          - global: true
      # SSH
      - port: 22
        as: 22
        to:
          - global: true
profiles:
  compute:
    ssh:
      resources:
        cpu:
          units: 1
        memory:
          size: 4Gi
        storage:
          - size: 10Gi
          - name: data
            size: 5Gi
            attributes:
              persistent: true
              class: beta3
          - name: shm
            size: 2Gi
            attributes:
              class: ram
  placement:
    akash:
      attributes:
        host: akash
        #organization: someorg
      #signedBy:
      #  anyOf:
      #    - "akash1365yvmc4s7awdyj3n2sav7xfx76adc6dnmlx63"
      pricing:
        ssh:
          denom: uakt
          amount: 1000000
deployment:
  ssh:
    akash:
      profile: ssh
      count: 1
Client:
provider-services 0.5.9
arno@x1:~/git/akash-tools/cli-booster[https://rpc.akashnet.net:443][default][]$ akash_deploy ssh-shm-and-pers.yaml
INFO: Broadcasting 'provider-services deployment create -y --deposit 500000uakt -- ssh-shm-and-pers.yaml' transaction...
INFO: Waiting for the TX 1CCD212E8E216E23A168B32C922CDAED988C9710F2C15FF5FE601A61B6069BAB to get processed by the Akash network
INFO: Success
arno@x1:~/git/akash-tools/cli-booster[https://rpc.akashnet.net:443][default][15663077--1]$ akash_accept
rate monthly usd dseq/gseq/oseq provider host
0> 1.00 0.42 $2.05 15663077/1/1 akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk provider.hurricane.akash.pub:8443
Choose your bid from the list [0]: 0
INFO: Accepting the bid offered by akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk provider for 15663077/1/1 deployment
INFO: Broadcasting 'provider-services market lease create -y' transaction...
INFO: Waiting for the TX 8C491ED0121C492912A0F38D57725758D66ABA16E9184531DF16EA4E70A976E4 to get processed by the Akash network
INFO: Success
8C491ED0121C492912A0F38D57725758D66ABA16E9184531DF16EA4E70A976E4
arno@x1:~/git/akash-tools/cli-booster[https://rpc.akashnet.net:443][default][15663077-1-1]$ akash_send_manifest ssh-shm-and-pers.yaml
Detected provider for 15663077/1/1: akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
[{"provider":"akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk","status":"FAIL","error":"remote server returned 500","errorMessage":"manifest version validation failed\n"}]
Error: submit manifest to some providers has been failed
ERROR: provider-services send-manifest failed with '1' code.
Provider (v0.5.9):
$ kubectl -n akash-services logs akash-provider-0 --tail=100 -f | grep -Evi 'check|result|IP|replicas|dump'
Defaulted container "provider" out of: provider, init (init)
I[2024-03-30|21:47:55.201] order detected module=bidengine-service cmp=provider order=order/akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
I[2024-03-30|21:47:55.203] group fetched module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
I[2024-03-30|21:47:55.203] requesting reservation module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
D[2024-03-30|21:47:55.203] reservation requested. order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1, resources=[{"resource":{"id":1,"cpu":{"units":{"val":"1000"}},"memory":{"size":{"val":"4294967296"}},"storage":[{"name":"shm","size":{"val":"2147483648"},"attributes":[{"key":"class","value":"ram"},{"key":"persistent","value":"false"}]},{"name":"data","size":{"val":"5368709120"},"attributes":[{"key":"class","value":"beta3"},{"key":"persistent","value":"true"}]},{"name":"default","size":{"val":"10737418240"}}],"gpu":{"units":{"val":"0"}},"endpoints":[{"kind":1,"sequence_number":0},{"sequence_number":0}]},"count":1,"price":{"denom":"uakt","amount":"1000000.000000000000000000"}}] module=provider-cluster cmp=provider cmp=service cmp=inventory-service
D[2024-03-30|21:47:55.203] reservation count module=provider-cluster cmp=provider cmp=service cmp=inventory-service cnt=1
I[2024-03-30|21:47:55.203] Reservation fulfilled module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
D[2024-03-30|21:47:55.205] submitting fulfillment module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1 price=1.000000000000000000uakt
I[2024-03-30|21:48:01.322] bid complete module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
I[2024-03-30|21:48:13.520] lease won module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1 lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-03-30|21:48:13.520] shutting down module=bidengine-order cmp=provider order=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1
I[2024-03-30|21:48:13.520] lease won module=provider-manifest cmp=provider lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-03-30|21:48:13.520] new lease module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
D[2024-03-30|21:48:13.521] watchdog start module=provider-manifest cmp=provider leaseID=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-03-30|21:48:13.525] data received module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 version=9d34f853c8a02e32abd64aaec0900c67abbdf9be5584177c33753198638d8ab3
I[2024-03-30|21:48:20.899] watchdog done module=provider-manifest cmp=provider lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077
I[2024-03-30|21:48:20.899] manifest received module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077
I[2024-03-30|21:48:20.901] data received module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 version=9d34f853c8a02e32abd64aaec0900c67abbdf9be5584177c33753198638d8ab3
I[2024-03-30|21:48:20.901] deployment version mismatch module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 expected=9D34F853C8A02E32ABD64AAEC0900C67ABBDF9BE5584177C33753198638D8AB3 got=95A422D963420A7C974C8F8B8EC0569CDA801EFB2B1C3595634F5891B5030A4E
E[2024-03-30|21:48:20.901] invalid manifest: %s module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 manifestversionvalidationfailed=(MISSING)
D[2024-03-30|21:48:20.901] requests valid module=manifest-manager cmp=provider deployment=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/15663077 num-requests=0
E[2024-03-30|21:48:20.901] manifest submit failed cmp=provider err="manifest version validation failed"
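The log shows the provider comparing the version recorded for the deployment (`9d34f8…`) with a hash it computes from the submitted manifest (`95A422…`) and rejecting the manifest when they differ. The sketch below is a simplified illustration of that kind of check, not the provider's actual code: the canonical JSON serialization and the function names are assumptions. It only demonstrates that any difference in the serialized manifest, for example one side filling in a default attribute that the other omits, yields a different digest. This thread does not establish what caused the mismatch here.

```python
import hashlib
import json


def manifest_version(manifest) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def versions_match(expected_hex: str, got_hex: str) -> bool:
    """Compare hex digests case-insensitively, as the log mixes upper and lower case."""
    return expected_hex.lower() == got_hex.lower()


if __name__ == "__main__":
    # Two hypothetical encodings of the same shm volume; only the second one
    # spells out the default `persistent: false` attribute.
    client_side = {"name": "shm", "size": "2147483648",
                   "attributes": [{"key": "class", "value": "ram"}]}
    provider_side = {"name": "shm", "size": "2147483648",
                     "attributes": [{"key": "class", "value": "ram"},
                                    {"key": "persistent", "value": "false"}]}
    print(versions_match(manifest_version(client_side),
                         manifest_version(provider_side)))
```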
I've tested provider-services 0.5.11 and everything works there.
Details: https://github.com/akash-network/helm-charts/pull/268
Is your feature request related to a problem? Please describe.
Customers (particularly those running AI/ML training workloads) frequently need multiple services to share storage. For example, one service that downloads and labels data is CPU bound, while another that uses the data for training is GPU bound; the two can run in parallel but need access to a large shared memory. We currently don't let the user control the maximum SHM size, which makes it hard to run such workloads.
Describe the solution you'd like
Support specifying and requesting the SHM size as part of the SDL.
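As a rough sketch of how an SHM request could be read back out of an SDL shaped like the ones posted in this thread (a `class: ram` storage entry in the compute profile, mounted at `/dev/shm` through `params.storage`): the function names are hypothetical, the SDL is assumed to be already parsed into plain dicts, and only binary suffixes such as `Gi` are handled.

```python
_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4}


def parse_size(text: str) -> int:
    """Convert a size such as "2Gi" to bytes (binary suffixes only)."""
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def shm_requests(sdl: dict) -> dict:
    """Map service name -> bytes requested for a class "ram" volume mounted at /dev/shm."""
    found = {}
    for service, placements in sdl["deployment"].items():
        mounts = sdl["services"][service].get("params", {}).get("storage", {})
        for placement in placements.values():
            profile = sdl["profiles"]["compute"][placement["profile"]]
            for volume in profile["resources"]["storage"]:
                name = volume.get("name", "default")
                is_ram = volume.get("attributes", {}).get("class") == "ram"
                if is_ram and mounts.get(name, {}).get("mount") == "/dev/shm":
                    found[service] = parse_size(volume["size"])
    return found


if __name__ == "__main__":
    # Trimmed-down dict form of the single-volume SDL from this thread.
    sdl = {
        "services": {"ssh": {"params": {"storage": {"shm": {"mount": "/dev/shm"}}}}},
        "profiles": {"compute": {"ssh": {"resources": {"storage": [
            {"size": "10Gi"},
            {"name": "shm", "size": "2Gi", "attributes": {"class": "ram"}},
        ]}}}},
        "deployment": {"ssh": {"akash": {"profile": "ssh", "count": 1}}},
    }
    print(shm_requests(sdl))  # {'ssh': 2147483648}
```

The 2Gi request resolves to 2147483648 bytes, the same value the provider reports for the `shm` volume in its reservation log.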
Describe alternatives you've considered
Additional context
No response