akash-network / community

Starting point for joining and contributing to building Akash Network
MIT License

[Provider Audit]: gpu3090.ddns.net #681

Closed jakestanton2016 closed 1 month ago

jakestanton2016 commented 2 months ago

Prerequisite Steps:

1. Make sure your provider has community provider attributes and your contact details (email, website):

  Example:
  $ provider-services query provider get akash1<REDACTED> -o text
  ...
  attributes:
  ...
  - key: host
    value: akash
  - key: tier
    value: community
  info:
    email: "<your email>"
    website: "<your website>"

Ref documentation:.

2. Make sure your provider *.ingress resolves to your provider IP (ideally worker node IP)

host <anything>.ingress.<yourdomain>

Example:

$ host anything.ingress.akash.pro
anything.ingress.akash.pro is an alias for nodes.akash.pro.
nodes.akash.pro has address 65.108.6.185
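The lookup in step 2 can also be scripted. What follows is a minimal Python sketch, not part of the audit template. The names `resolve_ipv4` and `ingress_resolves_to` and the injectable `resolver` parameter are hypothetical and exist only for illustration. It checks that a name under `*.ingress` resolves and that every returned address is one you expect.

```python
import socket
from typing import Callable, Iterable, List


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the IPv4 addresses the system resolver reports for hostname."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return addresses


def ingress_resolves_to(
    hostname: str,
    expected_ips: Iterable[str],
    resolver: Callable[[str], List[str]] = resolve_ipv4,
) -> bool:
    """True if hostname resolves and every address is in expected_ips."""
    try:
        addresses = resolver(hostname)
    except OSError:
        # socket.gaierror is a subclass of OSError: the name did not resolve.
        return False
    return bool(addresses) and set(addresses) <= set(expected_ips)
```

With the values from the example above, the call would be `ingress_resolves_to("anything.ingress.akash.pro", ["65.108.6.185"])`. Substitute your own domain and worker node IP.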

3. Please make sure your Akash provider doesn't block any Akash specific ports.
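One rough way to spot blocked ports is to attempt a TCP connection to each of them from a machine outside the provider network. The Python sketch below assumes exactly that. The helper names `port_is_open` and `blocked_ports` are hypothetical. The only port that appears in this thread is 8443 (the provider status endpoint), so take the full list of Akash ports from the provider documentation rather than from this example.

```python
import socket
from typing import Iterable, List


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or the name did not resolve.
        return False


def blocked_ports(host: str, ports: Iterable[int], timeout: float = 3.0) -> List[int]:
    """Return the ports from `ports` that did not accept a TCP connection."""
    return [port for port in ports if not port_is_open(host, port, timeout)]
```

For example, `blocked_ports("provider.<yourdomain>", [8443])` returns an empty list when the status port is reachable. A successful connection only shows that something is listening on the port, not that the provider service behind it is healthy.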

Audit Steps:

1. Title the issue: "[Provider Audit]: Provider Address" (e.g. "[Provider Audit]: provider.europlots.com")

2. Wait for a response via comments. If no issues are found during the provider audit, the process is complete: the provider should start bidding on leases, and the audit ticket will be closed.

3. If issues are found during the provider audit, debug and fix them so that the audit can be completed.

4. The audit issue will be closed by a core team member.

Leave contact information (optional)

  1. Name - Jake
  2. Discord handle or Telegram handle - Stanton2495
  3. Contact email address - jakeminer2021@gmail.com

shimpa1 commented 2 months ago

Good morning,

Other remarks:

sda           8:0    0 111.8G  0 disk
├─sda1        8:1    0     1G  0 part
└─sda2        8:2    0 110.7G  0 part
nvme1n1     259:0    0 953.9G  0 disk
└─md0         9:0    0   1.4T  0 raid0 /etc/resolv.conf
                                       /etc/hostname
                                       /dev/termination-log
                                       /etc/hosts
nvme0n1     259:1    0 465.8G  0 disk
└─md0         9:0    0   1.4T  0 raid0 /etc/resolv.conf
                                       /etc/hostname
                                       /dev/termination-log
                                       /etc/hosts

This is bad practice on multiple levels:

1. RAID0 generally decreases stability by adding yet another SPOF.
2. RAID0 needs equally sized drives to work properly.
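The single-point-of-failure concern can be made concrete with a little arithmetic. RAID0 has no redundancy, so the array is lost as soon as any one member drive fails. The Python sketch below assumes drives fail independently. The 2% per-drive figure is purely hypothetical, and `raid0_failure_probability` is an illustrative name.

```python
from math import prod
from typing import Iterable


def raid0_failure_probability(drive_failure_probs: Iterable[float]) -> float:
    """Probability that a RAID0 array is lost in a given period.

    The array survives only if every member drive survives, so
    P(array lost) = 1 - product(1 - p_i), assuming independent failures.
    """
    return 1.0 - prod(1.0 - p for p in drive_failure_probs)


# Hypothetical example: two drives, each with a 2% chance of failing
# in the period of interest (an assumed figure, not a measurement).
two_drive_risk = raid0_failure_probability([0.02, 0.02])
```

Under these assumptions the two-drive array is lost with probability 1 - 0.98 × 0.98 = 0.0396, roughly double the risk of a single drive.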

$ curl -sk https://provider.gpu3090.ddns.net:8443/status | jq
{
  "cluster": {
    "leases": 3,
    "inventory": {
      "active": [
        { "cpu": 100, "gpu": 0, "memory": 100663296, "storage_ephemeral": 68157440 },
        { "cpu": 100, "gpu": 0, "memory": 268435456, "storage_ephemeral": 268435456 },
        { "cpu": 100, "gpu": 0, "memory": 100663296, "storage_ephemeral": 6291456 }
      ],
      "available": {
        "nodes": [
          {
            "name": "node1",
            "allocatable": { "cpu": 4000, "gpu": 0, "memory": 12308656128, "storage_ephemeral": 224812917593 },
            "available": { "cpu": 3080, "gpu": 0, "memory": 12087369728, "storage_ephemeral": 224812917593 }
          },
          {
            "name": "node2",
            "allocatable": { "cpu": 8000, "gpu": 1, "memory": 8126627840, "storage_ephemeral": 1349073758432 },
            "available": { "cpu": 5195, "gpu": 0, "memory": 3229161472, "storage_ephemeral": 1348194003168 }
          }
        ]
      }
    }
  },
  "bidengine": { "orders": 0 },
  "manifest": { "deployments": 0 },
  "cluster_public_hostname": "provider.gpu3090.ddns.net",
  "address": "akash16q46mn8tm7vtwn4h9rugr8ptch6shp2qda7y89"
}

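In the status output, node2 reports 1 allocatable GPU but 0 available, even though none of the active leases uses a GPU. This suggests that there is an issue with the NVIDIA drivers, the NVIDIA toolkit, or the K8s plugin. Please refer to https://akash.network/docs/providers/build-a-cloud-provider/gpu-resource-enablement/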
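The same reading of the status output can be automated. The Python sketch below assumes the JSON layout shown in this thread, which may differ between provider versions. The function name `unexplained_gpu_shortfall` is hypothetical, and the sample is trimmed to the GPU fields only. It flags nodes whose allocatable GPU count exceeds the available count when the active leases do not account for the difference.

```python
import json
from typing import Dict

# Trimmed copy of the /status output above, keeping only the GPU fields.
STATUS_SAMPLE = """
{
  "cluster": {
    "inventory": {
      "active": [{"gpu": 0}, {"gpu": 0}, {"gpu": 0}],
      "available": {
        "nodes": [
          {"name": "node1", "allocatable": {"gpu": 0}, "available": {"gpu": 0}},
          {"name": "node2", "allocatable": {"gpu": 1}, "available": {"gpu": 0}}
        ]
      }
    }
  }
}
"""


def unexplained_gpu_shortfall(status: dict) -> Dict[str, int]:
    """Map node name -> missing GPUs, if active leases do not explain them."""
    inventory = status["cluster"]["inventory"]
    leased = sum(lease.get("gpu", 0) for lease in inventory.get("active", []))
    shortfall = {}
    for node in inventory["available"]["nodes"]:
        missing = node["allocatable"]["gpu"] - node["available"]["gpu"]
        if missing > 0:
            shortfall[node["name"]] = missing
    # If active leases account for every missing GPU, nothing is unexplained.
    if sum(shortfall.values()) <= leased:
        return {}
    return shortfall


print(unexplained_gpu_shortfall(json.loads(STATUS_SAMPLE)))  # {'node2': 1}
```

A non-empty result is only a hint to look at the driver, toolkit, and plugin setup. It does not identify which of them is at fault.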

Please fix these issues and we can move on.

Shimpa

jakestanton2016 commented 2 months ago

Thanks… working on fixes

jakestanton2016 commented 1 month ago

I have made the fixes suggested. Lost on-time performance due to downtime. Please re-evaluate. Thank you.

shimpa1 commented 1 month ago

The provider is not responding to orders that include a GPU, even though a GPU is present in the inventory and available.

Requested resources:

  1 GPU nVidia rtx3090
  1 CPU
  2 Gi RAM
  2 Gi storage

Please make sure your provider is fully functional.

thanks. Shimpa

andy108369 commented 1 month ago

The provider is still offline, and it has been quite a long time. Please feel free to reopen if needed.