Closed andy108369 closed 3 months ago
$ kubectl -n akash-services logs deployment/operator-inventory | grep -v 'MODIFIED monitoring CephCluster'
I[2024-03-19|15:40:12.454] using in cluster kube config cmp=provider
INFO nodes.nodes waiting for nodes to finish
INFO watcher.storageclasses started
INFO rest listening on ":8080"
INFO grpc listening on ":8081"
INFO watcher.config started
INFO nodes.node.monitor starting {"node": "node10"}
INFO nodes.node.monitor starting {"node": "node11"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node1"}
INFO nodes.node.monitor starting {"node": "node12"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node12"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node11"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node10"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node13"}
INFO nodes.node.monitor starting {"node": "node1"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node14"}
INFO nodes.node.monitor starting {"node": "node14"}
INFO nodes.node.monitor starting {"node": "node13"}
INFO nodes.node.monitor starting {"node": "node15"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node15"}
INFO nodes.node.monitor starting {"node": "node2"}
INFO nodes.node.monitor starting {"node": "node16"}
INFO nodes.node.monitor starting {"node": "node3"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node2"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node16"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node3"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node4"}
INFO nodes.node.monitor starting {"node": "node5"}
INFO nodes.node.monitor starting {"node": "node4"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node5"}
INFO nodes.node.monitor starting {"node": "node6"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node6"}
INFO nodes.node.monitor starting {"node": "node7"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node7"}
INFO nodes.node.monitor starting {"node": "node8"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node8"}
INFO nodes.node.monitor starting {"node": "node9"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node9"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node6"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node7"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node8"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node4"}
INFO nodes.node.monitor started {"node": "node6"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node1"}
INFO nodes.node.monitor started {"node": "node8"}
INFO nodes.node.monitor started {"node": "node7"}
INFO nodes.node.monitor started {"node": "node4"}
INFO nodes.node.monitor started {"node": "node1"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node10"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node2"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node9"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node14"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node11"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node3"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node5"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node15"}
INFO nodes.node.monitor started {"node": "node2"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node12"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node13"}
INFO nodes.node.monitor started {"node": "node11"}
INFO nodes.node.monitor started {"node": "node14"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node16"}
INFO nodes.node.monitor started {"node": "node5"}
INFO nodes.node.monitor started {"node": "node13"}
INFO nodes.node.monitor started {"node": "node16"}
INFO nodes.node.monitor started {"node": "node9"}
INFO nodes.node.monitor started {"node": "node15"}
INFO nodes.node.monitor started {"node": "node12"}
INFO nodes.node.monitor started {"node": "node3"}
INFO nodes.node.monitor started {"node": "node10"}
ERROR watcher.registry couldn't query pci.ids {"error": "Get \"\": unsupported protocol scheme \"\""}
INFO nodes.node.monitor successfully applied labels and/or annotations patches for node "node1" {"labels": {}}
INFO nodes.node.monitor successfully applied labels and/or annotations patches for node "node1" {"labels": {"akash.network":"true","akash.network/capabilities.gpu.vendor.nvidia.model.h100":"8","akash.network/capabilities.gpu.vendor.nvidia.model.h100.interface.PCIe":"8","akash.network/capabilities.gpu.vendor.nvidia.model.h100.ram.80Gi":"8"}}
ERROR watcher.registry couldn't query inventory registry {"error": "Get \"https://provider-configs.akash.network/devices/gpus\": dial tcp [2606:4700:130:436c:6f75:6466:6c61:7265]:443: connect: network is unreachable"}
$ kubectl rollout restart deployment/operator-inventory -n akash-services
deployment.apps/operator-inventory restarted
$ kubectl -n akash-services logs deployment/operator-inventory | grep -v 'MODIFIED monitoring CephCluster'
I[2024-03-24|12:12:03.135] using in cluster kube config cmp=provider
INFO nodes.nodes waiting for nodes to finish
INFO rest listening on ":8080"
INFO watcher.storageclasses started
INFO grpc listening on ":8081"
INFO watcher.config started
INFO nodes.node.discovery starting hardware discovery pod {"node": "node12"}
INFO nodes.node.monitor starting {"node": "node1"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node11"}
INFO nodes.node.monitor starting {"node": "node10"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node14"}
INFO nodes.node.monitor starting {"node": "node12"}
INFO nodes.node.monitor starting {"node": "node14"}
INFO nodes.node.monitor starting {"node": "node16"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node16"}
INFO nodes.node.monitor starting {"node": "node2"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node1"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node10"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node13"}
INFO nodes.node.monitor starting {"node": "node11"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node2"}
INFO nodes.node.monitor starting {"node": "node15"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node15"}
INFO nodes.node.monitor starting {"node": "node13"}
INFO nodes.node.monitor starting {"node": "node3"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node3"}
INFO nodes.node.monitor starting {"node": "node5"}
INFO nodes.node.monitor starting {"node": "node4"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node5"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node4"}
INFO nodes.node.monitor starting {"node": "node6"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node6"}
INFO nodes.node.monitor starting {"node": "node7"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node7"}
INFO nodes.node.monitor starting {"node": "node8"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node8"}
INFO nodes.node.monitor starting {"node": "node9"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node9"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node14"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node10"}
ERROR nodes.node.monitor unable to query cpu {"error": "error trying to reach service: dial tcp 10.233.91.226:8081: connect: connection refused"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node13"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node11"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node16"}
INFO nodes.node.monitor started {"node": "node10"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node5"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node15"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node12"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node4"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node6"}
INFO nodes.node.monitor started {"node": "node15"}
INFO nodes.node.monitor started {"node": "node12"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node9"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node8"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node1"}
INFO nodes.node.monitor started {"node": "node13"}
INFO nodes.node.monitor started {"node": "node4"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node3"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node7"}
INFO nodes.node.monitor started {"node": "node9"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node2"}
INFO nodes.node.monitor started {"node": "node8"}
INFO nodes.node.monitor started {"node": "node1"}
INFO nodes.node.monitor started {"node": "node7"}
INFO nodes.node.monitor started {"node": "node2"}
INFO nodes.node.monitor started {"node": "node5"}
INFO nodes.node.monitor started {"node": "node14"}
INFO nodes.node.monitor started {"node": "node11"}
INFO nodes.node.monitor started {"node": "node16"}
INFO nodes.node.monitor started {"node": "node6"}
INFO nodes.node.monitor started {"node": "node3"}
$ provider_info2.sh provider.mon.obl.akash.pub
PROVIDER INFO
"hostname" "address"
"provider.mon.obl.akash.pub" "akash1g7az2pus6atgeufgttlcnl0wzlzwd0lrsy6d7s"
Total/Allocatable/Used (t/a/u) per node:
"name" "cpu(t/a/u)" "gpu(t/a/u)" "mem(t/a/u GiB)" "ephemeral(t/a/u GiB)"
"node1" "252/242/10" "8/3/5" "1417.21/1365.43/51.78" "5756.74/5656.74/100"
"node10" "128/115.525/12.475" "8/8/0" "472.06/469.75/2.31" "1482.67/1478.67/4"
"node11" "128/114.525/13.475" "8/8/0" "472.06/469/3.06" "1482.67/1477.92/4.75"
"node12" "128/113.425/14.575" "8/8/0" "472.06/469.25/2.81" "1482.67/1478.17/4.5"
"node13" "128/57.525/70.475" "8/8/0" "472.06/456/16.06" "1482.67/1480.66/2.01"
"node14" "128/115.025/12.975" "8/8/0" "472.06/469.25/2.81" "1482.67/1478.17/4.5"
"node15" "128/84.495/43.505" "8/7/1" "472.06/437.79/34.27" "1482.67/1357.91/124.76"
"node16" "128/121.525/6.475" "8/8/0" "472.06/470.75/1.31" "1482.67/1480.67/2"
"node2" "252/182.98/69.02" "8/7/1" "1417.21/1372.37/44.85" "5756.74/5636.73/120.01"
"node3" "252/187.525/64.475" "8/0/8" "1417.21/776.9/640.31" "5756.74/3708.74/2048"
"node4" "252/2.125/249.875" "8/0/8" "1417.21/112.22/1305" "5756.74/3697.99/2058.75"
"node5" "252/121.525/130.475" "8/0/8" "1417.21/134.9/1282.31" "5756.74/3706.74/2050"
"node6" "252/118.525/133.475" "8/0/8" "1417.21/1328.4/88.81" "5756.74/4722.24/1034.5"
"node7" "252/63.625/188.375" "8/0/8" "1417.21/751.97/665.25" "5756.74/3698.74/2058"
"node8" "252/123.625/128.375" "8/0/8" "1417.21/136.97/1280.25" "5756.74/3708.74/2048"
"node9" "128/127.125/0.875" "8/8/0" "472.06/471.57/0.5" "1482.67/1482.67/0"
ACTIVE TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
613.1 6 197.2 349.77 0 0 0
PERSISTENT STORAGE:
"storage class" "available space(GiB)"
PENDING TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
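One way to cross-check the report above is to parse each total/allocatable/used field and sum the "used" part per resource. The sketch below does this for the GPU column only, with the values copied from the report. The names `parse_tau`, `is_consistent`, and `gpu_by_node` are hypothetical, and comparing the sum with the ACTIVE TOTAL figure rests on the assumption that the two are expected to match.

```python
from typing import NamedTuple


class Usage(NamedTuple):
    total: float
    allocatable: float
    used: float


def parse_tau(field: str) -> Usage:
    """Parse a "total/allocatable/used" field such as "8/3/5"."""
    total, allocatable, used = (float(part) for part in field.split("/"))
    return Usage(total, allocatable, used)


def is_consistent(usage: Usage, tolerance: float = 0.05) -> bool:
    """Check that total equals allocatable + used, allowing for rounding."""
    return abs(usage.total - (usage.allocatable + usage.used)) <= tolerance


# GPU column (t/a/u) per node, copied from the provider report.
gpu_by_node = {
    "node1": "8/3/5", "node10": "8/8/0", "node11": "8/8/0", "node12": "8/8/0",
    "node13": "8/8/0", "node14": "8/8/0", "node15": "8/7/1", "node16": "8/8/0",
    "node2": "8/7/1", "node3": "8/0/8", "node4": "8/0/8", "node5": "8/0/8",
    "node6": "8/0/8", "node7": "8/0/8", "node8": "8/0/8", "node9": "8/8/0",
}

# Sum of the "used" part across all nodes.
gpu_used = sum(parse_tau(field).used for field in gpu_by_node.values())

# GPU value from the ACTIVE TOTAL line of the same report.
gpu_active_total = 6

print(f"GPUs marked used per node: {gpu_used:g}, ACTIVE TOTAL gpu: {gpu_active_total}")
```

With the values from this report, the per-node "used" sum is far larger than the ACTIVE TOTAL GPU count, which is the discrepancy this issue describes.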
The ERROR "dial tcp 10.233.91.226:8081: connect: connection refused" is misleading, because that endpoint is reachable.
Most likely deployment/operator-inventory did not wait long enough for the operator-inventory-hardware-discovery-node10 pod to start listening before querying it.
$ kubectl get pods -A -o wide | grep 10.233.91.226
akash-services operator-inventory-hardware-discovery-node10 1/1 Running 0 3m20s 10.233.91.226 node10 <none> <none>
$ kubectl -n akash-services logs operator-inventory-hardware-discovery-node10
listening on :8081
The endpoint is also reachable from inside the operator-inventory pod:
$ kubectl exec -ti deployment/operator-inventory -n akash-services -- bash
root@operator-inventory-65d8bd7b-vdsdf:/# curl 10.233.91.226:8081
{"errors":null,"cpu":{"total_cores":128,"total_threads":128,"processors":[{"id":0,"total_cores":64,"total_threads":64,"vendor":"AuthenticAMD","model":"AMD EPYC 7763 64-Core Processor","capabilities":["fpu","vme","de","pse","tsc","msr","pae","mce","cx8","apic","sep","mtrr","pge","mca","cmov","pat","pse36","clflush","mmx","fxsr","sse","sse2","ht","syscall","nx","mmxext","fxsr_opt","pdpe1gb","rdtscp","lm","rep_good","nopl","cpuid","extd_apicid","tsc_known_freq","pni","pclmulqdq","ssse3","fma","cx16","pcid","sse4_1","sse4_2","x2apic","movbe","popcnt","tsc_deadline_timer","aes","xsave","avx","f16c","rdrand","hypervisor","lahf_lm","cmp_legacy","svm","cr8_legacy","abm","sse4a","misalignsse","3dnowprefetch","osvw","perfctr_core","invpcid_single","ssbd","ibrs","ibpb","stibp","vmmcall","fsgsbase","tsc_adjust","bmi1","avx2","smep","bmi2","erms","invpcid","rdseed","adx","smap","clflushopt","clwb","sha_ni","xsaveopt","xsavec","xgetbv1","xsaves","clzero","xsaveerptr","wbnoinvd","arat","npt","nrip_save","umip","pku","ospke","vaes","vpclmulqdq","rdpid","fsrm","arch_capabilities"],"cores":[{"id":0,"total_threads":1,"logical_processors":[0]},{"id":1,"total_threads":1,"logical_processors":[1]},{"id":10,"total_threads":1,"logical_processors":[10]},{"id":11,"total_threads":1,"logical_processors":[11]},{"id":12,"total_threads":1,"logical_processors":[12]},{"id":13,"total_threads":1,"logical_processors":[13]},{"id":14,"total_threads":1,"logical_processors":[14]},{"id":15,"total_threads":1,"logical_processors":[15]},{"id":16,"total_threads":1,"logical_processors":[16]},{"id":17,"total_threads":1,"logical_processors":[17]},{"id":18,"total_threads":1,"logical_processors":[18]},{"id":19,"total_threads":1,"logical_processors":[19]},{"id":2,"total_threads":1,"logical_processors":[2]},{"id":20,"total_threads":1,"logical_processors":[20]},{"id":21,"total_threads":1,"logical_processors":[21]},{"id":22,"total_threads":1,"logical_processors":[22]},{"id":23,"total_threads":1,"logical_processors":[23]
},{"id":24,"total_threads":1,"logical_processors":[24]},{"id":25,"total_threads":1,"logical_processors":[25]},{"id":26,"total_threads":1,"logical_processors":[26]},{"id":27,"total_threads":1,"logical_processors":[27]},{"id":28,"total_threads":1,"logical_processors":[28]},{"id":29,"total_threads":1,"logical_processors":[29]},{"id":3,"total_threads":1,"logical_processors":[3]},{"id":30,"total_threads":1,"logical_processors":[30]},{"id":31,"total_threads":1,"logical_processors":[31]},{"id":32,"total_threads":1,"logical_processors":[32]},{"id":33,"total_threads":1,"logical_processors":[33]},{"id":34,"total_threads":1,"logical_processors":[34]},{"id":35,"total_threads":1,"logical_processors":[35]},{"id":36,"total_threads":1,"logical_processors":[36]},{"id":37,"total_threads":1,"logical_processors":[37]},{"id":38,"total_threads":1,"logical_processors":[38]},{"id":39,"total_threads":1,"logical_processors":[39]},{"id":4,"total_threads":1,"logical_processors":[4]},{"id":40,"total_threads":1,"logical_processors":[40]},{"id":41,"total_threads":1,"logical_processors":[41]},{"id":42,"total_threads":1,"logical_processors":[42]},{"id":43,"total_threads":1,"logical_processors":[43]},{"id":44,"total_threads":1,"logical_processors":[44]},{"id":45,"total_threads":1,"logical_processors":[45]},{"id":46,"total_threads":1,"logical_processors":[46]},{"id":47,"total_threads":1,"logical_processors":[47]},{"id":48,"total_threads":1,"logical_processors":[48]},{"id":49,"total_threads":1,"logical_processors":[49]},{"id":5,"total_threads":1,"logical_processors":[5]},{"id":50,"total_threads":1,"logical_processors":[50]},{"id":51,"total_threads":1,"logical_processors":[51]},{"id":52,"total_threads":1,"logical_processors":[52]},{"id":53,"total_threads":1,"logical_processors":[53]},{"id":54,"total_threads":1,"logical_processors":[54]},{"id":55,"total_threads":1,"logical_processors":[55]},{"id":56,"total_threads":1,"logical_processors":[56]},{"id":57,"total_threads":1,"logical_processors":[57]},{"id":
58,"total_threads":1,"logical_processors":[58]},{"id":59,"total_threads":1,"logical_processors":[59]},{"id":6,"total_threads":1,"logical_processors":[6]},{"id":60,"total_threads":1,"logical_processors":[60]},{"id":61,"total_threads":1,"logical_processors":[61]},{"id":62,"total_threads":1,"logical_processors":[62]},{"id":63,"total_threads":1,"logical_processors":[63]},{"id":7,"total_threads":1,"logical_processors":[7]},{"id":8,"total_threads":1,"logical_processors":[8]},{"id":9,"total_threads":1,"logical_processors":[9]}]},{"id":1,"total_cores":64,"total_threads":64,"vendor":"AuthenticAMD","model":"AMD EPYC 7763 64-Core Processor","capabilities":["fpu","vme","de","pse","tsc","msr","pae","mce","cx8","apic","sep","mtrr","pge","mca","cmov","pat","pse36","clflush","mmx","fxsr","sse","sse2","ht","syscall","nx","mmxext","fxsr_opt","pdpe1gb","rdtscp","lm","rep_good","nopl","cpuid","extd_apicid","tsc_known_freq","pni","pclmulqdq","ssse3","fma","cx16","pcid","sse4_1","sse4_2","x2apic","movbe","popcnt","tsc_deadline_timer","aes","xsave","avx","f16c","rdrand","hypervisor","lahf_lm","cmp_legacy","svm","cr8_legacy","abm","sse4a","misalignsse","3dnowprefetch","osvw","perfctr_core","invpcid_single","ssbd","ibrs","ibpb","stibp","vmmcall","fsgsbase","tsc_adjust","bmi1","avx2","smep","bmi2","erms","invpcid","rdseed","adx","smap","clflushopt","clwb","sha_ni","xsaveopt","xsavec","xgetbv1","xsaves","clzero","xsaveerptr","wbnoinvd","arat","npt","nrip_save","umip","pku","ospke","vaes","vpclmulqdq","rdpid","fsrm","arch_capabilities"],"cores":[{"id":36,"total_threads":1,"logical_processors":[100]},{"id":37,"total_threads":1,"logical_processors":[101]},{"id":38,"total_threads":1,"logical_processors":[102]},{"id":39,"total_threads":1,"logical_processors":[103]},{"id":40,"total_threads":1,"logical_processors":[104]},{"id":41,"total_threads":1,"logical_processors":[105]},{"id":42,"total_threads":1,"logical_processors":[106]},{"id":43,"total_threads":1,"logical_processors":[107]},{"id":44,"total_
threads":1,"logical_processors":[108]},{"id":45,"total_threads":1,"logical_processors":[109]},{"id":46,"total_threads":1,"logical_processors":[110]},{"id":47,"total_threads":1,"logical_processors":[111]},{"id":48,"total_threads":1,"logical_processors":[112]},{"id":49,"total_threads":1,"logical_processors":[113]},{"id":50,"total_threads":1,"logical_processors":[114]},{"id":51,"total_threads":1,"logical_processors":[115]},{"id":52,"total_threads":1,"logical_processors":[116]},{"id":53,"total_threads":1,"logical_processors":[117]},{"id":54,"total_threads":1,"logical_processors":[118]},{"id":55,"total_threads":1,"logical_processors":[119]},{"id":56,"total_threads":1,"logical_processors":[120]},{"id":57,"total_threads":1,"logical_processors":[121]},{"id":58,"total_threads":1,"logical_processors":[122]},{"id":59,"total_threads":1,"logical_processors":[123]},{"id":60,"total_threads":1,"logical_processors":[124]},{"id":61,"total_threads":1,"logical_processors":[125]},{"id":62,"total_threads":1,"logical_processors":[126]},{"id":63,"total_threads":1,"logical_processors":[127]},{"id":0,"total_threads":1,"logical_processors":[64]},{"id":1,"total_threads":1,"logical_processors":[65]},{"id":2,"total_threads":1,"logical_processors":[66]},{"id":3,"total_threads":1,"logical_processors":[67]},{"id":4,"total_threads":1,"logical_processors":[68]},{"id":5,"total_threads":1,"logical_processors":[69]},{"id":6,"total_threads":1,"logical_processors":[70]},{"id":7,"total_threads":1,"logical_processors":[71]},{"id":8,"total_threads":1,"logical_processors":[72]},{"id":9,"total_threads":1,"logical_processors":[73]},{"id":10,"total_threads":1,"logical_processors":[74]},{"id":11,"total_threads":1,"logical_processors":[75]},{"id":12,"total_threads":1,"logical_processors":[76]},{"id":13,"total_threads":1,"logical_processors":[77]},{"id":14,"total_threads":1,"logical_processors":[78]},{"id":15,"total_threads":1,"logical_processors":[79]},{"id":16,"total_threads":1,"logical_processors":[80]},{"id":17
,"total_threads":1,"logical_processors":[81]},{"id":18,"total_threads":1,"logical_processors":[82]},{"id":19,"total_threads":1,"logical_processors":[83]},{"id":20,"total_threads":1,"logical_processors":[84]},{"id":21,"total_threads":1,"logical_processors":[85]},{"id":22,"total_threads":1,"logical_processors":[86]},{"id":23,"total_threads":1,"logical_processors":[87]},{"id":24,"total_threads":1,"logical_processors":[88]},{"id":25,"total_threads":1,"logical_processors":[89]},{"id":26,"total_threads":1,"logical_processors":[90]},{"id":27,"total_threads":1,"logical_processors":[91]},{"id":28,"total_threads":1,"logical_processors":[92]},{"id":29,"total_threads":1,"logical_processors":[93]},{"id":30,"total_threads":1,"logical_processors":[94]},{"id":31,"total_threads":1,"logical_processors":[95]},{"id":32,"total_threads":1,"logical_processors":[96]},{"id":33,"total_threads":1,"logical_processors":[97]},{"id":34,"total_threads":1,"logical_processors":[98]},{"id":35,"total_threads":1,"logical_processors":[99]}]}]},"memory":{"total_physical_bytes":515396075520,"total_usable_bytes":506978603008,"supported_page_sizes":[1073741824,2097152],"modules":null},"gpu":{"cards":[{"address":"0000:00:02.0","index":0,"pci":{"driver":"cirrus","address":"0000:00:02.0","vendor":{"id":"1013","name":"Cirrus Logic"},"product":{"id":"00b8","name":"GD 5446"},"revision":"0x00","subsystem":{"id":"1100","name":"QEMU Virtual Machine"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}},{"address":"0000:00:05.0","index":1,"pci":{"driver":"nvidia","address":"0000:00:05.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA 
controller"}}},{"address":"0000:00:06.0","index":2,"pci":{"driver":"nvidia","address":"0000:00:06.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}},{"address":"0000:00:07.0","index":3,"pci":{"driver":"nvidia","address":"0000:00:07.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}},{"address":"0000:00:08.0","index":4,"pci":{"driver":"nvidia","address":"0000:00:08.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}},{"address":"0000:00:09.0","index":5,"pci":{"driver":"nvidia","address":"0000:00:09.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}},{"address":"0000:00:0a.0","index":6,"pci":{"driver":"nvidia","address":"0000:00:0a.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display 
controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}},{"address":"0000:00:0b.0","index":7,"pci":{"driver":"nvidia","address":"0000:00:0b.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}},{"address":"0000:00:0c.0","index":8,"pci":{"driver":"nvidia","address":"0000:00:0c.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}}]},"pci":{"Devices":[{"driver":"","address":"0000:00:00.0","vendor":{"id":"8086","name":"Intel Corporation"},"product":{"id":"1237","name":"440FX - 82441FX PMC [Natoma]"},"revision":"0x02","subsystem":{"id":"1100","name":"Qemu virtual machine"},"class":{"id":"06","name":"Bridge"},"subclass":{"id":"00","name":"Host bridge"},"programming_interface":{"id":"00","name":"unknown"}},{"driver":"","address":"0000:00:01.0","vendor":{"id":"8086","name":"Intel Corporation"},"product":{"id":"7000","name":"82371SB PIIX3 ISA [Natoma/Triton II]"},"revision":"0x00","subsystem":{"id":"1100","name":"Qemu virtual machine"},"class":{"id":"06","name":"Bridge"},"subclass":{"id":"01","name":"ISA bridge"},"programming_interface":{"id":"00","name":"unknown"}},{"driver":"ata_piix","address":"0000:00:01.1","vendor":{"id":"8086","name":"Intel Corporation"},"product":{"id":"7010","name":"82371SB PIIX3 IDE [Natoma/Triton II]"},"revision":"0x00","subsystem":{"id":"1100","name":"Qemu virtual machine"},"class":{"id":"01","name":"Mass storage 
controller"},"subclass":{"id":"01","name":"IDE interface"},"programming_interface":{"id":"80","name":"ISA Compatibility mode-only controller, supports bus mastering"}},{"driver":"uhci_hcd","address":"0000:00:01.2","vendor":{"id":"8086","name":"Intel Corporation"},"product":{"id":"7020","name":"82371SB PIIX3 USB [Natoma/Triton II]"},"revision":"0x01","subsystem":{"id":"1100","name":"QEMU Virtual Machine"},"class":{"id":"0c","name":"Serial bus controller"},"subclass":{"id":"03","name":"USB controller"},"programming_interface":{"id":"00","name":"UHCI"}},{"driver":"piix4_smbus","address":"0000:00:01.3","vendor":{"id":"8086","name":"Intel Corporation"},"product":{"id":"7113","name":"82371AB/EB/MB PIIX4 ACPI"},"revision":"0x03","subsystem":{"id":"1100","name":"Qemu virtual machine"},"class":{"id":"06","name":"Bridge"},"subclass":{"id":"80","name":"Bridge"},"programming_interface":{"id":"00","name":"unknown"}},{"driver":"cirrus","address":"0000:00:02.0","vendor":{"id":"1013","name":"Cirrus Logic"},"product":{"id":"00b8","name":"GD 5446"},"revision":"0x00","subsystem":{"id":"1100","name":"QEMU Virtual Machine"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"virtio-pci","address":"0000:00:03.0","vendor":{"id":"1af4","name":"Red Hat, Inc."},"product":{"id":"1000","name":"Virtio network device"},"revision":"0x00","subsystem":{"id":"0001","name":"unknown"},"class":{"id":"02","name":"Network controller"},"subclass":{"id":"00","name":"Ethernet controller"},"programming_interface":{"id":"00","name":"unknown"}},{"driver":"virtio-pci","address":"0000:00:04.0","vendor":{"id":"1af4","name":"Red Hat, Inc."},"product":{"id":"1001","name":"Virtio block device"},"revision":"0x00","subsystem":{"id":"0002","name":"unknown"},"class":{"id":"01","name":"Mass storage controller"},"subclass":{"id":"00","name":"SCSI storage 
controller"},"programming_interface":{"id":"00","name":"unknown"}},{"driver":"nvidia","address":"0000:00:05.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"nvidia","address":"0000:00:06.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"nvidia","address":"0000:00:07.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"nvidia","address":"0000:00:08.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"nvidia","address":"0000:00:09.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA 
controller"}},{"driver":"nvidia","address":"0000:00:0a.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"nvidia","address":"0000:00:0b.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"nvidia","address":"0000:00:0c.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"virtio-pci","address":"0000:00:0d.0","vendor":{"id":"1af4","name":"Red Hat, Inc."},"product":{"id":"1002","name":"Virtio memory balloon"},"revision":"0x00","subsystem":{"id":"0005","name":"unknown"},"class":{"id":"00","name":"Unclassified device"},"subclass":{"id":"ff","name":"unknown"},"programming_interface":{"id":"00","name":"unknown"}},{"driver":"virtio-pci","address":"0000:00:0e.0","vendor":{"id":"1af4","name":"Red Hat, Inc."},"product":{"id":"1005","name":"Virtio RNG"},"revision":"0x00","subsystem":{"id":"0004","name":"unknown"},"class":{"id":"00","name":"Unclassified device"},"subclass":{"id":"ff","name":"unknown"},"programming_interface":{"id":"00","name":"unknown"}}]}}
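If the "connection refused" error really comes from querying the discovery pod before it is listening, a retry loop around the first connection attempt would avoid it. The sketch below illustrates the idea in Python; it is not the operator's actual code, and `wait_for_port` with its `timeout` and `interval` parameters is a hypothetical helper.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Retry a TCP connect until it succeeds or `timeout` seconds have passed.

    Returns True as soon as a connection is accepted, False if the deadline
    passes while the port is still refusing or unreachable.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A refused connection raises ConnectionRefusedError, a subclass of OSError.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example (address taken from the error above):
# ready = wait_for_port("10.233.91.226", 8081)
```

Polling at a fixed interval keeps the sketch short; an exponential backoff would be a reasonable alternative when many discovery pods start at once.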
OK, this time the cause was faulty NVIDIA drivers that had locked up.
The solution was to upgrade the NVIDIA drivers from 550.54.14 to 550.54.15 and reboot the nodes.
The NVIDIA 550.54.15 release notes list the fix: "Fixed a potential corruption when launching kernels on H100 GPUs, which is more likely to occur when the GPU is shared between multiple processes. This may manifest in XID 13 errors such as Graphics Exception: SKEDCHECK11_TOTAL_THREADS. This issue has no user-controllable workaround and is fixable by updating to driver 550.54.15 or higher." 4537349
Refs. https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-54-15/index.html
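To check a fleet of nodes against the fixed driver version, the installed version string can be compared numerically with 550.54.15. The sketch below is a minimal illustration; `parse_version` and `needs_upgrade` are hypothetical helpers, and it assumes dotted numeric version strings on the 550 branch (other driver branches may carry the fix under different version numbers). The installed version can be read, for example, with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
from typing import Tuple

# First driver version that contains the fix, per the release notes quoted above.
FIXED_VERSION = "550.54.15"


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn "550.54.14" into (550, 54, 14) so versions compare numerically."""
    return tuple(int(part) for part in version.strip().split("."))


def needs_upgrade(installed: str, fixed: str = FIXED_VERSION) -> bool:
    """Return True when the installed driver is older than the fixed version."""
    return parse_version(installed) < parse_version(fixed)


for version in ("550.54.14", "550.54.15"):
    print(version, "needs upgrade" if needs_upgrade(version) else "ok")
```

Comparing integer tuples avoids the pitfall of plain string comparison, where "550.54.9" would sort after "550.54.15".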
report looks good now:
Informed the rest of the providers https://discord.com/channels/747885925232672829/1111749248527114322/1221639223640326155
The provider reports more GPUs as used than are actually allocated; see the node7 report below for an example.
provider 0.5.4 akash network 0.32.2
nvidia-device-plugin 0.15.0-rc.2, installed with:
helm install nvdp nvdp/nvidia-device-plugin --namespace nvidia-device-plugin --create-namespace --version 0.15.0-rc.2 --devel --set runtimeClassName="nvidia" --set deviceListStrategy=volume-mounts
node7
Possibly related to: https://github.com/akash-network/support/issues/193 https://github.com/akash-network/support/issues/192