akash-network / support

Akash Support and Issue Tracking

provider misreports the GPU allocation #198

Closed andy108369 closed 3 months ago

andy108369 commented 3 months ago

The provider reports more GPUs as used than are actually allocated; see the node7 report below for an example.

Versions: provider 0.5.4, akash network 0.32.2

nvidia-device-plugin 0.15.0-rc.2 (installed with helm install nvdp nvdp/nvidia-device-plugin --namespace nvidia-device-plugin --create-namespace --version 0.15.0-rc.2 --devel --set runtimeClassName="nvidia" --set deviceListStrategy=volume-mounts)

$ provider_info2.sh provider.mon.obl.akash.pub
PROVIDER INFO
"hostname"                    "address"
"provider.mon.obl.akash.pub"  "akash1g7az2pus6atgeufgttlcnl0wzlzwd0lrsy6d7s"

Total/Allocatable/Used (t/a/u) per node:
"name"    "cpu(t/a/u)"           "gpu(t/a/u)"  "mem(t/a/u GiB)"          "ephemeral(t/a/u GiB)"
"node1"   "252/248/4"            "8/4/4"       "1417.21/1400.43/16.78"   "5756.74/5706.74/50"
"node10"  "128/115.525/12.475"   "8/8/0"       "472.06/469.75/2.31"      "1482.67/1478.67/4"
"node11"  "128/114.525/13.475"   "8/8/0"       "472.06/469/3.06"         "1482.67/1477.92/4.75"
"node12"  "128/113.425/14.575"   "8/8/0"       "472.06/469.25/2.81"      "1482.67/1478.17/4.5"
"node13"  "128/57.525/70.475"    "8/8/0"       "472.06/456/16.06"        "1482.67/1480.66/2.01"
"node14"  "128/115.025/12.975"   "8/8/0"       "472.06/469.25/2.81"      "1482.67/1478.17/4.5"
"node15"  "128/84.495/43.505"    "8/7/1"       "472.06/437.79/34.27"     "1482.67/1357.91/124.76"
"node16"  "128/121.525/6.475"    "8/8/0"       "472.06/470.75/1.31"      "1482.67/1480.67/2"
"node2"   "252/182.98/69.02"     "8/7/1"       "1417.21/1372.37/44.85"   "5756.74/5636.73/120.01"
"node3"   "252/187.525/64.475"   "8/0/8"       "1417.21/776.9/640.31"    "5756.74/3708.74/2048"
"node4"   "252/2.125/249.875"    "8/0/8"       "1417.21/112.22/1305"     "5756.74/3697.99/2058.75"
"node5"   "252/121.525/130.475"  "8/0/8"       "1417.21/134.9/1282.31"   "5756.74/3706.74/2050"
"node6"   "252/118.025/133.975"  "8/0/8"       "1417.21/1328.15/89.06"   "5756.74/4722.24/1034.5"
"node7"   "252/63.625/188.375"   "8/0/8"       "1417.21/751.97/665.25"   "5756.74/3698.74/2058"
"node8"   "252/123.625/128.375"  "8/0/8"       "1417.21/136.97/1280.25"  "5756.74/3708.74/2048"
"node9"   "128/127.625/0.375"    "8/8/0"       "472.06/471.82/0.25"      "1482.67/1482.67/0"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
613.1         6      197.2       349.77            0             0             0

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"

node7

$ kubectl describe node node7
...
Addresses:
  InternalIP:  10.0.1.214
  Hostname:    node7
Capacity:
  cpu:                252
  ephemeral-storage:  6707082984Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1486158788Ki
  nvidia.com/gpu:     8
  pods:               110
Allocatable:
  cpu:                252
  ephemeral-storage:  6181247667821
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1486056388Ki
  nvidia.com/gpu:     8
  pods:               110
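
The describe output above shows capacity and allocatable, but not what the pods on the node actually request. A minimal sketch for counting that from the JSON produced by `kubectl get pods -A -o json` follows. The function name `gpus_requested_by_node` and the sample data are hypothetical, and this is not how the provider itself computes the figure; it only reads the `nvidia.com/gpu` limits of regular containers and ignores init containers.

```python
from collections import defaultdict

GPU_RESOURCE = "nvidia.com/gpu"


def gpus_requested_by_node(pod_list: dict) -> dict:
    """Sum nvidia.com/gpu limits of non-terminated pods, grouped by node name."""
    totals = defaultdict(int)
    for pod in pod_list.get("items", []):
        # Finished pods no longer hold their GPUs.
        if pod.get("status", {}).get("phase") in ("Succeeded", "Failed"):
            continue
        spec = pod.get("spec", {})
        node = spec.get("nodeName")
        if not node:
            continue  # pod is not scheduled onto a node yet
        for container in spec.get("containers", []):
            limits = container.get("resources", {}).get("limits", {})
            totals[node] += int(limits.get(GPU_RESOURCE, 0))
    return dict(totals)


# Hypothetical sample in the shape of `kubectl get pods -A -o json`;
# these are not pods from the cluster in this issue.
sample = {
    "items": [
        {
            "spec": {
                "nodeName": "example-node",
                "containers": [{"resources": {"limits": {GPU_RESOURCE: "2"}}}],
            },
            "status": {"phase": "Running"},
        },
        {
            "spec": {"nodeName": "example-node", "containers": [{"resources": {}}]},
            "status": {"phase": "Running"},
        },
        {
            "spec": {
                "nodeName": "example-node",
                "containers": [{"resources": {"limits": {GPU_RESOURCE: "8"}}}],
            },
            "status": {"phase": "Succeeded"},
        },
    ]
}

result = gpus_requested_by_node(sample)
print(result)  # {'example-node': 2}
```

To run it against a real cluster, the `sample` dict could be replaced with `json.load(sys.stdin)` and the script fed from `kubectl get pods -A -o json`; the per-node sums can then be compared with the used field of the gpu(t/a/u) column.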

Possibly related to: https://github.com/akash-network/support/issues/193 https://github.com/akash-network/support/issues/192
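
A quick cross-check of the first provider_info2.sh report, as a minimal sketch: it sums the used field of the gpu(t/a/u) column and compares the result with the gpu value under ACTIVE TOTAL. The assumption here (mine, not confirmed against the provider code) is that both figures should count the same GPUs. `GPU_TAU` and `used_gpus` are hypothetical names introduced only for this example.

```python
# gpu(t/a/u) values copied from the first provider_info2.sh report,
# in total/allocatable/used order.
GPU_TAU = {
    "node1": "8/4/4", "node10": "8/8/0", "node11": "8/8/0", "node12": "8/8/0",
    "node13": "8/8/0", "node14": "8/8/0", "node15": "8/7/1", "node16": "8/8/0",
    "node2": "8/7/1", "node3": "8/0/8", "node4": "8/0/8", "node5": "8/0/8",
    "node6": "8/0/8", "node7": "8/0/8", "node8": "8/0/8", "node9": "8/8/0",
}

# "gpu" column of ACTIVE TOTAL in the same report.
ACTIVE_TOTAL_GPU = 6


def used_gpus(tau: str) -> int:
    """Return the 'used' field of a total/allocatable/used string."""
    _total, _allocatable, used = (int(part) for part in tau.split("/"))
    return used


total_used = sum(used_gpus(tau) for tau in GPU_TAU.values())
excess = total_used - ACTIVE_TOTAL_GPU

print(f"sum of per-node used GPUs: {total_used}")   # 54
print(f"ACTIVE TOTAL gpu:          {ACTIVE_TOTAL_GPU}")  # 6
print(f"unexplained difference:    {excess}")       # 48
```

With these figures the per-node column adds up to 54 used GPUs against 6 active. node1, node2 and node15 together account for 4 + 1 + 1 = 6, which suggests the 8/0/8 rows on node3 through node8 are the over-reported ones.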

andy108369 commented 3 months ago

operator-inventory logs

$ kubectl -n akash-services logs deployment/operator-inventory | grep -v 'MODIFIED monitoring CephCluster'
I[2024-03-19|15:40:12.454] using in cluster kube config                 cmp=provider
INFO    nodes.nodes waiting for nodes to finish
INFO    watcher.storageclasses  started
INFO    rest listening on ":8080"
INFO    grpc listening on ":8081"
INFO    watcher.config  started
INFO    nodes.node.monitor  starting    {"node": "node10"}
INFO    nodes.node.monitor  starting    {"node": "node11"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node1"}
INFO    nodes.node.monitor  starting    {"node": "node12"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node12"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node11"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node10"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node13"}
INFO    nodes.node.monitor  starting    {"node": "node1"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node14"}
INFO    nodes.node.monitor  starting    {"node": "node14"}
INFO    nodes.node.monitor  starting    {"node": "node13"}
INFO    nodes.node.monitor  starting    {"node": "node15"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node15"}
INFO    nodes.node.monitor  starting    {"node": "node2"}
INFO    nodes.node.monitor  starting    {"node": "node16"}
INFO    nodes.node.monitor  starting    {"node": "node3"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node2"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node16"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node3"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node4"}
INFO    nodes.node.monitor  starting    {"node": "node5"}
INFO    nodes.node.monitor  starting    {"node": "node4"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node5"}
INFO    nodes.node.monitor  starting    {"node": "node6"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node6"}
INFO    nodes.node.monitor  starting    {"node": "node7"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node7"}
INFO    nodes.node.monitor  starting    {"node": "node8"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node8"}
INFO    nodes.node.monitor  starting    {"node": "node9"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node9"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node6"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node7"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node8"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node4"}
INFO    nodes.node.monitor  started {"node": "node6"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node1"}
INFO    nodes.node.monitor  started {"node": "node8"}
INFO    nodes.node.monitor  started {"node": "node7"}
INFO    nodes.node.monitor  started {"node": "node4"}
INFO    nodes.node.monitor  started {"node": "node1"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node10"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node2"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node9"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node14"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node11"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node3"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node5"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node15"}
INFO    nodes.node.monitor  started {"node": "node2"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node12"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node13"}
INFO    nodes.node.monitor  started {"node": "node11"}
INFO    nodes.node.monitor  started {"node": "node14"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node16"}
INFO    nodes.node.monitor  started {"node": "node5"}
INFO    nodes.node.monitor  started {"node": "node13"}
INFO    nodes.node.monitor  started {"node": "node16"}
INFO    nodes.node.monitor  started {"node": "node9"}
INFO    nodes.node.monitor  started {"node": "node15"}
INFO    nodes.node.monitor  started {"node": "node12"}
INFO    nodes.node.monitor  started {"node": "node3"}
INFO    nodes.node.monitor  started {"node": "node10"}
ERROR   watcher.registry    couldn't query pci.ids  {"error": "Get \"\": unsupported protocol scheme \"\""}
INFO    nodes.node.monitor  successfully applied labels and/or annotations patches for node "node1" {"labels": {}}
INFO    nodes.node.monitor  successfully applied labels and/or annotations patches for node "node1" {"labels": {"akash.network":"true","akash.network/capabilities.gpu.vendor.nvidia.model.h100":"8","akash.network/capabilities.gpu.vendor.nvidia.model.h100.interface.PCIe":"8","akash.network/capabilities.gpu.vendor.nvidia.model.h100.ram.80Gi":"8"}}
ERROR   watcher.registry    couldn't query inventory registry   {"error": "Get \"https://provider-configs.akash.network/devices/gpus\": dial tcp [2606:4700:130:436c:6f75:6466:6c61:7265]:443: connect: network is unreachable"}

After restarting the operator-inventory pod:

$ kubectl rollout restart deployment/operator-inventory -n akash-services
deployment.apps/operator-inventory restarted
$ kubectl -n akash-services logs deployment/operator-inventory | grep -v 'MODIFIED monitoring CephCluster'
I[2024-03-24|12:12:03.135] using in cluster kube config                 cmp=provider
INFO    nodes.nodes waiting for nodes to finish
INFO    rest listening on ":8080"
INFO    watcher.storageclasses  started
INFO    grpc listening on ":8081"
INFO    watcher.config  started
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node12"}
INFO    nodes.node.monitor  starting    {"node": "node1"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node11"}
INFO    nodes.node.monitor  starting    {"node": "node10"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node14"}
INFO    nodes.node.monitor  starting    {"node": "node12"}
INFO    nodes.node.monitor  starting    {"node": "node14"}
INFO    nodes.node.monitor  starting    {"node": "node16"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node16"}
INFO    nodes.node.monitor  starting    {"node": "node2"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node1"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node10"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node13"}
INFO    nodes.node.monitor  starting    {"node": "node11"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node2"}
INFO    nodes.node.monitor  starting    {"node": "node15"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node15"}
INFO    nodes.node.monitor  starting    {"node": "node13"}
INFO    nodes.node.monitor  starting    {"node": "node3"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node3"}
INFO    nodes.node.monitor  starting    {"node": "node5"}
INFO    nodes.node.monitor  starting    {"node": "node4"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node5"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node4"}
INFO    nodes.node.monitor  starting    {"node": "node6"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node6"}
INFO    nodes.node.monitor  starting    {"node": "node7"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node7"}
INFO    nodes.node.monitor  starting    {"node": "node8"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node8"}
INFO    nodes.node.monitor  starting    {"node": "node9"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node9"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node14"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node10"}
ERROR   nodes.node.monitor  unable to query cpu {"error": "error trying to reach service: dial tcp 10.233.91.226:8081: connect: connection refused"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node13"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node11"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node16"}
INFO    nodes.node.monitor  started {"node": "node10"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node5"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node15"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node12"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node4"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node6"}
INFO    nodes.node.monitor  started {"node": "node15"}
INFO    nodes.node.monitor  started {"node": "node12"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node9"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node8"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node1"}
INFO    nodes.node.monitor  started {"node": "node13"}
INFO    nodes.node.monitor  started {"node": "node4"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node3"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node7"}
INFO    nodes.node.monitor  started {"node": "node9"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node2"}
INFO    nodes.node.monitor  started {"node": "node8"}
INFO    nodes.node.monitor  started {"node": "node1"}
INFO    nodes.node.monitor  started {"node": "node7"}
INFO    nodes.node.monitor  started {"node": "node2"}
INFO    nodes.node.monitor  started {"node": "node5"}
INFO    nodes.node.monitor  started {"node": "node14"}
INFO    nodes.node.monitor  started {"node": "node11"}
INFO    nodes.node.monitor  started {"node": "node16"}
INFO    nodes.node.monitor  started {"node": "node6"}
INFO    nodes.node.monitor  started {"node": "node3"}
$ provider_info2.sh provider.mon.obl.akash.pub
PROVIDER INFO
"hostname"                    "address"
"provider.mon.obl.akash.pub"  "akash1g7az2pus6atgeufgttlcnl0wzlzwd0lrsy6d7s"

Total/Allocatable/Used (t/a/u) per node:
"name"    "cpu(t/a/u)"           "gpu(t/a/u)"  "mem(t/a/u GiB)"          "ephemeral(t/a/u GiB)"
"node1"   "252/242/10"           "8/3/5"       "1417.21/1365.43/51.78"   "5756.74/5656.74/100"
"node10"  "128/115.525/12.475"   "8/8/0"       "472.06/469.75/2.31"      "1482.67/1478.67/4"
"node11"  "128/114.525/13.475"   "8/8/0"       "472.06/469/3.06"         "1482.67/1477.92/4.75"
"node12"  "128/113.425/14.575"   "8/8/0"       "472.06/469.25/2.81"      "1482.67/1478.17/4.5"
"node13"  "128/57.525/70.475"    "8/8/0"       "472.06/456/16.06"        "1482.67/1480.66/2.01"
"node14"  "128/115.025/12.975"   "8/8/0"       "472.06/469.25/2.81"      "1482.67/1478.17/4.5"
"node15"  "128/84.495/43.505"    "8/7/1"       "472.06/437.79/34.27"     "1482.67/1357.91/124.76"
"node16"  "128/121.525/6.475"    "8/8/0"       "472.06/470.75/1.31"      "1482.67/1480.67/2"
"node2"   "252/182.98/69.02"     "8/7/1"       "1417.21/1372.37/44.85"   "5756.74/5636.73/120.01"
"node3"   "252/187.525/64.475"   "8/0/8"       "1417.21/776.9/640.31"    "5756.74/3708.74/2048"
"node4"   "252/2.125/249.875"    "8/0/8"       "1417.21/112.22/1305"     "5756.74/3697.99/2058.75"
"node5"   "252/121.525/130.475"  "8/0/8"       "1417.21/134.9/1282.31"   "5756.74/3706.74/2050"
"node6"   "252/118.525/133.475"  "8/0/8"       "1417.21/1328.4/88.81"    "5756.74/4722.24/1034.5"
"node7"   "252/63.625/188.375"   "8/0/8"       "1417.21/751.97/665.25"   "5756.74/3698.74/2058"
"node8"   "252/123.625/128.375"  "8/0/8"       "1417.21/136.97/1280.25"  "5756.74/3708.74/2048"
"node9"   "128/127.125/0.875"    "8/8/0"       "472.06/471.57/0.5"       "1482.67/1482.67/0"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
613.1         6      197.2       349.77            0             0             0

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"

Observation 1

The error "dial tcp 10.233.91.226:8081: connect: connection refused" is misleading, because that address is reachable. Most likely deployment/operator-inventory did not wait long enough before querying the operator-inventory-hardware-discovery-node10 pod.

$ kubectl get pods -A -o wide | grep 10.233.91.226
akash-services                                  operator-inventory-hardware-discovery-node10   1/1     Running       0                3m20s   10.233.91.226    node10   <none>           <none>

$ kubectl -n akash-services logs operator-inventory-hardware-discovery-node10
listening on :8081

I can also reach it from inside the operator-inventory pod:

$ kubectl exec -ti deployment/operator-inventory -n akash-services -- bash
root@operator-inventory-65d8bd7b-vdsdf:/# curl 10.233.91.226:8081
{"errors":null,"cpu":{"total_cores":128,"total_threads":128,"processors":[{"id":0,"total_cores":64,"total_threads":64,"vendor":"AuthenticAMD","model":"AMD EPYC 7763 64-Core Processor","capabilities":["fpu","vme","de","pse","tsc","msr","pae","mce","cx8","apic","sep","mtrr","pge","mca","cmov","pat","pse36","clflush","mmx","fxsr","sse","sse2","ht","syscall","nx","mmxext","fxsr_opt","pdpe1gb","rdtscp","lm","rep_good","nopl","cpuid","extd_apicid","tsc_known_freq","pni","pclmulqdq","ssse3","fma","cx16","pcid","sse4_1","sse4_2","x2apic","movbe","popcnt","tsc_deadline_timer","aes","xsave","avx","f16c","rdrand","hypervisor","lahf_lm","cmp_legacy","svm","cr8_legacy","abm","sse4a","misalignsse","3dnowprefetch","osvw","perfctr_core","invpcid_single","ssbd","ibrs","ibpb","stibp","vmmcall","fsgsbase","tsc_adjust","bmi1","avx2","smep","bmi2","erms","invpcid","rdseed","adx","smap","clflushopt","clwb","sha_ni","xsaveopt","xsavec","xgetbv1","xsaves","clzero","xsaveerptr","wbnoinvd","arat","npt","nrip_save","umip","pku","ospke","vaes","vpclmulqdq","rdpid","fsrm","arch_capabilities"],"cores":[{"id":0,"total_threads":1,"logical_processors":[0]},{"id":1,"total_threads":1,"logical_processors":[1]},{"id":10,"total_threads":1,"logical_processors":[10]},{"id":11,"total_threads":1,"logical_processors":[11]},{"id":12,"total_threads":1,"logical_processors":[12]},{"id":13,"total_threads":1,"logical_processors":[13]},{"id":14,"total_threads":1,"logical_processors":[14]},{"id":15,"total_threads":1,"logical_processors":[15]},{"id":16,"total_threads":1,"logical_processors":[16]},{"id":17,"total_threads":1,"logical_processors":[17]},{"id":18,"total_threads":1,"logical_processors":[18]},{"id":19,"total_threads":1,"logical_processors":[19]},{"id":2,"total_threads":1,"logical_processors":[2]},{"id":20,"total_threads":1,"logical_processors":[20]},{"id":21,"total_threads":1,"logical_processors":[21]},{"id":22,"total_threads":1,"logical_processors":[22]},{"id":23,"total_threads":1,"logical_processors":[23]
},{"id":24,"total_threads":1,"logical_processors":[24]},{"id":25,"total_threads":1,"logical_processors":[25]},{"id":26,"total_threads":1,"logical_processors":[26]},{"id":27,"total_threads":1,"logical_processors":[27]},{"id":28,"total_threads":1,"logical_processors":[28]},{"id":29,"total_threads":1,"logical_processors":[29]},{"id":3,"total_threads":1,"logical_processors":[3]},{"id":30,"total_threads":1,"logical_processors":[30]},{"id":31,"total_threads":1,"logical_processors":[31]},{"id":32,"total_threads":1,"logical_processors":[32]},{"id":33,"total_threads":1,"logical_processors":[33]},{"id":34,"total_threads":1,"logical_processors":[34]},{"id":35,"total_threads":1,"logical_processors":[35]},{"id":36,"total_threads":1,"logical_processors":[36]},{"id":37,"total_threads":1,"logical_processors":[37]},{"id":38,"total_threads":1,"logical_processors":[38]},{"id":39,"total_threads":1,"logical_processors":[39]},{"id":4,"total_threads":1,"logical_processors":[4]},{"id":40,"total_threads":1,"logical_processors":[40]},{"id":41,"total_threads":1,"logical_processors":[41]},{"id":42,"total_threads":1,"logical_processors":[42]},{"id":43,"total_threads":1,"logical_processors":[43]},{"id":44,"total_threads":1,"logical_processors":[44]},{"id":45,"total_threads":1,"logical_processors":[45]},{"id":46,"total_threads":1,"logical_processors":[46]},{"id":47,"total_threads":1,"logical_processors":[47]},{"id":48,"total_threads":1,"logical_processors":[48]},{"id":49,"total_threads":1,"logical_processors":[49]},{"id":5,"total_threads":1,"logical_processors":[5]},{"id":50,"total_threads":1,"logical_processors":[50]},{"id":51,"total_threads":1,"logical_processors":[51]},{"id":52,"total_threads":1,"logical_processors":[52]},{"id":53,"total_threads":1,"logical_processors":[53]},{"id":54,"total_threads":1,"logical_processors":[54]},{"id":55,"total_threads":1,"logical_processors":[55]},{"id":56,"total_threads":1,"logical_processors":[56]},{"id":57,"total_threads":1,"logical_processors":[57]},{"id":
58,"total_threads":1,"logical_processors":[58]},{"id":59,"total_threads":1,"logical_processors":[59]},{"id":6,"total_threads":1,"logical_processors":[6]},{"id":60,"total_threads":1,"logical_processors":[60]},{"id":61,"total_threads":1,"logical_processors":[61]},{"id":62,"total_threads":1,"logical_processors":[62]},{"id":63,"total_threads":1,"logical_processors":[63]},{"id":7,"total_threads":1,"logical_processors":[7]},{"id":8,"total_threads":1,"logical_processors":[8]},{"id":9,"total_threads":1,"logical_processors":[9]}]},{"id":1,"total_cores":64,"total_threads":64,"vendor":"AuthenticAMD","model":"AMD EPYC 7763 64-Core Processor","capabilities":["fpu","vme","de","pse","tsc","msr","pae","mce","cx8","apic","sep","mtrr","pge","mca","cmov","pat","pse36","clflush","mmx","fxsr","sse","sse2","ht","syscall","nx","mmxext","fxsr_opt","pdpe1gb","rdtscp","lm","rep_good","nopl","cpuid","extd_apicid","tsc_known_freq","pni","pclmulqdq","ssse3","fma","cx16","pcid","sse4_1","sse4_2","x2apic","movbe","popcnt","tsc_deadline_timer","aes","xsave","avx","f16c","rdrand","hypervisor","lahf_lm","cmp_legacy","svm","cr8_legacy","abm","sse4a","misalignsse","3dnowprefetch","osvw","perfctr_core","invpcid_single","ssbd","ibrs","ibpb","stibp","vmmcall","fsgsbase","tsc_adjust","bmi1","avx2","smep","bmi2","erms","invpcid","rdseed","adx","smap","clflushopt","clwb","sha_ni","xsaveopt","xsavec","xgetbv1","xsaves","clzero","xsaveerptr","wbnoinvd","arat","npt","nrip_save","umip","pku","ospke","vaes","vpclmulqdq","rdpid","fsrm","arch_capabilities"],"cores":[{"id":36,"total_threads":1,"logical_processors":[100]},{"id":37,"total_threads":1,"logical_processors":[101]},{"id":38,"total_threads":1,"logical_processors":[102]},{"id":39,"total_threads":1,"logical_processors":[103]},{"id":40,"total_threads":1,"logical_processors":[104]},{"id":41,"total_threads":1,"logical_processors":[105]},{"id":42,"total_threads":1,"logical_processors":[106]},{"id":43,"total_threads":1,"logical_processors":[107]},{"id":44,"total_
threads":1,"logical_processors":[108]},{"id":45,"total_threads":1,"logical_processors":[109]},{"id":46,"total_threads":1,"logical_processors":[110]},{"id":47,"total_threads":1,"logical_processors":[111]},{"id":48,"total_threads":1,"logical_processors":[112]},{"id":49,"total_threads":1,"logical_processors":[113]},{"id":50,"total_threads":1,"logical_processors":[114]},{"id":51,"total_threads":1,"logical_processors":[115]},{"id":52,"total_threads":1,"logical_processors":[116]},{"id":53,"total_threads":1,"logical_processors":[117]},{"id":54,"total_threads":1,"logical_processors":[118]},{"id":55,"total_threads":1,"logical_processors":[119]},{"id":56,"total_threads":1,"logical_processors":[120]},{"id":57,"total_threads":1,"logical_processors":[121]},{"id":58,"total_threads":1,"logical_processors":[122]},{"id":59,"total_threads":1,"logical_processors":[123]},{"id":60,"total_threads":1,"logical_processors":[124]},{"id":61,"total_threads":1,"logical_processors":[125]},{"id":62,"total_threads":1,"logical_processors":[126]},{"id":63,"total_threads":1,"logical_processors":[127]},{"id":0,"total_threads":1,"logical_processors":[64]},{"id":1,"total_threads":1,"logical_processors":[65]},{"id":2,"total_threads":1,"logical_processors":[66]},{"id":3,"total_threads":1,"logical_processors":[67]},{"id":4,"total_threads":1,"logical_processors":[68]},{"id":5,"total_threads":1,"logical_processors":[69]},{"id":6,"total_threads":1,"logical_processors":[70]},{"id":7,"total_threads":1,"logical_processors":[71]},{"id":8,"total_threads":1,"logical_processors":[72]},{"id":9,"total_threads":1,"logical_processors":[73]},{"id":10,"total_threads":1,"logical_processors":[74]},{"id":11,"total_threads":1,"logical_processors":[75]},{"id":12,"total_threads":1,"logical_processors":[76]},{"id":13,"total_threads":1,"logical_processors":[77]},{"id":14,"total_threads":1,"logical_processors":[78]},{"id":15,"total_threads":1,"logical_processors":[79]},{"id":16,"total_threads":1,"logical_processors":[80]},{"id":17
,"total_threads":1,"logical_processors":[81]},{"id":18,"total_threads":1,"logical_processors":[82]},{"id":19,"total_threads":1,"logical_processors":[83]},{"id":20,"total_threads":1,"logical_processors":[84]},{"id":21,"total_threads":1,"logical_processors":[85]},{"id":22,"total_threads":1,"logical_processors":[86]},{"id":23,"total_threads":1,"logical_processors":[87]},{"id":24,"total_threads":1,"logical_processors":[88]},{"id":25,"total_threads":1,"logical_processors":[89]},{"id":26,"total_threads":1,"logical_processors":[90]},{"id":27,"total_threads":1,"logical_processors":[91]},{"id":28,"total_threads":1,"logical_processors":[92]},{"id":29,"total_threads":1,"logical_processors":[93]},{"id":30,"total_threads":1,"logical_processors":[94]},{"id":31,"total_threads":1,"logical_processors":[95]},{"id":32,"total_threads":1,"logical_processors":[96]},{"id":33,"total_threads":1,"logical_processors":[97]},{"id":34,"total_threads":1,"logical_processors":[98]},{"id":35,"total_threads":1,"logical_processors":[99]}]}]},"memory":{"total_physical_bytes":515396075520,"total_usable_bytes":506978603008,"supported_page_sizes":[1073741824,2097152],"modules":null},"gpu":{"cards":[{"address":"0000:00:02.0","index":0,"pci":{"driver":"cirrus","address":"0000:00:02.0","vendor":{"id":"1013","name":"Cirrus Logic"},"product":{"id":"00b8","name":"GD 5446"},"revision":"0x00","subsystem":{"id":"1100","name":"QEMU Virtual Machine"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}},{"address":"0000:00:05.0","index":1,"pci":{"driver":"nvidia","address":"0000:00:05.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA 
controller"}}},{"address":"0000:00:06.0","index":2,"pci":{"driver":"nvidia","address":"0000:00:06.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}},{"address":"0000:00:07.0","index":3,"pci":{"driver":"nvidia","address":"0000:00:07.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}},{"address":"0000:00:08.0","index":4,"pci":{"driver":"nvidia","address":"0000:00:08.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}},{"address":"0000:00:09.0","index":5,"pci":{"driver":"nvidia","address":"0000:00:09.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}},{"address":"0000:00:0a.0","index":6,"pci":{"driver":"nvidia","address":"0000:00:0a.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display 
controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}},{"address":"0000:00:0b.0","index":7,"pci":{"driver":"nvidia","address":"0000:00:0b.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}},{"address":"0000:00:0c.0","index":8,"pci":{"driver":"nvidia","address":"0000:00:0c.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}}}]},"pci":{"Devices":[{"driver":"","address":"0000:00:00.0","vendor":{"id":"8086","name":"Intel Corporation"},"product":{"id":"1237","name":"440FX - 82441FX PMC [Natoma]"},"revision":"0x02","subsystem":{"id":"1100","name":"Qemu virtual machine"},"class":{"id":"06","name":"Bridge"},"subclass":{"id":"00","name":"Host bridge"},"programming_interface":{"id":"00","name":"unknown"}},{"driver":"","address":"0000:00:01.0","vendor":{"id":"8086","name":"Intel Corporation"},"product":{"id":"7000","name":"82371SB PIIX3 ISA [Natoma/Triton II]"},"revision":"0x00","subsystem":{"id":"1100","name":"Qemu virtual machine"},"class":{"id":"06","name":"Bridge"},"subclass":{"id":"01","name":"ISA bridge"},"programming_interface":{"id":"00","name":"unknown"}},{"driver":"ata_piix","address":"0000:00:01.1","vendor":{"id":"8086","name":"Intel Corporation"},"product":{"id":"7010","name":"82371SB PIIX3 IDE [Natoma/Triton II]"},"revision":"0x00","subsystem":{"id":"1100","name":"Qemu virtual machine"},"class":{"id":"01","name":"Mass storage 
controller"},"subclass":{"id":"01","name":"IDE interface"},"programming_interface":{"id":"80","name":"ISA Compatibility mode-only controller, supports bus mastering"}},{"driver":"uhci_hcd","address":"0000:00:01.2","vendor":{"id":"8086","name":"Intel Corporation"},"product":{"id":"7020","name":"82371SB PIIX3 USB [Natoma/Triton II]"},"revision":"0x01","subsystem":{"id":"1100","name":"QEMU Virtual Machine"},"class":{"id":"0c","name":"Serial bus controller"},"subclass":{"id":"03","name":"USB controller"},"programming_interface":{"id":"00","name":"UHCI"}},{"driver":"piix4_smbus","address":"0000:00:01.3","vendor":{"id":"8086","name":"Intel Corporation"},"product":{"id":"7113","name":"82371AB/EB/MB PIIX4 ACPI"},"revision":"0x03","subsystem":{"id":"1100","name":"Qemu virtual machine"},"class":{"id":"06","name":"Bridge"},"subclass":{"id":"80","name":"Bridge"},"programming_interface":{"id":"00","name":"unknown"}},{"driver":"cirrus","address":"0000:00:02.0","vendor":{"id":"1013","name":"Cirrus Logic"},"product":{"id":"00b8","name":"GD 5446"},"revision":"0x00","subsystem":{"id":"1100","name":"QEMU Virtual Machine"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"virtio-pci","address":"0000:00:03.0","vendor":{"id":"1af4","name":"Red Hat, Inc."},"product":{"id":"1000","name":"Virtio network device"},"revision":"0x00","subsystem":{"id":"0001","name":"unknown"},"class":{"id":"02","name":"Network controller"},"subclass":{"id":"00","name":"Ethernet controller"},"programming_interface":{"id":"00","name":"unknown"}},{"driver":"virtio-pci","address":"0000:00:04.0","vendor":{"id":"1af4","name":"Red Hat, Inc."},"product":{"id":"1001","name":"Virtio block device"},"revision":"0x00","subsystem":{"id":"0002","name":"unknown"},"class":{"id":"01","name":"Mass storage controller"},"subclass":{"id":"00","name":"SCSI storage 
controller"},"programming_interface":{"id":"00","name":"unknown"}},{"driver":"nvidia","address":"0000:00:05.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"nvidia","address":"0000:00:06.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"nvidia","address":"0000:00:07.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"nvidia","address":"0000:00:08.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"nvidia","address":"0000:00:09.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA 
controller"}},{"driver":"nvidia","address":"0000:00:0a.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"nvidia","address":"0000:00:0b.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"nvidia","address":"0000:00:0c.0","vendor":{"id":"10de","name":"NVIDIA Corporation"},"product":{"id":"2230","name":"GA102GL [RTX A6000]"},"revision":"0xa1","subsystem":{"id":"1459","name":"unknown"},"class":{"id":"03","name":"Display controller"},"subclass":{"id":"00","name":"VGA compatible controller"},"programming_interface":{"id":"00","name":"VGA controller"}},{"driver":"virtio-pci","address":"0000:00:0d.0","vendor":{"id":"1af4","name":"Red Hat, Inc."},"product":{"id":"1002","name":"Virtio memory balloon"},"revision":"0x00","subsystem":{"id":"0005","name":"unknown"},"class":{"id":"00","name":"Unclassified device"},"subclass":{"id":"ff","name":"unknown"},"programming_interface":{"id":"00","name":"unknown"}},{"driver":"virtio-pci","address":"0000:00:0e.0","vendor":{"id":"1af4","name":"Red Hat, Inc."},"product":{"id":"1005","name":"Virtio RNG"},"revision":"0x00","subsystem":{"id":"0004","name":"unknown"},"class":{"id":"00","name":"Unclassified device"},"subclass":{"id":"ff","name":"unknown"},"programming_interface":{"id":"00","name":"unknown"}}]}}
andy108369 commented 3 months ago

OK, this time the cause was a faulty NVIDIA driver that had locked up.

The solution was to upgrade the NVIDIA driver from 550.54.14 to 550.54.15 and reboot the nodes.
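A quick way to confirm which driver a node is running before and after the upgrade is sketched below. This is only a sketch: `driver_older_than` and `MIN_DRIVER` are hypothetical names introduced here, the comparison assumes GNU `sort -V` is available, and the `nvidia-smi` query only runs when the tool is present on the host.

```shell
#!/usr/bin/env bash
# Minimum driver version, taken from the upgrade described in this thread.
MIN_DRIVER="550.54.15"

# Hypothetical helper: exit status 0 when version $1 is strictly older than $2.
# Assumes GNU coreutils `sort -V` (version sort).
driver_older_than() {
  [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Query the installed driver only when nvidia-smi exists on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
  current="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
  if driver_older_than "$current" "$MIN_DRIVER"; then
    echo "driver $current is older than $MIN_DRIVER: upgrade and reboot the node"
  else
    echo "driver $current is at or above $MIN_DRIVER"
  fi
fi
```

Run it on each GPU node; a node still on 550.54.14 should print the "upgrade and reboot" line.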

NVIDIA driver 550.54.15 fixes this. From the release notes: "Fixed a potential corruption when launching kernels on H100 GPUs, which is more likely to occur when the GPU is shared between multiple processes. This may manifest in XID 13 errors such as Graphics Exception: SKEDCHECK11_TOTAL_THREADS. This issue has no user-controllable workaround and is fixable by updating to driver 550.54.15 or higher." 4537349

Refs. https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-54-15/index.html
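Since the release note mentions XID 13 errors, one way to check whether a node has logged any is to scan the kernel log. The sketch below makes assumptions: `count_xid13` is a hypothetical helper, and the pattern assumes the driver logs lines shaped like `NVRM: Xid (PCI:...): 13, ...`, which may differ between driver versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads kernel log text on stdin and prints how many
# lines look like NVIDIA Xid 13 reports.
# Assumed line shape: "NVRM: Xid (PCI:0000:00:05.0): 13, ..."
count_xid13() {
  # grep -c prints 0 but exits non-zero when nothing matches; keep status 0.
  grep -cE 'NVRM: Xid \([^)]*\): 13,' || true
}

# Typical use on a GPU node (reading the kernel log may require root):
#   dmesg | count_xid13
#   journalctl -k | count_xid13
```

A non-zero count on a node that also misreports its GPUs would be consistent with the driver problem described above, though it does not prove it.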

The report looks good now: [image]

TODO

andy108369 commented 3 months ago

Informed the rest of the providers https://discord.com/channels/747885925232672829/1111749248527114322/1221639223640326155