I have also asked Netdata to add an alert for the "GPU has fallen off the bus" message: https://github.com/netdata/netdata/discussions/17331
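Until such an alert exists upstream, a minimal stand-in (just a sketch, not part of this setup - the webhook URL is a placeholder) is a small script run from cron/systemd that greps the kernel log for Xid 79 / "fallen off the bus" and posts a notification:

#!/usr/bin/env bash
# Sketch: notify when the kernel log reports a GPU falling off the bus (Xid 79).
# WEBHOOK_URL is a placeholder - point it at your own alerting endpoint.
WEBHOOK_URL="https://example.com/hooks/gpu-alerts"

if dmesg -T | grep -Eq 'NVRM: Xid .*: 79,|GPU has fallen off the bus'; then
  msg="$(hostname): GPU has fallen off the bus (Xid 79) detected in dmesg"
  curl -fsS -X POST -H 'Content-Type: application/json' \
       -d "{\"text\": \"${msg}\"}" "$WEBHOOK_URL"
fi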
GPU is back after reboot:
root@node1:~# nvidia-smi
Sat Apr 6 22:27:30 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:01:00.0 Off | Off |
| 0% 27C P8 16W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 Off | 00000000:25:00.0 Off | Off |
| 0% 26C P8 9W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 Off | Off |
| 0% 25C P8 11W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce RTX 4090 Off | 00000000:61:00.0 Off | Off |
| 0% 27C P8 8W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce RTX 4090 Off | 00000000:81:00.0 Off | Off |
| 0% 25C P8 13W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA GeForce RTX 4090 Off | 00000000:A1:00.0 Off | Off |
| 0% 25C P8 14W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA GeForce RTX 4090 Off | 00000000:C1:00.0 Off | Off |
| 0% 25C P8 14W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA GeForce RTX 4090 Off | 00000000:E1:00.0 Off | Off |
| 0% 25C P8 11W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
root@node1:~# dmesg -T | grep NVRAM
root@node1:~#
null -> because it wants 2 GPUs but the provider can only offer 1 GPU; hence it can't find a single node that can offer 2 GPUs.

root@node1:~# kubectl get pods -A --sort-by='{.metadata.creationTimestamp}' -o wide |grep -vw Running
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
rook-ceph rook-ceph-osd-prepare-node2-bhdvz 0/1 Completed 0 4m5s 10.233.75.42 node2 <none> <none>
rook-ceph rook-ceph-osd-prepare-node3-7j49n 0/1 Completed 0 4m2s 10.233.71.10 node3 <none> <none>
eogo7cjtebo8fr0g9l1mfmo17j5r3efi1hitlpki3g04g service-1-84988c5fb6-9fm2t 0/1 Pending 0 2m10s <none> <none> <none> <none>
root@node1:~# pods_json=$(kubectl get pods -A --sort-by='{.metadata.creationTimestamp}' -o json)
( echo -e "NAMESPACE\tNAME\t\t\t\tREADY\tSTATUS\tRESTARTS\tAGE\tGPU\tNODE"; echo "$pods_json" | jq -r '.items[] | select(.spec.containers[].resources.requests."nvidia.com/gpu" != null) | "\(.metadata.namespace)\t\(.metadata.name)\t\(.status.containerStatuses[0].ready)/\(.spec.containers | length)\t\(.status.phase)\t\(.status.containerStatuses[0].restartCount)\t\(.metadata.creationTimestamp)\tYES\t\(.spec.nodeName)"' ) | column -t
NAMESPACE NAME READY STATUS RESTARTS AGE GPU NODE
rfij4esvggf9cqqnpf2hq266o0nba01t5iq918bu1v9iu service-1-58bf676fdc-d9hqt true/1 Running 0 2024-04-04T18:06:33Z YES node2
1qhlsoi0sqj2rot1otov7vhfao2j0cnmbuvkj2qd16ese service-1-76f8b9cf6d-pfl8g true/1 Running 1 2024-04-04T18:06:33Z YES node2
d62adnou0v7b5s7h3t8gnh0av540fcok9bk56u72f3je2 service-1-7cffd45f48-mjmc4 true/1 Running 1 2024-04-04T18:06:33Z YES node2
guf6r2fhenpfljbhncip9sbei3ss43av4kaau95kl4rpq service-1-598c857c89-7xmw7 true/1 Running 0 2024-04-04T18:06:33Z YES node2
qg9lq6q8tcta1p2m9fuc1pdbjfispht8q7e7iun6t5s2e service-1-59974dfd89-qjpzg true/1 Running 1 2024-04-04T18:06:33Z YES node2
6ng2gu6vf5p8qg5bde5udse1e34igb1bn15kaeupiuhva service-1-59c44cd758-bgk9n true/1 Running 0 2024-04-04T18:06:34Z YES node2
rpurv2kf5qurt2ibliluk6e4sr627d0uopbtpqbuh1b6m service-1-5f48b8bbc4-dfdch true/1 Running 0 2024-04-04T18:16:00Z YES node2
a3me6oo1kknteim3e4g0md5eanhvpeib3h42bp7au4slm service-1-7f97dc6b94-2qxcc true/1 Running 0 2024-04-04T18:17:43Z YES node2
of3uincqjlja5ekk8cbfuormpm4dmn8v403c535f4dc4m service-1-7dd5dffbc4-tqdh5 true/1 Running 0 2024-04-06T21:58:39Z YES node3
3n3mvl6qqh1bkk41pou3dkfttkjglo6tua36udmh6n4fm service-1-dd746bf44-t22qv true/1 Running 0 2024-04-06T21:58:39Z YES node3
q79th1d7qcblh9paf7ivfnkpg3d8fd5uaa7ifn4ojt8qc service-1-7bccf77c78-rxfvs true/1 Running 0 2024-04-06T21:58:41Z YES node3
1jr4gic7tq72avp86id8ck2r0m6ppn5g9qgauqtibhtlk service-1-5f9b59549b-ftwm6 true/1 Running 0 2024-04-06T21:58:41Z YES node3
mru7v16b64p70p7gd15tav6nkot7729vdeaogvnljc7qq service-1-7f694ffdf5-qlbjz true/1 Running 0 2024-04-06T21:58:41Z YES node3
qrka4mab6esns6blt8jaeos663j0e6sbp9cfghi02jc4i service-1-84c67b446-kmdm4 true/1 Running 0 2024-04-06T21:59:13Z YES node3
dq0albqkms9kennoo5j6h01p18banva2bco1gbq2ddriq service-1-7f598bbc5c-rnwb6 true/1 Running 0 2024-04-06T21:59:46Z YES node3
tsu3ue9housp0ehjsr51psu4aambpaqvtpuninpl07hqs service-1-55fd66f6f5-l2bf5 true/1 Running 0 2024-04-06T21:59:53Z YES node3
3umvk5ct5vuq4fl2h3o56kslcfe6gh3jse6klk64vpa2k service-1-775cc5f59d-bvfv9 true/1 Running 0 2024-04-06T22:15:23Z YES node1
7gtgt1h8rr21k08c4n4t5lhhf06ggrv2uervfl13h5v56 service-1-5fcf94fc4c-vsp2x true/1 Running 0 2024-04-06T22:15:23Z YES node1
g5i1ml6bhnfso9faglp1gv167f8acegsv335hlkq0dlfc service-1-7d66cdd98c-7z7hj true/1 Running 0 2024-04-06T22:15:24Z YES node1
feb2pachqnknvuhjrgqb0pvve0egut5t7d00ur9u30nd8 service-1-6cd8f6b6f4-znxt5 true/1 Running 0 2024-04-06T22:25:19Z YES node1
ht63ne5q8esd6kh79uts05u39e03t7gfp86mbqin1shpk service-1-858bc896b4-v2xsf true/1 Running 0 2024-04-06T22:25:24Z YES node1
mp60hei3dsq8lnn3k21u1bfv87pjmei31dobelrmlvqas service-1-84897bbdb4-ckrqp true/1 Running 0 2024-04-06T22:25:43Z YES node1
6pgohd98lm7gs5rb2kv5bnc4c9920jtvfvmg4ikqvhn8a service-1-55858fc545-msrts true/1 Running 0 2024-04-06T22:26:11Z YES node1
eogo7cjtebo8fr0g9l1mfmo17j5r3efi1hitlpki3g04g service-1-84988c5fb6-9fm2t null/1 Pending null 2024-04-06T22:28:30Z YES null
root@node1:~#
root@node1:~# ( echo -e "NAMESPACE\tNAME\t\t\t\tREADY\tSTATUS\tRESTARTS\tAGE\tGPU\tNODE"; echo "$pods_json" | jq -r '.items[] | select(.spec.containers[].resources.requests."nvidia.com/gpu" != null) | "\(.metadata.namespace)\t\(.metadata.name)\t\(.status.containerStatuses[0].ready)/\(.spec.containers | length)\t\(.status.phase)\t\(.status.containerStatuses[0].restartCount)\t\(.metadata.creationTimestamp)\tYES\t\(.spec.nodeName)"' ) | column -t | grep -v ^NAMESPACE | wc -l
24
root@node1:~# kubectl -n $ns describe pod service-1-84988c5fb6-9fm2t
Name: service-1-84988c5fb6-9fm2t
Namespace: eogo7cjtebo8fr0g9l1mfmo17j5r3efi1hitlpki3g04g
Priority: 0
Runtime Class Name: nvidia
Service Account: default
Node: <none>
Labels: akash.network=true
akash.network/manifest-service=service-1
akash.network/namespace=eogo7cjtebo8fr0g9l1mfmo17j5r3efi1hitlpki3g04g
pod-template-hash=84988c5fb6
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/service-1-84988c5fb6
Containers:
service-1:
Image: 0lav/nimble-miner-public
Ports: 22/TCP, 80/TCP
Host Ports: 0/TCP, 0/TCP
Command:
bash
-c
Args:
apt-get update ; apt-get upgrade -y ; apt install -y ssh; echo "PermitRootLogin yes" >> /etc/ssh/sshd_config ; (echo $SSH_PASS; echo $SSH_PASS) | passwd root ; service ssh start; echo ==== ssh user:"root" === ; echo === ssh pass:"$SSH_PASS" === ; sleep infinity
Limits:
cpu: 16
ephemeral-storage: 120G
memory: 16G
nvidia.com/gpu: 1
Requests:
cpu: 16
ephemeral-storage: 120G
memory: 16G
nvidia.com/gpu: 1
Environment:
SSH_PASS: REDACTED
AKASH_GROUP_SEQUENCE: 1
AKASH_DEPLOYMENT_SEQUENCE: 15732906
AKASH_ORDER_SEQUENCE: 1
AKASH_OWNER: akash1rpcl7spcemj9w0qyd4sweqa9chh9j3y622e2lp
AKASH_PROVIDER: akash1t0sk5nhc8n3xply5ft60x9det0s7jwplzzycnv
AKASH_CLUSTER_PUBLIC_HOSTNAME: provider.pdx.nb.akash.pub
Mounts: <none>
Conditions:
Type Status
PodScheduled False
Volumes: <none>
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m30s default-scheduler 0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
root@node1:~#
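For reference, one quick way to cross-check a FailedScheduling event like the above is to compare each node's allocatable nvidia.com/gpu against what is already committed on it; a sketch using plain kubectl/jq (nothing cluster-specific assumed):

# Allocatable GPUs per node (as exposed by the NVIDIA device plugin)
kubectl get nodes -o json \
  | jq -r '.items[] | "\(.metadata.name)\t\(.status.allocatable["nvidia.com/gpu"] // "0")"'

# What is already requested on a given node (GPU shows up under "Allocated resources")
kubectl describe node node1 | grep -A 10 "Allocated resources"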
I've asked the owner to redeploy that dseq 15732906 (close & redeploy anew).
Collected nvidia-bug-report output: node1.pdx.nb.akash.pub:/root/nvidia-bug-report.log.gz
Issue reoccurred
The issue reoccurred for the 3rd time:
root@node1:~# uptime
08:38:19 up 15:34, 1 user, load average: 6.49, 7.16, 7.32
root@node1:~# dmesg -T | grep NVRM
[Mon Apr 8 17:04:05 2024] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.161.07 Sat Feb 17 22:55:48 UTC 2024
[Tue Apr 9 01:21:38 2024] NVRM: GPU at PCI:0000:a1:00: GPU-979426f2-893a-7cbb-c4cf-81472f89a462
[Tue Apr 9 01:21:38 2024] NVRM: Xid (PCI:0000:a1:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[Tue Apr 9 01:21:38 2024] NVRM: GPU 0000:a1:00.0: GPU has fallen off the bus.
[Tue Apr 9 01:21:38 2024] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
I've cordoned node1.pdx.nb so it won't participate in the provider's resource scheduling until the provider fixes it.
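For context, the cordon itself is a one-liner (node name as in the outputs above):

# Mark the node unschedulable so no new pods/leases land on it
kubectl cordon node1
# Verify: STATUS shows Ready,SchedulingDisabled
kubectl get node node1 -o wide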
Bug report submitted with the GPU crash dump data => https://forums.developer.nvidia.com/t/xid-79-error-gpu-falls-off-bus-with-nvidia-driver-535-161-07-on-ubuntu-22-04-lts-server/288976
NebulaBlock is going to replace the node1.pdx.nb.akash.pub server from 9:30am to 11:30am PT in order to fix the 4090 GPU issue.
I've scaled the akash-provider service down until that's complete. This will inevitably drop the total 4090 GPU count by 24 on the stats page https://akash.network/gpus/ until the provider is back up again.
https://discord.com/channels/747885925232672829/1111749348351553587/1227292077369589842
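For reference, pausing/resuming the provider is typically just a scale operation (assuming the provider runs as the akash-provider statefulset in the akash-services namespace - adjust to your setup):

# Pause the provider while the node is being replaced
kubectl -n akash-services scale statefulset/akash-provider --replicas=0
# Bring it back once the hardware work is done
kubectl -n akash-services scale statefulset/akash-provider --replicas=1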
The node1.pdx.nb.akash.pub server has been successfully replaced - the server (mainboard), the 8x 4090 GPUs, and the 1x 1.75T disk (used for ceph) are new; the 2x 7T (raid1) rootfs disks were kept.
Good news: rook-ceph (Akash's persistent storage) picked up the new 1.75T disk on the new node1.pdx correctly!
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 5.23979 root default
-5 1.74660 host node1
0 nvme 1.74660 osd.0 up 1.00000 1.00000
-7 1.74660 host node2
1 nvme 1.74660 osd.1 up 1.00000 1.00000
-3 1.74660 host node3
2 nvme 1.74660 osd.2 up 1.00000 1.00000
Ceph is currently copying the PG replicas to it :slight_smile:
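Recovery/backfill progress can be watched from the same toolbox pod, for example:

kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph -s
kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph pg stat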
I've updated the nvidia ticket: https://forums.developer.nvidia.com/t/xid-79-error-rtx-4090-gpu-falls-off-bus-with-nvidia-driver-535-161-07-on-ubuntu-22-04-lts-server/288976/2?u=andrey.arapov
Will reopen this issue if it reoccurs.
Reason - GPU issue on node1