Closed lianziqt closed 3 months ago
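I looked at the code in pkg/util/nodelock/nodelock.go. What seems odd is that the only lock-related message in the hami-device-plugin log is Node lock not set; I did not see any other lock-related logs.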
kubectl describe pods pdf-model-deploy-fdfc65bfc-xg9sp -n volc-content-mng
Name: pdf-model-deploy-fdfc65bfc-xg9sp
Namespace: volc-content-mng
Priority: 0
Node: 192.168.0.5/
Start Time: Tue, 04 Jun 2024 16:17:59 +0800
Labels: app=pdf-model-deploy
pod-template-hash=fdfc65bfc
Annotations: consul.register/enabled: true
consul.register/enabled.podip: true
consul.register/port.9899: caijing.algo.pdf_onnx_model
Status: Failed
Reason: UnexpectedAdmissionError
Message: Pod Allocate failed due to rpc error: code = Unknown desc = no binding pod found on node 192.168.0.5, which is unexpected
IP:
IPs: <none>
Controlled By: ReplicaSet/pdf-model-deploy-fdfc65bfc
Containers:
pdf-model-deploy-main:
Image: dockerhub.vpc.com:5000/volc_cms/volc.content.pdf_model_deploy:1.0.0.16
Port: 9899/TCP
Host Port: 0/TCP
Command:
bash
/opt/tiger/caijing_pdf_model_deploy/deploy_pdf_model_service.sh
v2
Limits:
cpu: 4
memory: 12Gi
nvidia.com/gpu: 1
Requests:
cpu: 1
memory: 4Gi
nvidia.com/gpu: 1
Environment:
OnPremise: 1
IS_ON_PREMISE: true
HOST_IP: (v1:status.hostIP)
MY_HOST_IP: (v1:status.hostIP)
RUNTIME_IDC_NAME: pri
POD_IP: (v1:status.podIP)
CONSUL_HTTP_HOST: (v1:status.hostIP)
PORT: 9899
USE_P2P: false
ENABLE_MPS: false
GPUS: 1
USE_MULTI_PROCESS: false
NUM_WORKERS_PER_GROUP: 1
CONF: /opt/tiger/caijing_pdf_model_deploy/conf/rpc.conf
LOG4J_CONF: /opt/tiger/caijing_pdf_model_deploy/conf/log4j.xml
LOG4J_CHILD_CONF: /opt/tiger/caijing_pdf_model_deploy/conf/log4j_child.xml
ENABLE_TRACE: false
ENABLE_DYNAMIC_CONFIG: false
ENABLE_REPORT_QPS: false
DYNAMIC_GPU_NUM: 1
QS_RUNTIME_ENABLE_WARM_UP: 0
Mounts:
/opt/tmp/sock from sock (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-9t8sb (ro)
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: pdf-model-deploy-config
Optional: false
sock:
Type: HostPath (bare host directory volume)
Path: /opt/tmp/sock
HostPathType:
default-token-9t8sb:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-9t8sb
Optional: false
QoS Class: Burstable
Node-Selectors: machine=gpu
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 31s default-scheduler Successfully assigned volc-content-mng/pdf-model-deploy-fdfc65bfc-xg9sp to 192.168.0.5
Warning UnexpectedAdmissionError 31s kubelet Allocate failed due to rpc error: code = Unknown desc = no binding pod found on node 192.168.0.5, which is unexpected
kubectl describe nodes 192.168.0.5
Name: 192.168.0.5
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
elasticsearch=true
gpu=on
kubernetes.io/arch=amd64
kubernetes.io/hostname=192.168.0.5
kubernetes.io/os=linux
machine=gpu
node-role.kubernetes.io/worker=true
shard-replica=1-1
Annotations: flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"5a:2a:50:05:fb:ea"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 192.168.0.5
hami.io/node-handshake: Requesting_2024.06.04 08:23:02
hami.io/node-handshake-dcu: Requesting_2024.05.28 13:01:11
hami.io/node-nvidia-register:
GPU-14637886-1d37-61c4-a71e-fb5ca495ef65,10,24258,100,NVIDIA-NVIDIA A30,0,true:GPU-4edca201-9195-cf92-98c7-e7c2d627742b,10,24258,100,NVIDI...
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 192.168.0.5/24
projectcalico.org/IPv4IPIPTunnelAddr: 10.42.3.1
rke.cattle.io/external-ip: 192.168.0.5
rke.cattle.io/internal-ip: 192.168.0.5
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 01 Apr 2024 16:06:14 +0800
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: 192.168.0.5
AcquireTime: <unset>
RenewTime: Tue, 04 Jun 2024 16:23:22 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Mon, 01 Apr 2024 16:06:38 +0800 Mon, 01 Apr 2024 16:06:38 +0800 FlannelIsUp Flannel is running on this node
MemoryPressure False Tue, 04 Jun 2024 16:19:33 +0800 Mon, 01 Apr 2024 16:06:14 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 04 Jun 2024 16:19:33 +0800 Mon, 01 Apr 2024 16:06:14 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 04 Jun 2024 16:19:33 +0800 Mon, 01 Apr 2024 16:06:14 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 04 Jun 2024 16:19:33 +0800 Mon, 01 Apr 2024 16:06:45 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.0.5
Hostname: 192.168.0.5
Capacity:
cpu: 56
ephemeral-storage: 154502324Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 482970120Ki
nvidia.com/gpu: 20
pods: 110
Allocatable:
cpu: 54
ephemeral-storage: 140241857915
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 479312392Ki
nvidia.com/gpu: 20
pods: 110
System Info:
Machine ID:
System UUID: 000C65FA-CB33-0000-0009-80D401443FEB
Boot ID: 3741823c-3c9c-49ad-a9ef-dd3e77e5cb59
Kernel Version: 3.10.0-1160.102.1.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://20.10.17
Kubelet Version: v1.19.16
Kube-Proxy Version: v1.19.16
PodCIDR: 10.42.3.0/24
PodCIDRs: 10.42.3.0/24
Pods with 'privileged: true' can't be scheduled, because there is no way to limit their GPU resources.
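1. Issue or feature description
I'd like to ask about some problems with using HAMi. After installing HAMi on a node with GPUs, every attempt to bring up a deployment fails, and the pod status shows UnexpectedAdmissionError. Checking the HAMi plugin log with kubectl logs -f -n kube-system hami-device-plugin-tlxs8 -c device-plugin shows Node lock not set. Details are below.
2. Steps to reproduce the issue
I checked the pods with kubectl get pods -n kube-system. kubectl describe nodes 192.168.0.5 also shows the expected number of GPUs (2 * 10); 192.168.0.5 is the node name.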
I0531 09:46:29.283565 3260597 register.go:131] MemoryScaling= 1 registeredmem= 24258
I0531 09:46:29.325813 3260597 register.go:83] "current card has not established numa topology" gpu row info=["GPU0"," X ","SYS","0-55","N/A"] index=0
I0531 09:46:29.325865 3260597 register.go:159] nvml registered device id=1, memory=24258, type=NVIDIA A30, numa=0
I0531 09:46:29.326031 3260597 register.go:131] MemoryScaling= 1 registeredmem= 24258
I0531 09:46:29.336827 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:29.373783 3260597 register.go:83] "current card has not established numa topology" gpu row info=["GPU1","SYS"," X ","0-55","N/A"] index=1
I0531 09:46:29.373804 3260597 register.go:159] nvml registered device id=2, memory=24258, type=NVIDIA A30, numa=0
I0531 09:46:29.373819 3260597 register.go:166] "start working on the devices" devices=[{"Index":0,"Id":"GPU-14637886-1d37-61c4-a71e-fb5ca495ef65","Count":10,"Devmem":24258,"Devcore":100,"Type":"NVIDIA-NVIDIA A30","Numa":0,"Health":true},{"Index":0,"Id":"GPU-4edca201-9195-cf92-98c7-e7c2d627742b","Count":10,"Devmem":24258,"Devcore":100,"Type":"NVIDIA-NVIDIA A30","Numa":0,"Health":true}]
I0531 09:46:29.376915 3260597 util.go:128] Encoded node Devices: GPU-14637886-1d37-61c4-a71e-fb5ca495ef65,10,24258,100,NVIDIA-NVIDIA A30,0,true:GPU-4edca201-9195-cf92-98c7-e7c2d627742b,10,24258,100,NVIDIA-NVIDIA A30,0,true:
I0531 09:46:29.376943 3260597 register.go:176] patch node with the following annos map[hami.io/node-handshake:Reported 2024-05-31 09:46:29.376930027 +0000 UTC m=+170072.375319578 hami.io/node-nvidia-register:GPU-14637886-1d37-61c4-a71e-fb5ca495ef65,10,24258,100,NVIDIA-NVIDIA A30,0,true:GPU-4edca201-9195-cf92-98c7-e7c2d627742b,10,24258,100,NVIDIA-NVIDIA A30,0,true:]
I0531 09:46:29.383974 3260597 register.go:196] Successfully registered annotation. Next check in 30s seconds...
I0531 09:46:29.794603 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-3],}]
I0531 09:46:29.924314 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:30.397174 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-4edca201-9195-cf92-98c7-e7c2d627742b-8],}]
I0531 09:46:30.534171 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:30.996050 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-4],}]
I0531 09:46:31.128008 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:31.596925 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-9],}]
I0531 09:46:31.728817 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:32.195546 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-4],}]
I0531 09:46:32.327801 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:32.797415 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-6],}]
I0531 09:46:32.929690 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:33.395637 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-4edca201-9195-cf92-98c7-e7c2d627742b-4],}]
I0531 09:46:33.533039 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:33.995417 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-7],}]
I0531 09:46:34.128932 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:34.796967 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-4edca201-9195-cf92-98c7-e7c2d627742b-8],}]
I0531 09:46:34.929314 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:35.398659 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-1],}]
I0531 09:46:35.534839 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:35.995026 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-4edca201-9195-cf92-98c7-e7c2d627742b-5],}]
I0531 09:46:36.133031 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:36.996326 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-4edca201-9195-cf92-98c7-e7c2d627742b-1],}]
I0531 09:46:37.132472 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
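The "Encoded node Devices" string in the device-plugin log appears to be a list of colon-terminated records with seven comma-separated fields each, matching the Id, Count, Devmem, Devcore, Type, Numa and Health fields printed in the "start working on the devices" entry. The sketch below decodes that layout and sums Count to check the expected 2 * 10 = 20 GPU slices. The field order is inferred from the log only, and `DeviceInfo`, `decodeNodeDevices` and `totalSlices` are illustrative names, not HAMi's API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// DeviceInfo mirrors the fields seen in the log; the layout is inferred, not official.
type DeviceInfo struct {
	ID      string
	Count   int // number of slices the physical GPU is split into
	Devmem  int // device memory in MiB
	Devcore int // core percentage
	Type    string
	Numa    int
	Health  bool
}

// decodeNodeDevices parses "id,count,mem,core,type,numa,health:" records.
func decodeNodeDevices(s string) ([]DeviceInfo, error) {
	var out []DeviceInfo
	for _, rec := range strings.Split(s, ":") {
		if strings.TrimSpace(rec) == "" {
			continue // the trailing ':' leaves an empty record
		}
		f := strings.Split(rec, ",")
		if len(f) != 7 {
			return nil, fmt.Errorf("expected 7 fields, got %d in %q", len(f), rec)
		}
		count, err := strconv.Atoi(f[1])
		if err != nil {
			return nil, fmt.Errorf("count: %w", err)
		}
		mem, err := strconv.Atoi(f[2])
		if err != nil {
			return nil, fmt.Errorf("devmem: %w", err)
		}
		core, err := strconv.Atoi(f[3])
		if err != nil {
			return nil, fmt.Errorf("devcore: %w", err)
		}
		numa, err := strconv.Atoi(f[5])
		if err != nil {
			return nil, fmt.Errorf("numa: %w", err)
		}
		health, err := strconv.ParseBool(f[6])
		if err != nil {
			return nil, fmt.Errorf("health: %w", err)
		}
		out = append(out, DeviceInfo{f[0], count, mem, core, f[4], numa, health})
	}
	return out, nil
}

// totalSlices adds up Count over healthy devices.
func totalSlices(devs []DeviceInfo) int {
	total := 0
	for _, d := range devs {
		if d.Health {
			total += d.Count
		}
	}
	return total
}

func main() {
	// Value copied from the util.go:128 log entry.
	enc := "GPU-14637886-1d37-61c4-a71e-fb5ca495ef65,10,24258,100,NVIDIA-NVIDIA A30,0,true:" +
		"GPU-4edca201-9195-cf92-98c7-e7c2d627742b,10,24258,100,NVIDIA-NVIDIA A30,0,true:"
	devs, err := decodeNodeDevices(enc)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(len(devs), "GPUs,", totalSlices(devs), "slices")
}
```

The total agrees with the nvidia.com/gpu: 20 capacity reported for the node, which suggests device registration itself is working.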
I0531 09:44:59.082870 1 util.go:128] Encoded node Devices: GPU-14637886-1d37-61c4-a71e-fb5ca495ef65,10,24258,100,NVIDIA-NVIDIA A30,0,true:GPU-4edca201-9195-cf92-98c7-e7c2d627742b,10,24258,100,NVIDIA-NVIDIA A30,0,true:
W0531 09:44:59.082892 1 scheduler.go:325] get node 192.168.0.238 device error, node 192.168.0.238 not found
W0531 09:44:59.082898 1 scheduler.go:325] get node 192.168.0.237 device error, node 192.168.0.237 not found
W0531 09:44:59.082902 1 scheduler.go:325] get node 192.168.0.32 device error, node 192.168.0.32 not found
W0531 09:44:59.083412 1 scheduler.go:325] get node 192.168.0.238 device error, node 192.168.0.238 not found
W0531 09:44:59.083426 1 scheduler.go:325] get node 192.168.0.237 device error, node 192.168.0.237 not found
W0531 09:44:59.083430 1 scheduler.go:325] get node 192.168.0.32 device error, node 192.168.0.32 not found
I0531 09:45:00.124588 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:00.125001 1 webhook.go:63] Processing admission hook for pod datafinder/, UID: 85ab3c40-28fa-4585-b63c-0fce5b15f15d
I0531 09:45:00.125016 1 webhook.go:84] Processing admission hook for pod datafinder/, UID: 85ab3c40-28fa-4585-b63c-0fce5b15f15d - Allowing admission for pod: no resource found
I0531 09:45:00.818481 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:00.818883 1 webhook.go:63] Processing admission hook for pod datafinder/, UID: 5f7610c6-8cf4-4fa1-8e99-3ea18af2a3ef
I0531 09:45:00.818899 1 webhook.go:84] Processing admission hook for pod datafinder/, UID: 5f7610c6-8cf4-4fa1-8e99-3ea18af2a3ef - Allowing admission for pod: no resource found
I0531 09:45:01.672039 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:01.672443 1 webhook.go:63] Processing admission hook for pod datarangers/, UID: c1df9597-831e-443e-b03b-d4e8ac4a44ea
I0531 09:45:01.672489 1 webhook.go:84] Processing admission hook for pod datarangers/, UID: c1df9597-831e-443e-b03b-d4e8ac4a44ea - Allowing admission for pod: no resource found
I0531 09:45:01.917897 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:01.918296 1 webhook.go:63] Processing admission hook for pod datarangers/, UID: 5c8aad5d-0233-4f23-b700-43eca87a7b96
I0531 09:45:01.918312 1 webhook.go:84] Processing admission hook for pod datarangers/, UID: 5c8aad5d-0233-4f23-b700-43eca87a7b96 - Allowing admission for pod: no resource found
I0531 09:45:02.772161 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:02.772517 1 webhook.go:63] Processing admission hook for pod minio/, UID: 586170e7-8fd5-4050-8967-944b1ac4a674
I0531 09:45:02.772533 1 webhook.go:84] Processing admission hook for pod minio/, UID: 586170e7-8fd5-4050-8967-944b1ac4a674 - Allowing admission for pod: no resource found
I0531 09:45:03.728881 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:03.729449 1 webhook.go:63] Processing admission hook for pod volc-content-bigdata/, UID: e11888b8-5b9f-47c4-b10f-5eae51b5a092
I0531 09:45:03.729465 1 webhook.go:84] Processing admission hook for pod volc-content-bigdata/, UID: e11888b8-5b9f-47c4-b10f-5eae51b5a092 - Allowing admission for pod: no resource found
W0531 09:45:12.202457 1 scheduler.go:325] get node 192.168.0.32 device error, node 192.168.0.32 not found
W0531 09:45:12.202482 1 scheduler.go:325] get node 192.168.0.238 device error, node 192.168.0.238 not found
W0531 09:45:12.202486 1 scheduler.go:325] get node 192.168.0.237 device error, node 192.168.0.237 not found
I0531 09:45:14.218465 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:14.218859 1 webhook.go:63] Processing admission hook for pod datafinder/, UID: f346beef-d152-4d32-8f27-4270bd1dc722
I0531 09:45:14.218874 1 webhook.go:84] Processing admission hook for pod datafinder/, UID: f346beef-d152-4d32-8f27-4270bd1dc722 - Allowing admission for pod: no resource found
I0531 09:45:14.234467 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:14.234804 1 webhook.go:63] Processing admission hook for pod datafinder/, UID: 443242f6-5a7c-49a6-9743-6d22c2595ada
I0531 09:45:14.234820 1 webhook.go:84] Processing admission hook for pod datafinder/, UID: 443242f6-5a7c-49a6-9743-6d22c2595ada - Allowing admission for pod: no resource found
I0531 09:45:14.372544 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:14.372916 1 webhook.go:63] Processing admission hook for pod datafinder/, UID: 0651ce95-f174-4461-ac10-4ef7ec39363b
I0531 09:45:14.372932 1 webhook.go:84] Processing admission hook for pod datafinder/, UID: 0651ce95-f174-4461-ac10-4ef7ec39363b - Allowing admission for pod: no resource found
I0531 09:45:14.722750 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:14.723151 1 webhook.go:63] Processing admission hook for pod datafinder/, UID: 977422db-afec-49b3-81b8-0435048c5b0f
I0531 09:45:14.723165 1 webhook.go:84] Processing admission hook for pod datafinder/, UID: 977422db-afec-49b3-81b8-0435048c5b0f - Allowing admission for pod: no resource found
W0531 09:45:27.202206 1 scheduler.go:325] get node 192.168.0.238 device error, node 192.168.0.238 not found
W0531 09:45:27.202223 1 scheduler.go:325] get node 192.168.0.237 device error, node 192.168.0.237 not found
W0531 09:45:27.202228 1 scheduler.go:325] get node 192.168.0.32 device error, node 192.168.0.32 not found
I0531 09:45:27.759291 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:27.759805 1 webhook.go:63] Processing admission hook for pod volc-content-mng/, UID: 21941509-f7f1-4f86-90c8-6545d6af79f0
W0531 09:45:27.759818 1 webhook.go:69] Processing admission hook for pod volc-content-mng/, UID: 21941509-f7f1-4f86-90c8-6545d6af79f0 - Denying admission as container pdf-model-deploy-main is privileged
I0531 09:45:27.759823 1 webhook.go:84] Processing admission hook for pod volc-content-mng/, UID: 21941509-f7f1-4f86-90c8-6545d6af79f0 - Allowing admission for pod: no resource found
I0531 09:45:27.885142 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:27.885640 1 webhook.go:63] Processing admission hook for pod volc-content-mng/, UID: 9081502d-73ca-4b9e-bd35-20f60632fe55
W0531 09:45:27.885652 1 webhook.go:69] Processing admission hook for pod volc-content-mng/, UID: 9081502d-73ca-4b9e-bd35-20f60632fe55 - Denying admission as container pdf-model-deploy-main is privileged
I0531 09:45:27.885656 1 webhook.go:84] Processing admission hook for pod volc-content-mng/, UID: 9081502d-73ca-4b9e-bd35-20f60632fe55 - Allowing admission for pod: no resource found
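For the volc-content-mng pod, the scheduler log shows a "Denying admission as container pdf-model-deploy-main is privileged" warning followed immediately by "Allowing admission for pod: no resource found". One reading that fits both entries is that the webhook skips privileged containers when looking for GPU requests, so the pod passes through unmodified and is placed by default-scheduler, as the pod events show. The sketch below models only that decision. The `Container` and `Decision` types and the `admit` function are hypothetical, and this is a simplified model, not HAMi's webhook code.

```go
package main

import "fmt"

// Container holds only the fields this model needs.
type Container struct {
	Name       string
	Privileged bool
	GPUs       int // value of the nvidia.com/gpu limit
}

// Decision records whether the pod would be routed to the GPU-aware scheduler.
type Decision struct {
	HandledByHAMi bool
	Notes         []string
}

// admit skips privileged containers and looks for a GPU request in the rest.
func admit(containers []Container) Decision {
	d := Decision{}
	for _, c := range containers {
		if c.Privileged {
			// A privileged container's GPU request is ignored in this model.
			d.Notes = append(d.Notes, "container "+c.Name+" is privileged")
			continue
		}
		if c.GPUs > 0 {
			d.HandledByHAMi = true
		}
	}
	if !d.HandledByHAMi {
		// The pod is admitted unchanged and left to the default scheduler.
		d.Notes = append(d.Notes, "no resource found")
	}
	return d
}

func main() {
	pod := []Container{{Name: "pdf-model-deploy-main", Privileged: true, GPUs: 1}}
	d := admit(pod)
	fmt.Println("handled by HAMi:", d.HandledByHAMi)
	for _, n := range d.Notes {
		fmt.Println("note:", n)
	}
}
```

If this reading is right, the device plugin later receives an Allocate call for a pod that the HAMi scheduler never bound, which would be consistent with the "no binding pod found on node" error.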
==============NVSMI LOG==============
Timestamp : Fri May 31 17:54:34 2024 Driver Version : 470.57.02 CUDA Version : 11.4
Attached GPUs : 2 GPU 00000000:65:01.0 Product Name : NVIDIA A30 Product Brand : NVIDIA Display Mode : Enabled Display Active : Disabled Persistence Mode : Disabled MIG Mode Current : Disabled Pending : Disabled Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : 1324021108933 GPU UUID : GPU-14637886-1d37-61c4-a71e-fb5ca495ef65 Minor Number : 1 VBIOS Version : 92.00.66.00.03 MultiGPU Board : No Board ID : 0x6501 GPU Part Number : 900-21001-0040-000 Module ID : 0 Inforom Version Image Version : 1001.0205.00.02 OEM Object : 2.0 ECC Object : 6.16 Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GSP Firmware Version : N/A GPU Virtualization Mode Virtualization Mode : Pass-Through Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x65 Device : 0x01 Domain : 0x0000 Device Id : 0x20B710DE Bus Id : 00000000:65:01.0 Sub System Id : 0x153210DE GPU Link Info PCIe Generation Max : 4 Current : 4 Link Width Max : 16x Current : 16x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : N/A Performance State : P0 Clocks Throttle Reasons Idle : Active Applications Clocks Setting : Not Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : Not Active HW Power Brake Slowdown : Not Active Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 24258 MiB Used : 3 MiB Free : 24255 MiB BAR1 Memory Usage Total : 32768 MiB Used : 2 MiB Free : 32766 MiB Compute Mode : Default Utilization Gpu : 0 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : Disabled Pending : Disabled ECC Errors Volatile SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A 
DRAM Uncorrectable : N/A Aggregate SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Remapped Rows Correctable Error : 1 Uncorrectable Error : 0 Pending : No Remapping Failure Occurred : No Bank Remap Availability Histogram Max : 383 bank(s) High : 1 bank(s) Partial : 0 bank(s) Low : 0 bank(s) None : 0 bank(s) Temperature GPU Current Temp : 27 C GPU Shutdown Temp : 100 C GPU Slowdown Temp : 97 C GPU Max Operating Temp : 90 C GPU Target Temperature : N/A Memory Current Temp : 28 C Memory Max Operating Temp : 95 C Power Readings Power Management : Supported Power Draw : 26.95 W Power Limit : 165.00 W Default Power Limit : 165.00 W Enforced Power Limit : 165.00 W Min Power Limit : 100.00 W Max Power Limit : 165.00 W Clocks Graphics : 210 MHz SM : 210 MHz Memory : 1215 MHz Video : 585 MHz Applications Clocks Graphics : 930 MHz Memory : 1215 MHz Default Applications Clocks Graphics : 930 MHz Memory : 1215 MHz Max Clocks Graphics : 1440 MHz SM : 1440 MHz Memory : 1215 MHz Video : 1305 MHz Max Customer Boost Clocks Graphics : 1440 MHz Clock Policy Auto Boost : N/A Auto Boost Default : N/A Voltage Graphics : 675.000 mV Processes : None
GPU 00000000:67:01.0 Product Name : NVIDIA A30 Product Brand : NVIDIA Display Mode : Enabled Display Active : Disabled Persistence Mode : Disabled MIG Mode Current : Disabled Pending : Disabled Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : 1324021109136 GPU UUID : GPU-4edca201-9195-cf92-98c7-e7c2d627742b Minor Number : 0 VBIOS Version : 92.00.66.00.03 MultiGPU Board : No Board ID : 0x6701 GPU Part Number : 900-21001-0040-000 Module ID : 0 Inforom Version Image Version : 1001.0205.00.02 OEM Object : 2.0 ECC Object : 6.16 Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GSP Firmware Version : N/A GPU Virtualization Mode Virtualization Mode : Pass-Through Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x67 Device : 0x01 Domain : 0x0000 Device Id : 0x20B710DE Bus Id : 00000000:67:01.0 Sub System Id : 0x153210DE GPU Link Info PCIe Generation Max : 4 Current : 4 Link Width Max : 16x Current : 16x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : N/A Performance State : P0 Clocks Throttle Reasons Idle : Active Applications Clocks Setting : Not Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : Not Active HW Power Brake Slowdown : Not Active Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 24258 MiB Used : 3 MiB Free : 24255 MiB BAR1 Memory Usage Total : 32768 MiB Used : 2 MiB Free : 32766 MiB Compute Mode : Default Utilization Gpu : 0 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : Disabled Pending : Disabled ECC Errors Volatile SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable 
: N/A Aggregate SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Remapped Rows Correctable Error : 0 Uncorrectable Error : 0 Pending : No Remapping Failure Occurred : No Bank Remap Availability Histogram Max : 384 bank(s) High : 0 bank(s) Partial : 0 bank(s) Low : 0 bank(s) None : 0 bank(s) Temperature GPU Current Temp : 29 C GPU Shutdown Temp : 100 C GPU Slowdown Temp : 97 C GPU Max Operating Temp : 90 C GPU Target Temperature : N/A Memory Current Temp : 30 C Memory Max Operating Temp : 95 C Power Readings Power Management : Supported Power Draw : 28.68 W Power Limit : 165.00 W Default Power Limit : 165.00 W Enforced Power Limit : 165.00 W Min Power Limit : 100.00 W Max Power Limit : 165.00 W Clocks Graphics : 210 MHz SM : 210 MHz Memory : 1215 MHz Video : 585 MHz Applications Clocks Graphics : 930 MHz Memory : 1215 MHz Default Applications Clocks Graphics : 930 MHz Memory : 1215 MHz Max Clocks Graphics : 1440 MHz SM : 1440 MHz Memory : 1215 MHz Video : 1305 MHz Max Customer Boost Clocks Graphics : 1440 MHz Clock Policy Auto Boost : N/A Auto Boost Default : N/A Voltage Graphics : 687.500 mV Processes : None
{
  "default-runtime": "nvidia",
  "default-ulimits": {
    "core": {
      "Hard": 0,
      "Name": "core",
      "Soft": 0
    }
  },
  "icc": true,
  "insecure-registries": [
    "dockerhub.vpc.com",
    "dockerhub.vpc.com:5000"
  ],
  "iptables": true,
  "live-restore": true,
  "log-driver": "json-file",
  "log-opts": {
    "max-file": "10",
    "max-size": "50m",
    "mode": "non-blocking"
  },
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "storage-driver": "overlay2"
}
Client: Docker Engine - Community
 Version: 20.10.17
 API version: 1.41
 Go version: go1.17.11
 Git commit: 100c701
 Built: Mon Jun 6 23:05:12 2022
 OS/Arch: linux/amd64
 Context: default
 Experimental: true

Server: Docker Engine - Community
 Engine:
  Version: 20.10.17
  API version: 1.41 (minimum version 1.12)
  Go version: go1.17.11
  Git commit: a89b842
  Built: Mon Jun 6 23:03:33 2022
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.6.6
  GitCommit: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 nvidia:
  Version: 1.1.2
  GitCommit: v1.1.2-0-ga916309
 docker-init:
  Version: 0.19.0
  GitCommit: de40ad0
Linux iv-yd1gn44wzkqc6il2ciqz 3.10.0-1160.102.1.el7.x86_64 #1 SMP Tue Oct 17 15:42:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux