Lakr233 / mobilePillowTalkLite

An iOS & SwiftUI server monitor tool for Linux-based machines using remote proc file system with script execution.
BSD 3-Clause "New" or "Revised" License
508 stars 81 forks

[New feature request / direction inquiry] Add statistics for server GPU information #4

Open coca-huang opened 3 years ago

coca-huang commented 3 years ago

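What problem will this feature solve?

  1. Whether the current neural network training job is running normally.
  2. Whether the cluster's GPUs have spare compute capacity, with a notification when capacity becomes available.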

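Does the feature depend on other modules?

Depending on the implementation, it may be necessary to bring in a package that parses XML into Swift objects.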

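What functionality should the feature provide?

Because I have not yet worked out how the upstream and downstream calls fit together, the following is only a guess made while writing this issue.

  1. Implement an obtainGPUInfo function in the file FuntionSet+SSH.swift. ~I have not figured out what conventions the functions' return values follow here~
  2. Design how this information is presented in the UI.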
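
Item 1 could take roughly the following shape. This is only a sketch under stated assumptions: `GPUInfo` and the `run` closure are hypothetical (the closure stands in for whatever SSH execution helper the project really uses, which this thread does not show), and the CSV query form of nvidia-smi is swapped in here for brevity instead of the XML output proposed in this issue.

```swift
import Foundation

/// Hypothetical value type for illustration; not part of the project.
struct GPUInfo: Equatable {
    let name: String
    let utilizationPercent: Int
    let memoryUsedMiB: Int
    let memoryTotalMiB: Int
}

/// `run` is an assumed stand-in for the project's SSH command executor:
/// it receives a shell command and returns the command's standard output.
func obtainGPUInfo(run: (String) -> String) -> [GPUInfo] {
    let command = "nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total "
        + "--format=csv,noheader,nounits"
    return run(command)
        .split(separator: "\n")
        .compactMap { line -> GPUInfo? in
            let fields = line.split(separator: ",")
                .map { $0.trimmingCharacters(in: .whitespaces) }
            // Rows with non-numeric values (for example "[N/A]") are skipped.
            guard fields.count == 4,
                  let util = Int(fields[1]),
                  let used = Int(fields[2]),
                  let total = Int(fields[3]) else { return nil }
            return GPUInfo(name: fields[0],
                           utilizationPercent: util,
                           memoryUsedMiB: used,
                           memoryTotalMiB: total)
        }
}
```

Passing the executor in as a closure keeps the parsing testable without a live server; the real function would presumably call the project's own SSH session instead.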

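What are the possible ways to implement it?

The GPUs currently used for neural network training are mainly CUDA-based Nvidia cards, so the official nvidia-smi command can be used (it is usually preinstalled on the server).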

> $(which nvidia-smi) -q -x

This command returns the server's GPU information as XML, which can be converted as needed to extract the relevant data.
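
As a rough illustration of that extraction step, the sketch below uses Foundation's `XMLParser` to pull a few leaf values out of each `<gpu>` element. The class and function names are hypothetical, and the element names (`product_name`, `gpu_util`, `memory_util`, `gpu_temp`) are assumed from the converted output posted later in this thread; it is not the project's implementation.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux
#endif

/// Collects the text of a few leaf elements for every <gpu> element.
final class GPUFieldCollector: NSObject, XMLParserDelegate {
    // Assumed set of interesting elements; extend as needed.
    private let wanted: Set<String> = ["product_name", "gpu_util", "memory_util", "gpu_temp"]
    private var path: [String] = []   // stack of currently open elements
    private var buffer = ""           // text of the element being read
    private(set) var gpus: [[String: String]] = []

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String]) {
        path.append(elementName)
        buffer = ""
        if elementName == "gpu" { gpus.append([:]) }
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        buffer += string
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        // Only record wanted leaves that sit somewhere inside a <gpu> element.
        if wanted.contains(elementName), path.contains("gpu"), !gpus.isEmpty {
            gpus[gpus.count - 1][elementName] =
                buffer.trimmingCharacters(in: .whitespacesAndNewlines)
        }
        path.removeLast()
        buffer = ""
    }
}

/// Returns one dictionary per GPU, or an empty array if the XML is malformed.
func parseGPUFields(xml: String) -> [[String: String]] {
    let collector = GPUFieldCollector()
    let parser = XMLParser(data: Data(xml.utf8))
    parser.delegate = collector
    return parser.parse() ? collector.gpus : []
}
```

The values come back as the raw strings nvidia-smi prints (for example with a unit suffix), so any numeric comparison still needs a separate conversion step.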

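What impact will the feature have on existing functionality?

Since I have only understood FuntionSet+SSH.swift and am not yet clear on how it is called upstream, I cannot estimate the impact in detail, which is also why I did not add the feature by submitting a PR.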

Lakr233 commented 3 years ago

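> $(which nvidia-smi) -q -x

Could you paste this thing's output here?

When the program was designed, I never found a suitable way to organize the data, so for display every type of data got its own separate View. As you said, there is no real convention. So if you need this feature, just follow your own design. Roughly, it comes down to a few steps.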

Lakr233 commented 3 years ago

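Also, the reason I never did this is that I don't have a green card (Nvidia), only a red one (AMD), and I probably won't have a blue one (Intel) in the future either 🌚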

coca-huang commented 3 years ago

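> Could you paste this thing's output here?

The XML version is not very convenient to read, so I converted a copy to JSON and attached it below. The information is quite comprehensive; only a small part of it is needed day to day.

Serial numbers and similar identifiers have been redacted.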

XML ```xml Sun Oct 24 18:36:12 2021 470.57.02 11.4 1 Tesla V100-SXM2-32GB Tesla Enabled Disabled Disabled N/A N/A None Disabled 4000 N/A N/A xxx GPU-xxx 7 88.00.80.00.01 No 0xb200 900-2G503-0430-000 2 G503.0203.00.05 1.1 5.0 N/A N/A N/A N/A None N/A N/A B2 00 0000 xxx 00000000:B2:00.0 xxx 3 3 16x 16x N/A N/A 0 0 197000 KB/s 259000 KB/s N/A P0 Not Active Not Active Not Active Not Active Not Active Not Active Not Active Not Active Not Active 32510 MiB 26162 MiB 6348 MiB 32768 MiB 10 MiB 32758 MiB Default 100 % 55 % 0 % 0 % 0 0 0 0 0 0 Enabled Enabled 0 0 0 0 N/A N/A N/A 0 0 0 0 0 N/A N/A 0 0 0 0 0 0 N/A N/A N/A 0 0 0 0 0 N/A N/A 0 0 0 0 No No N/A 73 C 90 C 87 C 83 C N/A 68 C 85 C N/A N/A P0 Supported 279.52 W 300.00 W 300.00 W 300.00 W 150.00 W 300.00 W 1530 MHz 1530 MHz 877 MHz 1372 MHz 1290 MHz 877 MHz 1290 MHz 877 MHz 1530 MHz 1530 MHz 877 MHz 1372 MHz 1530 MHz N/A N/A N/A 877 MHz 1530 MHz 1522 MHz 1515 MHz 1507 MHz 1500 MHz 1492 MHz 1485 MHz 1477 MHz 1470 MHz 1462 MHz 1455 MHz 1447 MHz 1440 MHz 1432 MHz 1425 MHz 1417 MHz 1410 MHz 1402 MHz 1395 MHz 1387 MHz 1380 MHz 1372 MHz 1365 MHz 1357 MHz 1350 MHz 1342 MHz 1335 MHz 1327 MHz 1320 MHz 1312 MHz 1305 MHz 1297 MHz 1290 MHz 1282 MHz 1275 MHz 1267 MHz 1260 MHz 1252 MHz 1245 MHz 1237 MHz 1230 MHz 1222 MHz 1215 MHz 1207 MHz 1200 MHz 1192 MHz 1185 MHz 1177 MHz 1170 MHz 1162 MHz 1155 MHz 1147 MHz 1140 MHz 1132 MHz 1125 MHz 1117 MHz 1110 MHz 1102 MHz 1095 MHz 1087 MHz 1080 MHz 1072 MHz 1065 MHz 1057 MHz 1050 MHz 1042 MHz 1035 MHz 1027 MHz 1020 MHz 1012 MHz 1005 MHz 997 MHz 990 MHz 982 MHz 975 MHz 967 MHz 960 MHz 952 MHz 945 MHz 937 MHz 930 MHz 922 MHz 915 MHz 907 MHz 900 MHz 892 MHz 885 MHz 877 MHz 870 MHz 862 MHz 855 MHz 847 MHz 840 MHz 832 MHz 825 MHz 817 MHz 810 MHz 802 MHz 795 MHz 787 MHz 780 MHz 772 MHz 765 MHz 757 MHz 750 MHz 742 MHz 735 MHz 727 MHz 720 MHz 712 MHz 705 MHz 697 MHz 690 MHz 682 MHz 675 MHz 667 MHz 660 MHz 652 MHz 645 MHz 637 MHz 630 MHz 622 MHz 615 MHz 607 MHz 600 MHz 592 MHz 585 MHz 577 MHz 570 MHz 
562 MHz 555 MHz 547 MHz 540 MHz 532 MHz 525 MHz 517 MHz 510 MHz 502 MHz 495 MHz 487 MHz 480 MHz 472 MHz 465 MHz 457 MHz 450 MHz 442 MHz 435 MHz 427 MHz 420 MHz 412 MHz 405 MHz 397 MHz 390 MHz 382 MHz 375 MHz 367 MHz 360 MHz 352 MHz 345 MHz 337 MHz 330 MHz 322 MHz 315 MHz 307 MHz 300 MHz 292 MHz 285 MHz 277 MHz 270 MHz 262 MHz 255 MHz 247 MHz 240 MHz 232 MHz 225 MHz 217 MHz 210 MHz 202 MHz 195 MHz 187 MHz 180 MHz 172 MHz 165 MHz 157 MHz 150 MHz 142 MHz 135 MHz ```
JSON ```json { "nvidia_smi_log": { "timestamp": "Sun Oct 24 18:36:12 2021", "driver_version": "470.57.02", "cuda_version": "11.4", "attached_gpus": "1", "gpu": { "-id": "00000000:B2:00.0", "product_name": "Tesla V100-SXM2-32GB", "product_brand": "Tesla", "display_mode": "Enabled", "display_active": "Disabled", "persistence_mode": "Disabled", "mig_mode": { "current_mig": "N/A", "pending_mig": "N/A" }, "mig_devices": "\n\t\t\tNone\n\t\t", "accounting_mode": "Disabled", "accounting_mode_buffer_size": "4000", "driver_model": { "current_dm": "N/A", "pending_dm": "N/A" }, "serial": "xxx", "uuid": "GPU-xxx", "minor_number": "7", "vbios_version": "88.00.80.00.01", "multigpu_board": "No", "board_id": "0xb200", "gpu_part_number": "900-2G503-0430-000", "gpu_module_id": "2", "inforom_version": { "img_version": "G503.0203.00.05", "oem_object": "1.1", "ecc_object": "5.0", "pwr_object": "N/A" }, "gpu_operation_mode": { "current_gom": "N/A", "pending_gom": "N/A" }, "gsp_firmware_version": "N/A", "gpu_virtualization_mode": { "virtualization_mode": "None", "host_vgpu_mode": "N/A" }, "ibmnpu": { "relaxed_ordering_mode": "N/A" }, "pci": { "pci_bus": "B2", "pci_device": "00", "pci_domain": "0000", "pci_device_id": "xxx", "pci_bus_id": "00000000:B2:00.0", "pci_sub_system_id": "xxx", "pci_gpu_link_info": { "pcie_gen": { "max_link_gen": "3", "current_link_gen": "3" }, "link_widths": { "max_link_width": "16x", "current_link_width": "16x" } }, "pci_bridge_chip": { "bridge_chip_type": "N/A", "bridge_chip_fw": "N/A" }, "replay_counter": "0", "replay_rollover_counter": "0", "tx_util": "197000 KB/s", "rx_util": "259000 KB/s" }, "fan_speed": "N/A", "performance_state": "P0", "clocks_throttle_reasons": { "clocks_throttle_reason_gpu_idle": "Not Active", "clocks_throttle_reason_applications_clocks_setting": "Not Active", "clocks_throttle_reason_sw_power_cap": "Not Active", "clocks_throttle_reason_hw_slowdown": "Not Active", "clocks_throttle_reason_hw_thermal_slowdown": "Not Active", 
"clocks_throttle_reason_hw_power_brake_slowdown": "Not Active", "clocks_throttle_reason_sync_boost": "Not Active", "clocks_throttle_reason_sw_thermal_slowdown": "Not Active", "clocks_throttle_reason_display_clocks_setting": "Not Active" }, "fb_memory_usage": { "total": "32510 MiB", "used": "26162 MiB", "free": "6348 MiB" }, "bar1_memory_usage": { "total": "32768 MiB", "used": "10 MiB", "free": "32758 MiB" }, "compute_mode": "Default", "utilization": { "gpu_util": "100 %", "memory_util": "55 %", "encoder_util": "0 %", "decoder_util": "0 %" }, "encoder_stats": { "session_count": "0", "average_fps": "0", "average_latency": "0" }, "fbc_stats": { "session_count": "0", "average_fps": "0", "average_latency": "0" }, "ecc_mode": { "current_ecc": "Enabled", "pending_ecc": "Enabled" }, "ecc_errors": { "volatile": { "single_bit": { "device_memory": "0", "register_file": "0", "l1_cache": "0", "l2_cache": "0", "texture_memory": "N/A", "texture_shm": "N/A", "cbu": "N/A", "total": "0" }, "double_bit": { "device_memory": "0", "register_file": "0", "l1_cache": "0", "l2_cache": "0", "texture_memory": "N/A", "texture_shm": "N/A", "cbu": "0", "total": "0" } }, "aggregate": { "single_bit": { "device_memory": "0", "register_file": "0", "l1_cache": "0", "l2_cache": "0", "texture_memory": "N/A", "texture_shm": "N/A", "cbu": "N/A", "total": "0" }, "double_bit": { "device_memory": "0", "register_file": "0", "l1_cache": "0", "l2_cache": "0", "texture_memory": "N/A", "texture_shm": "N/A", "cbu": "0", "total": "0" } } }, "retired_pages": { "multiple_single_bit_retirement": { "retired_count": "0", "retired_pagelist": "\n\t\t\t\t" }, "double_bit_retirement": { "retired_count": "0", "retired_pagelist": "\n\t\t\t\t" }, "pending_blacklist": "No", "pending_retirement": "No" }, "remapped_rows": "N/A", "temperature": { "gpu_temp": "73 C", "gpu_temp_max_threshold": "90 C", "gpu_temp_slow_threshold": "87 C", "gpu_temp_max_gpu_threshold": "83 C", "gpu_target_temperature": "N/A", "memory_temp": "68 C", 
"gpu_temp_max_mem_threshold": "85 C" }, "supported_gpu_target_temp": { "gpu_target_temp_min": "N/A", "gpu_target_temp_max": "N/A" }, "power_readings": { "power_state": "P0", "power_management": "Supported", "power_draw": "279.52 W", "power_limit": "300.00 W", "default_power_limit": "300.00 W", "enforced_power_limit": "300.00 W", "min_power_limit": "150.00 W", "max_power_limit": "300.00 W" }, "clocks": { "graphics_clock": "1530 MHz", "sm_clock": "1530 MHz", "mem_clock": "877 MHz", "video_clock": "1372 MHz" }, "applications_clocks": { "graphics_clock": "1290 MHz", "mem_clock": "877 MHz" }, "default_applications_clocks": { "graphics_clock": "1290 MHz", "mem_clock": "877 MHz" }, "max_clocks": { "graphics_clock": "1530 MHz", "sm_clock": "1530 MHz", "mem_clock": "877 MHz", "video_clock": "1372 MHz" }, "max_customer_boost_clocks": { "graphics_clock": "1530 MHz" }, "clock_policy": { "auto_boost": "N/A", "auto_boost_default": "N/A" }, "voltage": { "graphics_volt": "N/A" }, "supported_clocks": { "supported_mem_clock": { "value": "877 MHz", "supported_graphics_clock": [ "1530 MHz", "1522 MHz", "1515 MHz", "1507 MHz", "1500 MHz", "1492 MHz", "1485 MHz", "1477 MHz", "1470 MHz", "1462 MHz", "1455 MHz", "1447 MHz", "1440 MHz", "1432 MHz", "1425 MHz", "1417 MHz", "1410 MHz", "1402 MHz", "1395 MHz", "1387 MHz", "1380 MHz", "1372 MHz", "1365 MHz", "1357 MHz", "1350 MHz", "1342 MHz", "1335 MHz", "1327 MHz", "1320 MHz", "1312 MHz", "1305 MHz", "1297 MHz", "1290 MHz", "1282 MHz", "1275 MHz", "1267 MHz", "1260 MHz", "1252 MHz", "1245 MHz", "1237 MHz", "1230 MHz", "1222 MHz", "1215 MHz", "1207 MHz", "1200 MHz", "1192 MHz", "1185 MHz", "1177 MHz", "1170 MHz", "1162 MHz", "1155 MHz", "1147 MHz", "1140 MHz", "1132 MHz", "1125 MHz", "1117 MHz", "1110 MHz", "1102 MHz", "1095 MHz", "1087 MHz", "1080 MHz", "1072 MHz", "1065 MHz", "1057 MHz", "1050 MHz", "1042 MHz", "1035 MHz", "1027 MHz", "1020 MHz", "1012 MHz", "1005 MHz", "997 MHz", "990 MHz", "982 MHz", "975 MHz", "967 MHz", "960 MHz", "952 
MHz", "945 MHz", "937 MHz", "930 MHz", "922 MHz", "915 MHz", "907 MHz", "900 MHz", "892 MHz", "885 MHz", "877 MHz", "870 MHz", "862 MHz", "855 MHz", "847 MHz", "840 MHz", "832 MHz", "825 MHz", "817 MHz", "810 MHz", "802 MHz", "795 MHz", "787 MHz", "780 MHz", "772 MHz", "765 MHz", "757 MHz", "750 MHz", "742 MHz", "735 MHz", "727 MHz", "720 MHz", "712 MHz", "705 MHz", "697 MHz", "690 MHz", "682 MHz", "675 MHz", "667 MHz", "660 MHz", "652 MHz", "645 MHz", "637 MHz", "630 MHz", "622 MHz", "615 MHz", "607 MHz", "600 MHz", "592 MHz", "585 MHz", "577 MHz", "570 MHz", "562 MHz", "555 MHz", "547 MHz", "540 MHz", "532 MHz", "525 MHz", "517 MHz", "510 MHz", "502 MHz", "495 MHz", "487 MHz", "480 MHz", "472 MHz", "465 MHz", "457 MHz", "450 MHz", "442 MHz", "435 MHz", "427 MHz", "420 MHz", "412 MHz", "405 MHz", "397 MHz", "390 MHz", "382 MHz", "375 MHz", "367 MHz", "360 MHz", "352 MHz", "345 MHz", "337 MHz", "330 MHz", "322 MHz", "315 MHz", "307 MHz", "300 MHz", "292 MHz", "285 MHz", "277 MHz", "270 MHz", "262 MHz", "255 MHz", "247 MHz", "240 MHz", "232 MHz", "225 MHz", "217 MHz", "210 MHz", "202 MHz", "195 MHz", "187 MHz", "180 MHz", "172 MHz", "165 MHz", "157 MHz", "150 MHz", "142 MHz", "135 MHz" ] } }, "processes": "\n\t\t", "accounted_processes": "\n\t\t" } } } ```
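
For reference, the handful of everyday fields in the JSON above could be decoded along these lines. This is a sketch under assumptions: the type and function names are hypothetical, `gpu` is treated as a single object as in this sample (a multi-GPU conversion may well produce an array instead), and the idle thresholds are made-up defaults.

```swift
import Foundation

// Only the fields this sketch needs; every other key in the JSON is ignored.
struct SMIReport: Decodable {
    struct Memory: Decodable { let total, used, free: String }
    struct Utilization: Decodable { let gpuUtil, memoryUtil: String }
    struct GPU: Decodable {
        let productName: String
        let fbMemoryUsage: Memory
        let utilization: Utilization
    }
    struct Log: Decodable {
        let driverVersion: String
        let gpu: GPU  // single object, matching the sample above
    }
    let nvidiaSmiLog: Log
}

func decodeReport(_ json: String) throws -> SMIReport {
    let decoder = JSONDecoder()
    // Maps keys such as "fb_memory_usage" onto fbMemoryUsage.
    decoder.keyDecodingStrategy = .convertFromSnakeCase
    return try decoder.decode(SMIReport.self, from: Data(json.utf8))
}

/// Reads the number in front of the unit; returns nil for values like "N/A".
func leadingNumber(_ text: String) -> Double? {
    text.split(separator: " ").first.flatMap { Double($0) }
}

/// Hypothetical idle check: low utilization and enough free framebuffer memory.
func looksIdle(_ gpu: SMIReport.GPU,
               maxUtilPercent: Double = 10,
               minFreeMiB: Double = 16_000) -> Bool {
    guard let util = leadingNumber(gpu.utilization.gpuUtil),
          let free = leadingNumber(gpu.fbMemoryUsage.free) else { return false }
    return util <= maxUtilPercent && free >= minFreeMiB
}
```

A check like `looksIdle` is one way the "notify when resources are free" goal from the issue could be driven; the thresholds would need tuning per cluster.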

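> Also, the reason I never did this is that I don't have a green card (Nvidia), only a red one (AMD), and I probably won't have a blue one (Intel) in the future either 🌚

Most neural network frameworks (PyTorch, TensorFlow) support AMD's ROCm-based GPUs only in an experimental mode; this may gradually improve. For now, AMD's official rocm-smi --json can be used to get the GPU information in JSON format.

~I'm really not familiar with Intel~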

Lakr233 commented 3 years ago

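OK, I'll give it a try, but most likely there won't be an update any time soon, so you may still have to do it yourself.

For now, the Foundation part won't break as long as you don't write sloppy scripts (I wrote very detailed unit tests, but they contained the test machines' addresses, so I cut them out when open-sourcing .jpg)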

Lakr233 commented 3 years ago
[Screenshot, 2021-11-03 10:55:09 PM]

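Today I fixed the Xcode 13 build problem and merged the upstream SwiftTerm updates. That may introduce new bugs, but in the meantime you can play around with it.

Also, iOS ships with its own XML parser, so it shouldn't be very hard. That said, I haven't looked at this version of the code in a long time myself, so I don't dare touch it casually.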