DeskPi-Team / super6c

Super6c stands for Super 6 CM4 Cluster.
MIT License

NVMe running too hot; internal network switch issues related to SSDs #9

Closed 1201am closed 8 months ago

1201am commented 1 year ago

Is there a list of tested and officially supported M.2 NVMe SSDs that work with the DeskPi Super6C?

My personal experience is with the Kioxia BG4 256GB M.2 NVMe SSD (KBG40ZNV256G). I purchased 6 of these SSDs and noticed that they run very hot even when idle, frequently above 60C (they're 20 degrees cooler when attached to my Dell PC). I then purchased 4 more of the same SSDs, and it's the same story: they run too hot.

Also, with 5 of the 10 Kioxia SSDs I've purchased, one or two CM4 modules lose network connectivity permanently. The LEDs near the modules show that the CM4 is up and running, but it cannot be reached over the network. Once the SSD causing the issue is removed, everything connects just fine. I've also noticed that a "problem" SSD installed in, say, slot #6 causes connectivity loss on CM4 #4, so there is no direct mapping between module number and SSD slot number.

There is a slight possibility that this is caused by the SSDs' firmware, but all 10 SSDs test just fine on a PC. Finding (by trial and error) which SSD causes the network problem and replacing it with a Samsung 960 EVO solves the connectivity problem. The Samsung 960 EVO also runs much cooler than the Kioxia, even though the latter is supposed to be more energy efficient.
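
The trial-and-error search described above amounts to swapping one SSD, then checking which modules still answer on the network. A minimal sketch of that check, assuming Linux `ping` (iputils) is available on the machine running it; the function names and the hostnames in the usage comment are hypothetical, not part of the Super6C tooling:

```python
import subprocess
from typing import Callable, Iterable, List


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Return True if a single ICMP echo to `host` succeeds (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_hosts(
    hosts: Iterable[str], probe: Callable[[str], bool] = ping_once
) -> List[str]:
    """Return the hosts for which `probe` reports no connectivity, in input order."""
    return [host for host in hosts if not probe(host)]


# Hypothetical usage after swapping a single SSD and powering the board back on:
#   print(unreachable_hosts(["cm4-1", "cm4-2", "cm4-3", "cm4-4", "cm4-5", "cm4-6"]))
```

Because the failing module does not necessarily match the slot holding the problem SSD, it probably makes sense to probe all six modules after every swap rather than only the one next to the changed drive.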

I tried 3 different DeskPi Super6C boards (the latest one is the revision with LED lights on the network ports), as I initially thought the board was defective, but the issue with hot Kioxia BG4 drives and disconnected modules reproduces on all of them.

Something specific to the Kioxia BG4 is that it is an HMB (Host Memory Buffer) SSD. I'm not sure whether that affects the temperatures, but it would be great to have a manufacturer-validated list of officially supported SSDs that work with the Super6C.
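
One way to confirm whether a given drive asks for a Host Memory Buffer is to look at the `hmpre` (preferred size) and `hmmin` (minimum size) fields that `sudo nvme id-ctrl /dev/nvme0` prints; the NVMe specification expresses both in 4 KiB units. The sketch below only parses that text. It assumes the fields appear as decimal `name : value` lines, and the sample values are made up for illustration, not measured on a BG4:

```python
import re
from typing import Dict

HMB_UNIT_BYTES = 4096  # HMPRE and HMMIN are counted in 4 KiB units

FIELD_RE = re.compile(r"^\s*(hmpre|hmmin)\s*:\s*(\d+)\s*$")


def parse_hmb_fields(id_ctrl_text: str) -> Dict[str, int]:
    """Extract hmpre/hmmin from `nvme id-ctrl` text and convert them to bytes."""
    fields: Dict[str, int] = {}
    for line in id_ctrl_text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = int(match.group(2)) * HMB_UNIT_BYTES
    return fields


def requests_hmb(fields: Dict[str, int]) -> bool:
    """A drive that reports a non-zero preferred size is asking for an HMB."""
    return fields.get("hmpre", 0) > 0


# Illustrative input only; these numbers are assumptions, not real drive output.
SAMPLE_ID_CTRL = """\
hmpre     : 16384
hmmin     : 2560
"""

fields = parse_hmb_fields(SAMPLE_ID_CTRL)
print(fields, requests_hmb(fields))
```

This only shows what the drive requests. Whether the host kernel actually grants the buffer is a separate question that this sketch does not answer.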

aderesh commented 1 year ago

If it helps, I use 3 MMOMENT NVMe M.2 2280 PCIe Gen 3x4 256GB drives and one MMOMENT NVMe M.2 2280 PCIe Gen 3x4 1TB drive. It's a k3s cluster with Longhorn and a few workloads (media server, Jenkins, etc.), running on a Super6C with LED lights on the network ports plus a 12 cm fan. A Micron 512 GB NVMe drive (I don't know the model) didn't work as a bootable NVMe drive but did work in an SD (bootable) + NVMe configuration. I'm not sure whether something is wrong with that unit or whether it's some sort of hardware incompatibility.

Below is the output of nvme-cli plus the Pi temperatures:

andrii@WINAP65h1JvjOzb:~/ansible$ ansible all -a "sudo nvme smart-log /dev/nvme0"
[WARNING]: Consider using 'become', 'become_method', and 'become_user' rather than running sudo
rp3 | CHANGED | rc=0 >>
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 33 C
available_spare                         : 100%
available_spare_threshold               : 5%
percentage_used                         : 0%
endurance group critical warning summary: 0
data_units_read                         : 18,710
data_units_written                      : 50,817
host_read_commands                      : 276,639
host_write_commands                     : 1,064,477
controller_busy_time                    : 4
power_cycles                            : 4
power_on_hours                          : 77
unsafe_shutdowns                        : 4
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 1           : 46 C
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0
rpm | CHANGED | rc=0 >>
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 34 C
available_spare                         : 100%
available_spare_threshold               : 5%
percentage_used                         : 0%
endurance group critical warning summary: 0
data_units_read                         : 22,871
data_units_written                      : 155,997
host_read_commands                      : 296,659
host_write_commands                     : 2,088,952
controller_busy_time                    : 19
power_cycles                            : 4
power_on_hours                          : 70
unsafe_shutdowns                        : 3
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 1           : 48 C
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0
rp2 | CHANGED | rc=0 >>
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 34 C
available_spare                         : 100%
available_spare_threshold               : 5%
percentage_used                         : 0%
endurance group critical warning summary: 0
data_units_read                         : 18,920
data_units_written                      : 46,548
host_read_commands                      : 281,020
host_write_commands                     : 884,612
controller_busy_time                    : 4
power_cycles                            : 4
power_on_hours                          : 76
unsafe_shutdowns                        : 4
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 1           : 47 C
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0
rp4 | CHANGED | rc=0 >>
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 33 C
available_spare                         : 100%
available_spare_threshold               : 5%
percentage_used                         : 0%
endurance group critical warning summary: 0
data_units_read                         : 18,734
data_units_written                      : 44,975
host_read_commands                      : 272,194
host_write_commands                     : 998,486
controller_busy_time                    : 4
power_cycles                            : 4
power_on_hours                          : 76
unsafe_shutdowns                        : 4
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 1           : 46 C
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0
rp1 | CHANGED | rc=0 >>
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 39 C
available_spare                         : 100%
available_spare_threshold               : 5%
percentage_used                         : 0%
endurance group critical warning summary: 0
data_units_read                         : 19,999
data_units_written                      : 48,876
host_read_commands                      : 274,106
host_write_commands                     : 1,084,030
controller_busy_time                    : 4
power_cycles                            : 9
power_on_hours                          : 77
unsafe_shutdowns                        : 9
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 1           : 53 C
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0
andrii@WINAP65h1JvjOzb:~/ansible$ ansible all -a "vcgencmd measure_temp"
rp3 | CHANGED | rc=0 >>
temp=43.3'C
rp1 | CHANGED | rc=0 >>
temp=45.7'C
rp4 | CHANGED | rc=0 >>
temp=44.3'C
rp2 | CHANGED | rc=0 >>
temp=41.8'C
rpm | CHANGED | rc=0 >>
temp=44.8'C
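
To compare drives across nodes without reading every SMART field, the ansible output above can be reduced to the temperature lines per host. This is a rough sketch that assumes the exact text layout shown in this thread (`host | CHANGED | rc=0 >>` headers and `... : NN C` temperature lines); other nvme-cli versions may print temperatures differently. The 60 C default is the figure mentioned elsewhere in this issue, not a vendor limit, and the function names are hypothetical:

```python
import re
from typing import Dict, List

HOST_RE = re.compile(r"^(\S+) \| \w+ \| rc=\d+ >>$")
TEMP_RE = re.compile(r"^(temperature|Temperature Sensor \d+)\s*:\s*(\d+) C$")


def parse_smart_temps(text: str) -> Dict[str, Dict[str, int]]:
    """Map each ansible host to the temperature readings in its smart-log block."""
    temps: Dict[str, Dict[str, int]] = {}
    host = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HOST_RE.match(line)
        if header:
            host = header.group(1)
            temps[host] = {}
            continue
        reading = TEMP_RE.match(line)
        if reading and host is not None:
            temps[host][reading.group(1)] = int(reading.group(2))
    return temps


def hosts_above(temps: Dict[str, Dict[str, int]], limit_c: int = 60) -> List[str]:
    """Return hosts with any reading above `limit_c`, sorted by name."""
    return sorted(h for h, r in temps.items() if any(v > limit_c for v in r.values()))


# Two of the blocks from the output above, trimmed to the relevant lines.
SAMPLE = """\
rp3 | CHANGED | rc=0 >>
temperature                             : 33 C
Temperature Sensor 1           : 46 C
rp1 | CHANGED | rc=0 >>
temperature                             : 39 C
Temperature Sensor 1           : 53 C
"""

temps = parse_smart_temps(SAMPLE)
print(temps)
print(hosts_above(temps))
```

With the readings posted here, no host crosses the 60 C mark; rp1 is the warmest on both the composite temperature and sensor 1.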

avluis commented 1 year ago

@aderesh can you share which case you use and how your CM4s are ordered relative to their hostnames? Is it the "official" case from DeskPi or your own?

I'd like to compare your module temperatures to mine. I run my equipment slightly below room temperature on a shelf in a 42U rack.

Mine are mapped [a:f] == slot [1:6]

root@clpi-cm4a ~# ansible -i ~/ansible/hosts all -b -a "vcgencmd measure_temp"
clpi-cm4a | CHANGED | rc=0 >>
temp=47.2'C
clpi-cm4b | CHANGED | rc=0 >>
temp=38.4'C
clpi-cm4c | CHANGED | rc=0 >>
temp=36.5'C
clpi-cm4d | CHANGED | rc=0 >>
temp=37.0'C
clpi-cm4e | CHANGED | rc=0 >>
temp=38.9'C
clpi-cm4f | CHANGED | rc=0 >>
temp=41.8'C

As you can see, one of these runs a tad hotter. It happens to be the only module not bought in the same batch (it has 4GB RAM vs 8GB and Wi-Fi vs no Wi-Fi), which may contribute to this, but I'm considering making a small paper/plastic duct to help funnel more air to it.
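
Spotting the warm module can also be done mechanically. The sketch below parses `vcgencmd measure_temp` output as returned by ansible and lists hosts that sit more than a chosen margin above the median. The 5 C margin and the function names are arbitrary assumptions for illustration:

```python
import re
from statistics import median
from typing import Dict, List

HOST_RE = re.compile(r"^(\S+) \| \w+ \| rc=\d+ >>$")
TEMP_RE = re.compile(r"^temp=(\d+(?:\.\d+)?)'C$")


def parse_soc_temps(text: str) -> Dict[str, float]:
    """Map each ansible host to its `vcgencmd measure_temp` reading in Celsius."""
    temps: Dict[str, float] = {}
    host = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HOST_RE.match(line)
        if header:
            host = header.group(1)
            continue
        reading = TEMP_RE.match(line)
        if reading and host is not None:
            temps[host] = float(reading.group(1))
    return temps


def hotter_than_median(temps: Dict[str, float], margin_c: float = 5.0) -> List[str]:
    """Return hosts whose reading exceeds the median by more than `margin_c`."""
    if not temps:
        return []
    mid = median(temps.values())
    return sorted(h for h, t in temps.items() if t > mid + margin_c)


# The readings posted above, slots 1 to 6.
SAMPLE = """\
clpi-cm4a | CHANGED | rc=0 >>
temp=47.2'C
clpi-cm4b | CHANGED | rc=0 >>
temp=38.4'C
clpi-cm4c | CHANGED | rc=0 >>
temp=36.5'C
clpi-cm4d | CHANGED | rc=0 >>
temp=37.0'C
clpi-cm4e | CHANGED | rc=0 >>
temp=38.9'C
clpi-cm4f | CHANGED | rc=0 >>
temp=41.8'C
"""

print(hotter_than_median(parse_soc_temps(SAMPLE)))
```

Using the median rather than the mean keeps a single hot module from shifting the baseline it is compared against.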

yoyojacky commented 8 months ago

We provide a metal casing and a fan, which effectively reduces the overall temperature of Super6c. The SSDs we have tested include Kingston A400, Kingspec, and Fanxiang. We hope this information is helpful to you. The temperatures we tested did not exceed 60 degrees Celsius. The testing was conducted in a room with a temperature of 25 degrees Celsius over a period of 2 hours.

yoyojacky commented 8 months ago

Certainly, you can try replacing the SSD with a different brand or consider adding a heat-dissipating casing. Either may further lower the temperatures of the Super6c and improve the overall performance and reliability of the system.