Closed dglo closed 3 years ago
[tbendfelt on 2016-10-19 19:43:40] I was able to reproduce the issue. When a DOM is powered off during data collection, there is a chance that a call into the device file is orphaned. The call can be a read, write, or close (perhaps open as well?). When this happens, the system becomes unstable: mundane processes start spiking CPU utilization, a core can lock up, and eventually the whole box locks up.
Example:
On scube, with omicron running with 64 DOMs. Issued off "0 0 0 1 0 2 0 3".
A thread servicing Channel 02A is orphaned in the close call on the device file:
"02A-timer" #22 prio=5 os_prio=0 tid=0x00007fccfc2d6800 nid=0xfe2 runnable [0x00007fcce8ddb000]
   java.lang.Thread.State: RUNNABLE
        at java.io.RandomAccessFile.close0(Native Method)
        at java.io.RandomAccessFile.access$000(RandomAccessFile.java:59)
        at java.io.RandomAccessFile$1.close(RandomAccessFile.java:619)
        at java.io.FileDescriptor.closeAll(FileDescriptor.java:212)
Eventually CPU0 locks up and the box becomes non-responsive:
top - 15:46:04 up 17:09, 3 users, load average: 11.03, 7.90, 5.20
Tasks: 218 total, 2 running, 216 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi,100.0%si, 0.0%st
Cpu1 : 73.3%us, 25.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 1.0%si, 0.0%st
Mem: 4047984k total, 3211076k used, 836908k free, 26344k buffers
Swap: 32767996k total, 3796k used, 32764200k free, 1755952k cached
[jkelley on 2017-01-31 19:26:38] This was apparently a design feature: a blocking r/w stalls if the device is not connected, and a non-blocking r/w returns -EAGAIN. However, this is not the behavior we want.
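For reference, the non-blocking half of that behavior can be mimicked in user space. The following is a minimal Python sketch, assuming an ordinary pipe as a stand-in for the DOM device file (the real device path and driver are not used here); `try_read` and `demo` are hypothetical helper names. It shows only how a caller sees EAGAIN on a non-blocking read when no data is available, not how the driver itself behaves when a DOM is powered off.

```python
import errno
import os


def try_read(fd, nbytes):
    """Return the bytes read, or None if the read would block (EAGAIN)."""
    try:
        return os.read(fd, nbytes)
    except BlockingIOError as exc:
        # Python raises BlockingIOError when the kernel returns
        # EAGAIN/EWOULDBLOCK on a non-blocking descriptor.
        if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return None
        raise


def demo():
    # Assumption: a pipe stands in for the device file.
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)
    try:
        first = try_read(read_fd, 16)   # nothing written yet
        os.write(write_fd, b"hit")
        second = try_read(read_fd, 16)  # data is now available
        return first, second
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print(demo())
```

A blocking descriptor in the same situation would simply not return from `os.read`, which is the stall described above.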
[jkelley on 2017-02-09 16:16:39] Fixed in V02-15-01. Note that a write that is already in progress or blocked will not bail; only new write() calls will. However, the former case should be exceptionally rare.
No associated GitHub commit