RobinTMiller / dt

dt is a data test tool for testing disks, tapes, file systems, serial/parallel, and other devices.
Other
38 stars 15 forks source link

Totals Stats fails to print when NFS file handle is hung #9

Closed jhollowe closed 1 year ago

jhollowe commented 1 year ago

When running dt against an NFS mount (see the full CLI used below), if the NFS mount goes offline/unresponsive during the test, it causes the thread running the test to hang and eventually be killed by a timeout. This causes the Total Stats information that should have been printed by that thread to not be output as the thread has been killed.

Is there a way to still allow the thread to print out the stats? Or hand off the needed information to another thread so it can print off the stats?

dt version: 25.03 OS: linux (RHEL)

CLI line: dt enable=noprog,microdelay noprogt=1 iobehavior=dt flags=direct bs=random pattern=incr of=/t/vol_nfs_1/dt_1689757132 runtime=300 rdelay=40000 wdelay=40000 limit=4636024847 dispose=delete iotype=random

Log demonstrating error:

...
155 (dt j:1 t:1): Read Statistics:
156 (dt j:1 t:1):            Current random seed: 0x649bff3b1eac1e97
157 (dt j:1 t:1):       Job Information Reported: Job 1, Thread 1
158 (dt j:1 t:1):      Data pattern read/written: 0x03020100 (incrementing 0-255)
159 (dt j:1 t:1):        Total records processed: 340 with min=512, max=1048576, incr=variable
160 (dt j:1 t:1):        Total bytes transferred: 168093071 (164153.390 Kbytes, 160.306 Mbytes)
161 (dt j:1 t:1):         Average transfer rates: 12345805 bytes/sec, 12056.450 Kbytes/sec, 11.774 Mbytes/sec
162 (dt j:1 t:1):        Number I/O's per second: 24.972
163 (dt j:1 t:1):         Number seconds per I/O: 0.0400 (40.05ms)
164 (dt j:1 t:1):         Total passes completed: 5
165 (dt j:1 t:1):          Total errors detected: 0/1
166 (dt j:1 t:1):             Total elapsed time: 00m13.61s
167 (dt j:1 t:1):              Total system time: 00m00.05s
168 (dt j:1 t:1):                Total user time: 00m00.28s
169 (dt j:1 t:1):                  Starting time: Wed Jun 28 04:48:12 2023
170 (dt j:1 t:1):                    Ending time: Wed Jun 28 04:50:31 2023
171 (dt j:1 t:1): 
172 (dt j:1 t:1): Write Statistics:
173 (dt j:1 t:1):            Current random seed: 0x649bff7f1eab09ba
174 (dt j:1 t:1):       Job Information Reported: Job 1, Thread 1
175 (dt j:1 t:1):      Data pattern read/written: 0x03020100 (incrementing 0-255)
176 (dt j:1 t:1):        Total records processed: 326 with min=512, max=1048576, incr=variable
177 (dt j:1 t:1):        Total bytes transferred: 168093071 (164153.390 Kbytes, 160.306 Mbytes)
178 (dt j:1 t:1):         Average transfer rates: 11717596 bytes/sec, 11442.965 Kbytes/sec, 11.175 Mbytes/sec
179 (dt j:1 t:1):        Number I/O's per second: 22.725
180 (dt j:1 t:1):         Number seconds per I/O: 0.0440 (44.00ms)
181 (dt j:1 t:1):         Total passes completed: 5
182 (dt j:1 t:1):          Total errors detected: 0/1
183 (dt j:1 t:1):             Total elapsed time: 00m14.34s
184 (dt j:1 t:1):              Total system time: 00m00.05s
185 (dt j:1 t:1):                Total user time: 00m00.16s
186 (dt j:1 t:1):                  Starting time: Wed Jun 28 04:48:12 2023
187 (dt j:1 t:1):                    Ending time: Wed Jun 28 04:50:45 2023
188 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 2 seconds! Since: Wed Jun 28 04:50:58 2023
189 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 3 seconds! Since: Wed Jun 28 04:50:58 2023
190 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 4 seconds! Since: Wed Jun 28 04:50:58 2023
191 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 5 seconds! Since: Wed Jun 28 04:50:58 2023
192 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 6 seconds! Since: Wed Jun 28 04:50:58 2023
193 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 7 seconds! Since: Wed Jun 28 04:50:58 2023
194 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 8 seconds! Since: Wed Jun 28 04:50:58 2023
195 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 9 seconds! Since: Wed Jun 28 04:50:58 2023
196 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 10 seconds! Since: Wed Jun 28 04:50:58 2023
[...]
300 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 114 seconds! (01m54.00s) Since: Wed Jun 28 04:50:58 2023
301 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 115 seconds! (01m55.00s) Since: Wed Jun 28 04:50:58 2023
302 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 116 seconds! (01m56.00s) Since: Wed Jun 28 04:50:58 2023
303 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 117 seconds! (01m57.00s) Since: Wed Jun 28 04:50:58 2023
304 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 118 seconds! (01m58.00s) Since: Wed Jun 28 04:50:58 2023
305 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 119 seconds! (01m59.00s) Since: Wed Jun 28 04:50:58 2023
306 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 120 seconds! (02m00.00s) Since: Wed Jun 28 04:50:58 2023
307 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 121 seconds! (02m01.00s) Since: Wed Jun 28 04:50:58 2023
308 (dt j:1 t:1): No progress made for record 312 (lba 314137, offset 160838144) during read() on /t/vol_nfs_3/dt_1687942082 for 122 seconds! (02m02.00s) Since: Wed Jun 28 04:50:58 2023
309 (dt j:1 t:1): The runtime of 288 seconds has expired, terminating thread...
310 (dt j:1 t:1): Job 1, thread has exceeded the max terminate wait time of 180 seconds!
RobinTMiller commented 1 year ago

Sadly, when remote mounts/shares or iSCSI paths disappear, I/O is hung so never completes. Often even dt cannot exit cleanly as well, as kernel I/O rundown cannot cancel the request. The dt hung I/O detection is performed by the monitoring thread, and as written today does not provide a method to report total statistics, sorry. Even if a hook was added prior to attempting to terminate the thread, the per pass statistics will need to be accumulated before reporting the totals. I'll review.

RobinTMiller commented 1 year ago

Please try to following changes on a hung NFS mount: (I am unable to verify this myself)

robin@LAPTOP-BJH5MV95 ~/GitHub/dt $ diff -cw dt.c-orig dt.c *** dt.c-orig 2023-06-09 10:01:57.428606800 -0700 --- dt.c 2023-08-18 10:26:33.784023500 -0700


* 1,6 ** /****

robin@LAPTOP-BJH5MV95 ~/GitHub/dt $

Please let me know if this works, then I'll check this in and update the dt version, thanks!

RobinTMiller commented 1 year ago

hmmm... the code changes may have been munged in the comments, please advise if you need the updated dt.c

jhollowe commented 1 year ago

This change looks good! thanks for working on this!

RobinTMiller commented 1 year ago
Thanks John, I'll check in this change and bump the version accordingly.

Cheers, Robin

On Thu, Sep 7, 2023 at 10:42 AM John Hollowell @.***> wrote:

This change looks good! thanks for working on this!

— Reply to this email directly, view it on GitHub https://github.com/RobinTMiller/dt/issues/9#issuecomment-1710547322, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG6RMDJIGQJAPQECXCCBAWDXZIBP3ANCNFSM6AAAAAA3PZLZ64 . You are receiving this because you commented.Message ID: @.***>

jhollowe commented 1 year ago

Hey Robin, just wanted to check when you think you will be able to commit this?

RobinTMiller commented 1 year ago

These changes were pushed at this time, thanks!

Commits on Sep 14, 2023 Merge branch 'master' of github.com:RobinTMiller/dt @RobinTMiller RobinTMiller committed 3 weeks ago Report total statistics before cancelling hung threads. @RobinTMiller RobinTMiller committed 3 weeks ago