VI4IO / io500-app

Development version of the new IO-500 Application
MIT License

IOR_Hard can't return results #23

Closed (abrekhov closed this issue 4 years ago)

abrekhov commented 4 years ago

Hi! I'm facing a similar issue to the one in the io500-dev repo (https://github.com/VI4IO/io-500-dev/issues/62).

I'm running only the ior_hard part, and after 300 s it doesn't stop. Here is the Slurm output file:

IO500 version io500-isc20_v3-2-g46e0e53a3abf
[RESULT-invalid]       ior-easy-write        0.000000 GiB/s  : time 0.000 seconds
ERROR INVALID Runtime of phase (0.000016) is below stonewall time. This shouldn't happen!
[RESULT-invalid]    mdtest-easy-write        0.000000 kIOPS : time 0.000 seconds
ERROR INVALID Runtime of phase (0.000012) is below stonewall time. This shouldn't happen!
395: stonewalling pairs accessed: 67654
398: stonewalling pairs accessed: 65332
392: stonewalling pairs accessed: 67654
310: stonewalling pairs accessed: 71551
49: stonewalling pairs accessed: 61187
56: stonewalling pairs accessed: 63433
57: stonewalling pairs accessed: 67650
313: stonewalling pairs accessed: 72728
318: stonewalling pairs accessed: 72626
396: stonewalling pairs accessed: 65198
385: stonewalling pairs accessed: 62131
309: stonewalling pairs accessed: 72731
391: stonewalling pairs accessed: 64521
55: stonewalling pairs accessed: 67650
298: stonewalling pairs accessed: 63940
62: stonewalling pairs accessed: 64614
58: stonewalling pairs accessed: 63754
144: stonewalling pairs accessed: 72602
291: stonewalling pairs accessed: 63826
354: stonewalling pairs accessed: 72500
159: stonewalling pairs accessed: 75638
155: stonewalling pairs accessed: 71205
146: stonewalling pairs accessed: 72603
297: stonewalling pairs accessed: 67563
301: stonewalling pairs accessed: 67562
356: stonewalling pairs accessed: 71738
20: stonewalling pairs accessed: 72466
293: stonewalling pairs accessed: 64652
296: stonewalling pairs accessed: 64867
423: stonewalling pairs accessed: 72529
424: stonewalling pairs accessed: 72592
429: stonewalling pairs accessed: 72962
153: stonewalling pairs accessed: 71909
17: stonewalling pairs accessed: 72048
357: stonewalling pairs accessed: 72608
154: stonewalling pairs accessed: 72605
355: stonewalling pairs accessed: 72610
418: stonewalling pairs accessed: 72579
28: stonewalling pairs accessed: 70731
31: stonewalling pairs accessed: 72910
425: stonewalling pairs accessed: 72677
362: stonewalling pairs accessed: 72564
24: stonewalling pairs accessed: 72802
63: stonewalling pairs accessed: 67658
244: stonewalling pairs accessed: 72817
240: stonewalling pairs accessed: 69207
249: stonewalling pairs accessed: 72816
307: stonewalling pairs accessed: 72741
187: stonewalling pairs accessed: 67394
315: stonewalling pairs accessed: 72497
188: stonewalling pairs accessed: 64054
241: stonewalling pairs accessed: 71564
48: stonewalling pairs accessed: 67632

Output of the results in ior-hard-write.txt:

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Fri Jun 19 22:04:52 2020
Command line        : ./ior -v -C -Q 1 -g -G 27 -k -e -o /lustre/iospb/io500-2020spring/datafiles/2020.06.19-22.00.16-app/ior-hard/file -O stoneWallingStatusFile=/lustre/iospb/io500-2020spring/datafiles/2020.06.19-22.00.16-app/ior-hard/stonewall -O stoneWallingWearOut=1 -t 47008 -b 47008 -s 150000 -w -D 300 -a MPIIO
Machine             : Linux n06p001
Start time skew across all tasks: 847476.39 sec
TestID              : 0
StartTime           : Fri Jun 19 22:04:52 2020
Path                : /lustre/iospb/io500-2020spring/datafiles/2020.06.19-22.00.16-app/ior-hard
FS                  : 84.6 TiB   Used FS: 16.3%   Inodes: 342.0 Mi   Used Inodes: 5.4%
Participating tasks: 432
Using reorderTasks '-C' (useful to avoid read cache in client)

Options:
api                 : MPIIO
apiVersion          : (3.1)
test filename       : /lustre/iospb/io500-2020spring/datafiles/2020.06.19-22.00.16-app/ior-hard/file
access              : single-shared-file
type                : independent
segments            : 150000
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
nodes               : 27
tasks               : 432
clients per node    : 16
repetitions         : 1
xfersize            : 47008 bytes
blocksize           : 47008 bytes
aggregate filesize  : 2.77 TiB
stonewallingTime    : 300
stoneWallingWearOut : 1

Results:

And I see that the writing process has already stopped, but the job has been running for 35 minutes. I've tried changing segmentCount, as in the previous solution, but it didn't help.

JulianKunkel commented 4 years ago

Thanks, that needs more debugging. Was that the whole content of ior-hard-write.txt? It appears that only 53 outputs are given, which may be an indication that a process actually hung doing I/O. Can you check with dmesg that no node reports a kernel error? In my experience that sometimes leads to hung processes (it depends on the file system). Once you have triggered this issue, could you try running the write again on the same nodes using the command line directly, maybe with a 5-second stonewall?
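
For reference, such a rerun can reuse the command line recorded in the ior-hard-write.txt output above, only with the stonewall lowered via -D 5. In the sketch below, the mpirun invocation is a placeholder for whatever the site normally uses, <run-dir> stands for the timestamped data directory, and the reduced segment count -s is just an illustrative value to keep the run bounded:

mpirun <site-specific MPI args> ./ior -v -C -Q 1 -g -G 27 -k -e \
    -o /lustre/iospb/io500-2020spring/datafiles/<run-dir>/ior-hard/file \
    -O stoneWallingStatusFile=/lustre/iospb/io500-2020spring/datafiles/<run-dir>/ior-hard/stonewall \
    -O stoneWallingWearOut=1 -t 47008 -b 47008 -s 52000 -w -D 5 -a MPIIO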

abrekhov commented 4 years ago

Here is the whole content of ior-hard-write.txt before cancelling the job:

n06p001: module loaded
IO500 version io500-isc20_v3-2-g46e0e53a3abf
[RESULT-invalid]       ior-easy-write        0.000000 GiB/s  : time 0.000 seconds
ERROR INVALID Runtime of phase (0.000016) is below stonewall time. This shouldn't happen!
[RESULT-invalid]    mdtest-easy-write        0.000000 kIOPS : time 0.000 seconds
ERROR INVALID Runtime of phase (0.000012) is below stonewall time. This shouldn't happen!
395: stonewalling pairs accessed: 67654
398: stonewalling pairs accessed: 65332
392: stonewalling pairs accessed: 67654
310: stonewalling pairs accessed: 71551
49: stonewalling pairs accessed: 61187
56: stonewalling pairs accessed: 63433
57: stonewalling pairs accessed: 67650
313: stonewalling pairs accessed: 72728
318: stonewalling pairs accessed: 72626
396: stonewalling pairs accessed: 65198
385: stonewalling pairs accessed: 62131
309: stonewalling pairs accessed: 72731
391: stonewalling pairs accessed: 64521
55: stonewalling pairs accessed: 67650
298: stonewalling pairs accessed: 63940
62: stonewalling pairs accessed: 64614
58: stonewalling pairs accessed: 63754
144: stonewalling pairs accessed: 72602
291: stonewalling pairs accessed: 63826
354: stonewalling pairs accessed: 72500
159: stonewalling pairs accessed: 75638
155: stonewalling pairs accessed: 71205
146: stonewalling pairs accessed: 72603
297: stonewalling pairs accessed: 67563
301: stonewalling pairs accessed: 67562
356: stonewalling pairs accessed: 71738
20: stonewalling pairs accessed: 72466
293: stonewalling pairs accessed: 64652
296: stonewalling pairs accessed: 64867
423: stonewalling pairs accessed: 72529
424: stonewalling pairs accessed: 72592
429: stonewalling pairs accessed: 72962
153: stonewalling pairs accessed: 71909
17: stonewalling pairs accessed: 72048
357: stonewalling pairs accessed: 72608
154: stonewalling pairs accessed: 72605
355: stonewalling pairs accessed: 72610
418: stonewalling pairs accessed: 72579
28: stonewalling pairs accessed: 70731
31: stonewalling pairs accessed: 72910
425: stonewalling pairs accessed: 72677
362: stonewalling pairs accessed: 72564
24: stonewalling pairs accessed: 72802
63: stonewalling pairs accessed: 67658
244: stonewalling pairs accessed: 72817
240: stonewalling pairs accessed: 69207
249: stonewalling pairs accessed: 72816
307: stonewalling pairs accessed: 72741
187: stonewalling pairs accessed: 67394
315: stonewalling pairs accessed: 72497
188: stonewalling pairs accessed: 64054
241: stonewalling pairs accessed: 71564
48: stonewalling pairs accessed: 67632
248: stonewalling pairs accessed: 71551
397: stonewalling pairs accessed: 67671
376: stonewalling pairs accessed: 68725
316: stonewalling pairs accessed: 72744
221: stonewalling pairs accessed: 72759
368: stonewalling pairs accessed: 63701
184: stonewalling pairs accessed: 62918
386: stonewalling pairs accessed: 62361
186: stonewalling pairs accessed: 64504
344: stonewalling pairs accessed: 72617
254: stonewalling pairs accessed: 71861
308: stonewalling pairs accessed: 69133
348: stonewalling pairs accessed: 71662
387: stonewalling pairs accessed: 63302
304: stonewalling pairs accessed: 64477
305: stonewalling pairs accessed: 65171
317: stonewalling pairs accessed: 72232
50: stonewalling pairs accessed: 62054
388: stonewalling pairs accessed: 64537
191: stonewalling pairs accessed: 67399
399: stonewalling pairs accessed: 67672
311: stonewalling pairs accessed: 72746
347: stonewalling pairs accessed: 72616
210: stonewalling pairs accessed: 65953
306: stonewalling pairs accessed: 67270
379: stonewalling pairs accessed: 68110
346: stonewalling pairs accessed: 71986
233: stonewalling pairs accessed: 69550
51: stonewalling pairs accessed: 63450
314: stonewalling pairs accessed: 71120
393: stonewalling pairs accessed: 64538
145: stonewalling pairs accessed: 70753
384: stonewalling pairs accessed: 61948
213: stonewalling pairs accessed: 72762
367: stonewalling pairs accessed: 72620
282: stonewalling pairs accessed: 68719
319: stonewalling pairs accessed: 72748
390: stonewalling pairs accessed: 64355
220: stonewalling pairs accessed: 72182
389: stonewalling pairs accessed: 63203
394: stonewalling pairs accessed: 65344
312: stonewalling pairs accessed: 71631
287: stonewalling pairs accessed: 61708
60: stonewalling pairs accessed: 60101
22: stonewalling pairs accessed: 72531
431: stonewalling pairs accessed: 72975
258: stonewalling pairs accessed: 72773
271: stonewalling pairs accessed: 72775
289: stonewalling pairs accessed: 62726
340: stonewalling pairs accessed: 72469
320: stonewalling pairs accessed: 70087
330: stonewalling pairs accessed: 70987
231: stonewalling pairs accessed: 72075
365: stonewalling pairs accessed: 72622
157: stonewalling pairs accessed: 75657
52: stonewalling pairs accessed: 67656
288: stonewalling pairs accessed: 62037
281: stonewalling pairs accessed: 65593
267: stonewalling pairs accessed: 72774
121: stonewalling pairs accessed: 64445
151: stonewalling pairs accessed: 71847
322: stonewalling pairs accessed: 70336
331: stonewalling pairs accessed: 71867
232: stonewalling pairs accessed: 72919
236: stonewalling pairs accessed: 72445
54: stonewalling pairs accessed: 63312
59: stonewalling pairs accessed: 67673
303: stonewalling pairs accessed: 67575
283: stonewalling pairs accessed: 65012
225: stonewalling pairs accessed: 72519
61: stonewalling pairs accessed: 60971
286: stonewalling pairs accessed: 68722
239: stonewalling pairs accessed: 69106
299: stonewalling pairs accessed: 65433
148: stonewalling pairs accessed: 72162
53: stonewalling pairs accessed: 60729
228: stonewalling pairs accessed: 71017
259: stonewalling pairs accessed: 69468
265: stonewalling pairs accessed: 70261
150: stonewalling pairs accessed: 70934
23: stonewalling pairs accessed: 69710
321: stonewalling pairs accessed: 72575
294: stonewalling pairs accessed: 65863
334: stonewalling pairs accessed: 72692
2: stonewalling pairs accessed: 27682
19: stonewalling pairs accessed: 69462
401: stonewalling pairs accessed: 64994
115: stonewalling pairs accessed: 63416
300: stonewalling pairs accessed: 65739
120: stonewalling pairs accessed: 62689
302: stonewalling pairs accessed: 64753
124: stonewalling pairs accessed: 65239
156: stonewalling pairs accessed: 72297
403: stonewalling pairs accessed: 64488
411: stonewalling pairs accessed: 64995
417: stonewalling pairs accessed: 65986
257: stonewalling pairs accessed: 70417
333: stonewalling pairs accessed: 71870
363: stonewalling pairs accessed: 72627
361: stonewalling pairs accessed: 72419
324: stonewalling pairs accessed: 72576
290: stonewalling pairs accessed: 67563
149: stonewalling pairs accessed: 72621
278: stonewalling pairs accessed: 63454
364: stonewalling pairs accessed: 72572
420: stonewalling pairs accessed: 67885
16: stonewalling pairs accessed: 70052
125: stonewalling pairs accessed: 67977
416: stonewalling pairs accessed: 72596
93: stonewalling pairs accessed: 71061
158: stonewalling pairs accessed: 72394
292: stonewalling pairs accessed: 67578
21: stonewalling pairs accessed: 69503
166: stonewalling pairs accessed: 72418
428: stonewalling pairs accessed: 72755
343: stonewalling pairs accessed: 70744
284: stonewalling pairs accessed: 68723
160: stonewalling pairs accessed: 69127
136: stonewalling pairs accessed: 72743
27: stonewalling pairs accessed: 72928
413: stonewalling pairs accessed: 65768
147: stonewalling pairs accessed: 71328
359: stonewalling pairs accessed: 72629
412: stonewalling pairs accessed: 67440
26: stonewalling pairs accessed: 71671
295: stonewalling pairs accessed: 67580
138: stonewalling pairs accessed: 71935
421: stonewalling pairs accessed: 72693
353: stonewalling pairs accessed: 69384
82: stonewalling pairs accessed: 72768
358: stonewalling pairs accessed: 72533
30: stonewalling pairs accessed: 71658
419: stonewalling pairs accessed: 72759
426: stonewalling pairs accessed: 72758
366: stonewalling pairs accessed: 72576
360: stonewalling pairs accessed: 72303
152: stonewalling pairs accessed: 72625
430: stonewalling pairs accessed: 72922
18: stonewalling pairs accessed: 67341
352: stonewalling pairs accessed: 72629
173: stonewalling pairs accessed: 72110
250: stonewalling pairs accessed: 70565
422: stonewalling pairs accessed: 72760
427: stonewalling pairs accessed: 72983
131: stonewalling pairs accessed: 71686
143: stonewalling pairs accessed: 70530
29: stonewalling pairs accessed: 72931
164: stonewalling pairs accessed: 71250
168: stonewalling pairs accessed: 72421
243: stonewalling pairs accessed: 71564
25: stonewalling pairs accessed: 69410
86: stonewalling pairs accessed: 76137
247: stonewalling pairs accessed: 72834
181: stonewalling pairs accessed: 63777
201: stonewalling pairs accessed: 66669
252: stonewalling pairs accessed: 72115
280: stonewalling pairs accessed: 68722
171: stonewalling pairs accessed: 72888
185: stonewalling pairs accessed: 63267
180: stonewalling pairs accessed: 62655
190: stonewalling pairs accessed: 64179
345: stonewalling pairs accessed: 69753
373: stonewalling pairs accessed: 65056
208: stonewalling pairs accessed: 65204
177: stonewalling pairs accessed: 62605
375: stonewalling pairs accessed: 64790
381: stonewalling pairs accessed: 65997
253: stonewalling pairs accessed: 72838
182: stonewalling pairs accessed: 67418
242: stonewalling pairs accessed: 71479
246: stonewalling pairs accessed: 71593
178: stonewalling pairs accessed: 67423
383: stonewalling pairs accessed: 61784
251: stonewalling pairs accessed: 70710
179: stonewalling pairs accessed: 60642
378: stonewalling pairs accessed: 66892
374: stonewalling pairs accessed: 68761
176: stonewalling pairs accessed: 67422
255: stonewalling pairs accessed: 72842
337: stonewalling pairs accessed: 67756
260: stonewalling pairs accessed: 69954
216: stonewalling pairs accessed: 73405
369: stonewalling pairs accessed: 64806
372: stonewalling pairs accessed: 63758
215: stonewalling pairs accessed: 72178
217: stonewalling pairs accessed: 72176
371: stonewalling pairs accessed: 62841
92: stonewalling pairs accessed: 76585
377: stonewalling pairs accessed: 63882
380: stonewalling pairs accessed: 68764
229: stonewalling pairs accessed: 72942
183: stonewalling pairs accessed: 62292
189: stonewalling pairs accessed: 67426
245: stonewalling pairs accessed: 69190
338: stonewalling pairs accessed: 72489
211: stonewalling pairs accessed: 66363
382: stonewalling pairs accessed: 68766
218: stonewalling pairs accessed: 73410
370: stonewalling pairs accessed: 68753
85: stonewalling pairs accessed: 72072
414: stonewalling pairs accessed: 66455
209: stonewalling pairs accessed: 65442
214: stonewalling pairs accessed: 71022
222: stonewalling pairs accessed: 73411
223: stonewalling pairs accessed: 70068
69: stonewalling pairs accessed: 68591
68: stonewalling pairs accessed: 63859
336: stonewalling pairs accessed: 70195
78: stonewalling pairs accessed: 68587
351: stonewalling pairs accessed: 71277
350: stonewalling pairs accessed: 72644
261: stonewalling pairs accessed: 70166
7: stonewalling pairs accessed: 34800
341: stonewalling pairs accessed: 67741
212: stonewalling pairs accessed: 71391
329: stonewalling pairs accessed: 70857
219: stonewalling pairs accessed: 70684
230: stonewalling pairs accessed: 68579
101: stonewalling pairs accessed: 68943
342: stonewalling pairs accessed: 68906
262: stonewalling pairs accessed: 70676
118: stonewalling pairs accessed: 68001
349: stonewalling pairs accessed: 71754
106: stonewalling pairs accessed: 68945
256: stonewalling pairs accessed: 68971
87: stonewalling pairs accessed: 71044
264: stonewalling pairs accessed: 72797
1: stonewalling pairs accessed: 26738
274: stonewalling pairs accessed: 68749
13: stonewalling pairs accessed: 32419
406: stonewalling pairs accessed: 64941
12: stonewalling pairs accessed: 34961
5: stonewalling pairs accessed: 33435
11: stonewalling pairs accessed: 33392
4: stonewalling pairs accessed: 30125
122: stonewalling pairs accessed: 68000
224: stonewalling pairs accessed: 70423
237: stonewalling pairs accessed: 72949
323: stonewalling pairs accessed: 71802
112: stonewalling pairs accessed: 68004
9: stonewalling pairs accessed: 32732
339: stonewalling pairs accessed: 68958
8: stonewalling pairs accessed: 34966
95: stonewalling pairs accessed: 77099
133: stonewalling pairs accessed: 72765
263: stonewalling pairs accessed: 71348
327: stonewalling pairs accessed: 69719
227: stonewalling pairs accessed: 72548
279: stonewalling pairs accessed: 66535
328: stonewalling pairs accessed: 70495
409: stonewalling pairs accessed: 66232
114: stonewalling pairs accessed: 61701
277: stonewalling pairs accessed: 68758
226: stonewalling pairs accessed: 71242
273: stonewalling pairs accessed: 65068
266: stonewalling pairs accessed: 72013
127: stonewalling pairs accessed: 68006
14: stonewalling pairs accessed: 32567
276: stonewalling pairs accessed: 65063
272: stonewalling pairs accessed: 62871
269: stonewalling pairs accessed: 70737
10: stonewalling pairs accessed: 32883
80: stonewalling pairs accessed: 64664
137: stonewalling pairs accessed: 69677
234: stonewalling pairs accessed: 72567
15: stonewalling pairs accessed: 34951
123: stonewalling pairs accessed: 59353
235: stonewalling pairs accessed: 72952
270: stonewalling pairs accessed: 72636
238: stonewalling pairs accessed: 68297
116: stonewalling pairs accessed: 68006
400: stonewalling pairs accessed: 61825
6: stonewalling pairs accessed: 34762
285: stonewalling pairs accessed: 65611
325: stonewalling pairs accessed: 71508
113: stonewalling pairs accessed: 60668
3: stonewalling pairs accessed: 28742
326: stonewalling pairs accessed: 72601
169: stonewalling pairs accessed: 70660
275: stonewalling pairs accessed: 62435
90: stonewalling pairs accessed: 78669
335: stonewalling pairs accessed: 68892
407: stonewalling pairs accessed: 67471
161: stonewalling pairs accessed: 70567
117: stonewalling pairs accessed: 63422
332: stonewalling pairs accessed: 72730
126: stonewalling pairs accessed: 63926
268: stonewalling pairs accessed: 69361
402: stonewalling pairs accessed: 64042
405: stonewalling pairs accessed: 63065
415: stonewalling pairs accessed: 66910
119: stonewalling pairs accessed: 60001
162: stonewalling pairs accessed: 71602
130: stonewalling pairs accessed: 68438
410: stonewalling pairs accessed: 67472
142: stonewalling pairs accessed: 72779
94: stonewalling pairs accessed: 72935
140: stonewalling pairs accessed: 71820
129: stonewalling pairs accessed: 71952
128: stonewalling pairs accessed: 66283
408: stonewalling pairs accessed: 64832
84: stonewalling pairs accessed: 72815
139: stonewalling pairs accessed: 72780
172: stonewalling pairs accessed: 71046
170: stonewalling pairs accessed: 72143
167: stonewalling pairs accessed: 71604
134: stonewalling pairs accessed: 68476
404: stonewalling pairs accessed: 67468
174: stonewalling pairs accessed: 72919
204: stonewalling pairs accessed: 67081
135: stonewalling pairs accessed: 71905
194: stonewalling pairs accessed: 63311
165: stonewalling pairs accessed: 71537
132: stonewalling pairs accessed: 71847
91: stonewalling pairs accessed: 72110
89: stonewalling pairs accessed: 72110
192: stonewalling pairs accessed: 63526
202: stonewalling pairs accessed: 67084
141: stonewalling pairs accessed: 71961
175: stonewalling pairs accessed: 67081
199: stonewalling pairs accessed: 67084
206: stonewalling pairs accessed: 67085
196: stonewalling pairs accessed: 63525
83: stonewalling pairs accessed: 69104
88: stonewalling pairs accessed: 72949
163: stonewalling pairs accessed: 69083
197: stonewalling pairs accessed: 67084
207: stonewalling pairs accessed: 63259
203: stonewalling pairs accessed: 63813
195: stonewalling pairs accessed: 67079
205: stonewalling pairs accessed: 65744
81: stonewalling pairs accessed: 66311
198: stonewalling pairs accessed: 63813
193: stonewalling pairs accessed: 62651
200: stonewalling pairs accessed: 66229
67: stonewalling pairs accessed: 66181
64: stonewalling pairs accessed: 59446
102: stonewalling pairs accessed: 65345
71: stonewalling pairs accessed: 66097
65: stonewalling pairs accessed: 66178
99: stonewalling pairs accessed: 62658
109: stonewalling pairs accessed: 68983
79: stonewalling pairs accessed: 62510
74: stonewalling pairs accessed: 68635
98: stonewalling pairs accessed: 68971
100: stonewalling pairs accessed: 65857
108: stonewalling pairs accessed: 65905
76: stonewalling pairs accessed: 65674
97: stonewalling pairs accessed: 65327
96: stonewalling pairs accessed: 63202
103: stonewalling pairs accessed: 68985
73: stonewalling pairs accessed: 65403
111: stonewalling pairs accessed: 64026
105: stonewalling pairs accessed: 65559
66: stonewalling pairs accessed: 65078
72: stonewalling pairs accessed: 68638
104: stonewalling pairs accessed: 64801
107: stonewalling pairs accessed: 62528
77: stonewalling pairs accessed: 67052
70: stonewalling pairs accessed: 63951
75: stonewalling pairs accessed: 63410
110: stonewalling pairs accessed: 63150

I've sorted them, and the smallest number was ~26000.

I've checked dmesg on several nodes; it seems everything is OK. Right now I will start a 5 s stonewall run and check dmesg at the same time.

abrekhov commented 4 years ago

Here is the new output with a 5 s stonewall.

ERROR INVALID stonewall-time != 300s
IO500 version io500-isc20_v3-2-g46e0e53a3abf
[RESULT-invalid]       ior-easy-write        0.000000 GiB/s  : time 0.000 seconds
ERROR INVALID Runtime of phase (0.000018) is below stonewall time. This shouldn't happen!
[RESULT-invalid]    mdtest-easy-write        0.000000 kIOPS : time 0.000 seconds
ERROR INVALID Runtime of phase (0.000012) is below stonewall time. This shouldn't happen!
219: stonewalling pairs accessed: 1138
208: stonewalling pairs accessed: 1075
210: stonewalling pairs accessed: 1054
223: stonewalling pairs accessed: 1312
213: stonewalling pairs accessed: 1169
221: stonewalling pairs accessed: 1184
215: stonewalling pairs accessed: 1141
216: stonewalling pairs accessed: 1184
209: stonewalling pairs accessed: 1190
217: stonewalling pairs accessed: 1213
218: stonewalling pairs accessed: 994
212: stonewalling pairs accessed: 1130
220: stonewalling pairs accessed: 1157
222: stonewalling pairs accessed: 1210
211: stonewalling pairs accessed: 1083
214: stonewalling pairs accessed: 1214
194: stonewalling pairs accessed: 915
201: stonewalling pairs accessed: 1057
200: stonewalling pairs accessed: 1041
197: stonewalling pairs accessed: 1063
204: stonewalling pairs accessed: 1135
203: stonewalling pairs accessed: 1080
207: stonewalling pairs accessed: 1124
199: stonewalling pairs accessed: 1002
202: stonewalling pairs accessed: 1136
206: stonewalling pairs accessed: 1129
205: stonewalling pairs accessed: 1064
195: stonewalling pairs accessed: 978
198: stonewalling pairs accessed: 960
80: stonewalling pairs accessed: 1175
94: stonewalling pairs accessed: 1076
84: stonewalling pairs accessed: 1211
91: stonewalling pairs accessed: 1060
193: stonewalling pairs accessed: 1060
88: stonewalling pairs accessed: 1065
93: stonewalling pairs accessed: 1067
92: stonewalling pairs accessed: 1211
365: stonewalling pairs accessed: 1139
82: stonewalling pairs accessed: 1056
85: stonewalling pairs accessed: 1066
81: stonewalling pairs accessed: 1130
48: stonewalling pairs accessed: 903
319: stonewalling pairs accessed: 1281
57: stonewalling pairs accessed: 1079
86: stonewalling pairs accessed: 1189
315: stonewalling pairs accessed: 1133
359: stonewalling pairs accessed: 1122
305: stonewalling pairs accessed: 869
363: stonewalling pairs accessed: 1185
59: stonewalling pairs accessed: 1017
56: stonewalling pairs accessed: 1106
61: stonewalling pairs accessed: 1012
312: stonewalling pairs accessed: 1269
356: stonewalling pairs accessed: 1118
52: stonewalling pairs accessed: 1055
60: stonewalling pairs accessed: 1109
353: stonewalling pairs accessed: 1077
63: stonewalling pairs accessed: 1109
253: stonewalling pairs accessed: 1123
249: stonewalling pairs accessed: 1285
364: stonewalling pairs accessed: 1134
252: stonewalling pairs accessed: 1299
306: stonewalling pairs accessed: 877
196: stonewalling pairs accessed: 1036
251: stonewalling pairs accessed: 1149
246: stonewalling pairs accessed: 1120
240: stonewalling pairs accessed: 1132
244: stonewalling pairs accessed: 1238
54: stonewalling pairs accessed: 1094
247: stonewalling pairs accessed: 1147
367: stonewalling pairs accessed: 1139
58: stonewalling pairs accessed: 1103
250: stonewalling pairs accessed: 1125
304: stonewalling pairs accessed: 857
311: stonewalling pairs accessed: 1222
53: stonewalling pairs accessed: 1009
255: stonewalling pairs accessed: 1302
314: stonewalling pairs accessed: 1273
248: stonewalling pairs accessed: 1250
316: stonewalling pairs accessed: 1158
243: stonewalling pairs accessed: 1134
245: stonewalling pairs accessed: 1091
310: stonewalling pairs accessed: 1274
307: stonewalling pairs accessed: 887
62: stonewalling pairs accessed: 1050
308: stonewalling pairs accessed: 1266
313: stonewalling pairs accessed: 1170
354: stonewalling pairs accessed: 1119
317: stonewalling pairs accessed: 1279
241: stonewalling pairs accessed: 1054
90: stonewalling pairs accessed: 1099
192: stonewalling pairs accessed: 1036
288: stonewalling pairs accessed: 1103
294: stonewalling pairs accessed: 1018
361: stonewalling pairs accessed: 987
309: stonewalling pairs accessed: 920
95: stonewalling pairs accessed: 1100
318: stonewalling pairs accessed: 957
49: stonewalling pairs accessed: 933
298: stonewalling pairs accessed: 1126
366: stonewalling pairs accessed: 1173
281: stonewalling pairs accessed: 1055
358: stonewalling pairs accessed: 1163
362: stonewalling pairs accessed: 1131
290: stonewalling pairs accessed: 1030
357: stonewalling pairs accessed: 1147
302: stonewalling pairs accessed: 1033
355: stonewalling pairs accessed: 1160
55: stonewalling pairs accessed: 1017
289: stonewalling pairs accessed: 961
301: stonewalling pairs accessed: 989
299: stonewalling pairs accessed: 1062
293: stonewalling pairs accessed: 974
300: stonewalling pairs accessed: 1127
242: stonewalling pairs accessed: 1110
285: stonewalling pairs accessed: 1048
254: stonewalling pairs accessed: 1192
272: stonewalling pairs accessed: 875
291: stonewalling pairs accessed: 1118
275: stonewalling pairs accessed: 931
20: stonewalling pairs accessed: 1085
279: stonewalling pairs accessed: 1069
89: stonewalling pairs accessed: 1077
296: stonewalling pairs accessed: 1127
16: stonewalling pairs accessed: 1075
282: stonewalling pairs accessed: 1125
368: stonewalling pairs accessed: 932
292: stonewalling pairs accessed: 947
23: stonewalling pairs accessed: 1170
283: stonewalling pairs accessed: 1008
24: stonewalling pairs accessed: 1111
273: stonewalling pairs accessed: 894
375: stonewalling pairs accessed: 1045
30: stonewalling pairs accessed: 1091
18: stonewalling pairs accessed: 1087
295: stonewalling pairs accessed: 1060
371: stonewalling pairs accessed: 956
17: stonewalling pairs accessed: 1142
28: stonewalling pairs accessed: 1139
280: stonewalling pairs accessed: 1011
379: stonewalling pairs accessed: 1066
274: stonewalling pairs accessed: 925
297: stonewalling pairs accessed: 1061
383: stonewalling pairs accessed: 1091
381: stonewalling pairs accessed: 1016
26: stonewalling pairs accessed: 1134
22: stonewalling pairs accessed: 1133
382: stonewalling pairs accessed: 1030
25: stonewalling pairs accessed: 1038
377: stonewalling pairs accessed: 1001
87: stonewalling pairs accessed: 1220
303: stonewalling pairs accessed: 1129
374: stonewalling pairs accessed: 1009
287: stonewalling pairs accessed: 1015
50: stonewalling pairs accessed: 981
378: stonewalling pairs accessed: 1026
360: stonewalling pairs accessed: 1167
372: stonewalling pairs accessed: 1017
421: stonewalling pairs accessed: 1107
426: stonewalling pairs accessed: 1145
83: stonewalling pairs accessed: 1163
120: stonewalling pairs accessed: 997
115: stonewalling pairs accessed: 976
380: stonewalling pairs accessed: 998
286: stonewalling pairs accessed: 1097
112: stonewalling pairs accessed: 1101
126: stonewalling pairs accessed: 1021
431: stonewalling pairs accessed: 1164
125: stonewalling pairs accessed: 990
51: stonewalling pairs accessed: 1009
425: stonewalling pairs accessed: 1139
418: stonewalling pairs accessed: 1079
416: stonewalling pairs accessed: 1019
423: stonewalling pairs accessed: 1097
122: stonewalling pairs accessed: 1103
119: stonewalling pairs accessed: 977
27: stonewalling pairs accessed: 1161
419: stonewalling pairs accessed: 1088
127: stonewalling pairs accessed: 1143
376: stonewalling pairs accessed: 1068
121: stonewalling pairs accessed: 1063
116: stonewalling pairs accessed: 996
113: stonewalling pairs accessed: 944
29: stonewalling pairs accessed: 1166
422: stonewalling pairs accessed: 1087
117: stonewalling pairs accessed: 920
427: stonewalling pairs accessed: 1165
370: stonewalling pairs accessed: 1040
118: stonewalling pairs accessed: 945
31: stonewalling pairs accessed: 1183
420: stonewalling pairs accessed: 1097
114: stonewalling pairs accessed: 945
19: stonewalling pairs accessed: 1141
428: stonewalling pairs accessed: 1127
430: stonewalling pairs accessed: 1155
21: stonewalling pairs accessed: 1173
417: stonewalling pairs accessed: 1035
277: stonewalling pairs accessed: 983
424: stonewalling pairs accessed: 1132
369: stonewalling pairs accessed: 993
232: stonewalling pairs accessed: 1168
429: stonewalling pairs accessed: 1143
124: stonewalling pairs accessed: 1107
234: stonewalling pairs accessed: 1073
238: stonewalling pairs accessed: 1083
224: stonewalling pairs accessed: 987
226: stonewalling pairs accessed: 1012
123: stonewalling pairs accessed: 1023
230: stonewalling pairs accessed: 1016
229: stonewalling pairs accessed: 1189
373: stonewalling pairs accessed: 1068
225: stonewalling pairs accessed: 1001
237: stonewalling pairs accessed: 1242
227: stonewalling pairs accessed: 1017
384: stonewalling pairs accessed: 944
235: stonewalling pairs accessed: 1240
392: stonewalling pairs accessed: 1014
239: stonewalling pairs accessed: 1162
398: stonewalling pairs accessed: 1077
390: stonewalling pairs accessed: 973
386: stonewalling pairs accessed: 988
341: stonewalling pairs accessed: 1215
336: stonewalling pairs accessed: 1223
408: stonewalling pairs accessed: 1105
233: stonewalling pairs accessed: 1236
396: stonewalling pairs accessed: 1170
344: stonewalling pairs accessed: 1214
345: stonewalling pairs accessed: 1185
401: stonewalling pairs accessed: 994
400: stonewalling pairs accessed: 960
393: stonewalling pairs accessed: 1030
407: stonewalling pairs accessed: 1063
346: stonewalling pairs accessed: 1067
397: stonewalling pairs accessed: 1115
338: stonewalling pairs accessed: 1225
413: stonewalling pairs accessed: 1117
391: stonewalling pairs accessed: 997
406: stonewalling pairs accessed: 1024
395: stonewalling pairs accessed: 1123
405: stonewalling pairs accessed: 1115
284: stonewalling pairs accessed: 1096
385: stonewalling pairs accessed: 965
340: stonewalling pairs accessed: 1160
351: stonewalling pairs accessed: 1241
343: stonewalling pairs accessed: 1164
411: stonewalling pairs accessed: 1038
350: stonewalling pairs accessed: 1186
339: stonewalling pairs accessed: 1099
415: stonewalling pairs accessed: 976
347: stonewalling pairs accessed: 1109
387: stonewalling pairs accessed: 1030
389: stonewalling pairs accessed: 1030
394: stonewalling pairs accessed: 1077
403: stonewalling pairs accessed: 1025
410: stonewalling pairs accessed: 1025
399: stonewalling pairs accessed: 1122
278: stonewalling pairs accessed: 1021
337: stonewalling pairs accessed: 1099
342: stonewalling pairs accessed: 1103
388: stonewalling pairs accessed: 1006
228: stonewalling pairs accessed: 1030
236: stonewalling pairs accessed: 1065
231: stonewalling pairs accessed: 1034
409: stonewalling pairs accessed: 1106
412: stonewalling pairs accessed: 1091
414: stonewalling pairs accessed: 1102
102: stonewalling pairs accessed: 999
108: stonewalling pairs accessed: 1120
104: stonewalling pairs accessed: 970
110: stonewalling pairs accessed: 1145
109: stonewalling pairs accessed: 1015
111: stonewalling pairs accessed: 1004
106: stonewalling pairs accessed: 1138
107: stonewalling pairs accessed: 1098
404: stonewalling pairs accessed: 1027
64: stonewalling pairs accessed: 1043
77: stonewalling pairs accessed: 1100
402: stonewalling pairs accessed: 1091
270: stonewalling pairs accessed: 1168
76: stonewalling pairs accessed: 1169
185: stonewalling pairs accessed: 1123
75: stonewalling pairs accessed: 1090
181: stonewalling pairs accessed: 1123
348: stonewalling pairs accessed: 1104
349: stonewalling pairs accessed: 1040
256: stonewalling pairs accessed: 1036
257: stonewalling pairs accessed: 1049
263: stonewalling pairs accessed: 1114
72: stonewalling pairs accessed: 1126
65: stonewalling pairs accessed: 1005
188: stonewalling pairs accessed: 1125
70: stonewalling pairs accessed: 1002
74: stonewalling pairs accessed: 1141
69: stonewalling pairs accessed: 1119
99: stonewalling pairs accessed: 1012
189: stonewalling pairs accessed: 955
264: stonewalling pairs accessed: 1136
78: stonewalling pairs accessed: 1164
100: stonewalling pairs accessed: 981
101: stonewalling pairs accessed: 1034
103: stonewalling pairs accessed: 1035
177: stonewalling pairs accessed: 962
73: stonewalling pairs accessed: 1076
96: stonewalling pairs accessed: 980
276: stonewalling pairs accessed: 953
266: stonewalling pairs accessed: 1160
67: stonewalling pairs accessed: 1006
183: stonewalling pairs accessed: 1045
176: stonewalling pairs accessed: 935
269: stonewalling pairs accessed: 1168
79: stonewalling pairs accessed: 1095
182: stonewalling pairs accessed: 994
186: stonewalling pairs accessed: 1019
180: stonewalling pairs accessed: 991
184: stonewalling pairs accessed: 1097
259: stonewalling pairs accessed: 1068
71: stonewalling pairs accessed: 1093
190: stonewalling pairs accessed: 996
268: stonewalling pairs accessed: 1119
191: stonewalling pairs accessed: 1023
68: stonewalling pairs accessed: 1083
98: stonewalling pairs accessed: 981
97: stonewalling pairs accessed: 1010
105: stonewalling pairs accessed: 994
265: stonewalling pairs accessed: 1111
152: stonewalling pairs accessed: 1245
158: stonewalling pairs accessed: 1311
149: stonewalling pairs accessed: 1285
146: stonewalling pairs accessed: 1213
159: stonewalling pairs accessed: 1446
151: stonewalling pairs accessed: 1200
260: stonewalling pairs accessed: 1119
66: stonewalling pairs accessed: 1085
153: stonewalling pairs accessed: 1293
262: stonewalling pairs accessed: 1158
144: stonewalling pairs accessed: 847
154: stonewalling pairs accessed: 1331
267: stonewalling pairs accessed: 1141
271: stonewalling pairs accessed: 1131
155: stonewalling pairs accessed: 1421
147: stonewalling pairs accessed: 1290
178: stonewalling pairs accessed: 996
258: stonewalling pairs accessed: 1101
150: stonewalling pairs accessed: 1413
261: stonewalling pairs accessed: 1131
187: stonewalling pairs accessed: 1077
156: stonewalling pairs accessed: 1415
157: stonewalling pairs accessed: 1415
145: stonewalling pairs accessed: 1187
179: stonewalling pairs accessed: 1127
9: stonewalling pairs accessed: 592
11: stonewalling pairs accessed: 576
3: stonewalling pairs accessed: 569
12: stonewalling pairs accessed: 620
148: stonewalling pairs accessed: 1448
14: stonewalling pairs accessed: 585
4: stonewalling pairs accessed: 607
1: stonewalling pairs accessed: 569
7: stonewalling pairs accessed: 573
171: stonewalling pairs accessed: 1165
174: stonewalling pairs accessed: 1306
167: stonewalling pairs accessed: 1194
162: stonewalling pairs accessed: 1097
352: stonewalling pairs accessed: 1059
160: stonewalling pairs accessed: 1042
8: stonewalling pairs accessed: 674
170: stonewalling pairs accessed: 1317
173: stonewalling pairs accessed: 1172
165: stonewalling pairs accessed: 1070
5: stonewalling pairs accessed: 640
164: stonewalling pairs accessed: 1101
166: stonewalling pairs accessed: 1148
168: stonewalling pairs accessed: 1157
10: stonewalling pairs accessed: 673
15: stonewalling pairs accessed: 692
325: stonewalling pairs accessed: 1156
6: stonewalling pairs accessed: 656
13: stonewalling pairs accessed: 692
2: stonewalling pairs accessed: 642
330: stonewalling pairs accessed: 1220
321: stonewalling pairs accessed: 1067
324: stonewalling pairs accessed: 1126
329: stonewalling pairs accessed: 1054
175: stonewalling pairs accessed: 1205
327: stonewalling pairs accessed: 1121
331: stonewalling pairs accessed: 1084
161: stonewalling pairs accessed: 1065
169: stonewalling pairs accessed: 1173
129: stonewalling pairs accessed: 1159
332: stonewalling pairs accessed: 1070
326: stonewalling pairs accessed: 1128
323: stonewalling pairs accessed: 1060
335: stonewalling pairs accessed: 1087
136: stonewalling pairs accessed: 1089
333: stonewalling pairs accessed: 1140
320: stonewalling pairs accessed: 961
131: stonewalling pairs accessed: 1166
334: stonewalling pairs accessed: 1065
135: stonewalling pairs accessed: 1065
132: stonewalling pairs accessed: 1114
328: stonewalling pairs accessed: 1128
128: stonewalling pairs accessed: 1102
130: stonewalling pairs accessed: 1105
322: stonewalling pairs accessed: 988
172: stonewalling pairs accessed: 1132
140: stonewalling pairs accessed: 1173
133: stonewalling pairs accessed: 1153
138: stonewalling pairs accessed: 1172
139: stonewalling pairs accessed: 1157
143: stonewalling pairs accessed: 1174
163: stonewalling pairs accessed: 1036
137: stonewalling pairs accessed: 1153
142: stonewalling pairs accessed: 1160
141: stonewalling pairs accessed: 1126
134: stonewalling pairs accessed: 1171

This is ior-hard-write.txt:

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Sat Jun 20 00:02:30 2020
Command line        : ./ior -v -C -Q 1 -g -G 27 -k -e -o /lustre/iospb/io500-2020spring/datafiles/2020.06.19-23.56.09-app/ior-hard/file -O stoneWallingStatusFile=/lustre/iospb/io500-2020spring/datafiles/2020.06.19-23.56.09-app/ior-hard/stonewall -O stoneWallingWearOut=1 -t 47008 -b 47008 -s 52000 -w -D 5 -a MPIIO
Machine             : Linux n06p001
Start time skew across all tasks: 847477.60 sec
TestID              : 0
StartTime           : Sat Jun 20 00:02:30 2020
Path                : /lustre/iospb/io500-2020spring/datafiles/2020.06.19-23.56.09-app/ior-hard
FS                  : 84.6 TiB   Used FS: 18.8%   Inodes: 342.0 Mi   Used Inodes: 5.4%
Participating tasks: 432
Using reorderTasks '-C' (useful to avoid read cache in client)

Options:
api                 : MPIIO
apiVersion          : (3.1)
test filename       : /lustre/iospb/io500-2020spring/datafiles/2020.06.19-23.56.09-app/ior-hard/file
access              : single-shared-file
type                : independent
segments            : 52000
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
nodes               : 27
tasks               : 432
clients per node    : 16
repetitions         : 1
xfersize            : 47008 bytes
blocksize           : 47008 bytes
aggregate filesize  : 983.47 GiB
stonewallingTime    : 5
stoneWallingWearOut : 1

Results:

I found a node with this LNet error:

[jun20 00:01] LNetError: 75447:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.252.243.155@o2ib added to recovery queue. Health = 900
[ +12,690730] LNetError: 69416:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.252.243.109@o2ib added to recovery queue. Health = 900
[jun20 00:05] LNetError: 76190:0:(o2iblnd_cb.c:2924:kiblnd_rejected()) 10.252.243.137@o2ib rejected: consumer defined fatal error
[  +0,001781] LNetError: 76190:0:(o2iblnd_cb.c:2924:kiblnd_rejected()) Skipped 845 previous similar messages

It seems that you're right. This node has some issues even when I just call lfs df -h. I'll try without it.

abrekhov commented 4 years ago

Julian, the reboot solved the problem. It seems some targets were unmounted during high load. Am I right that for a submission I need to wait for both the shell and C versions of the application? So it's about 3 hours instead of the previous one and a half? Thanks!

adilger commented 4 years ago

@abrekhov for the ISC'20 list there should be results submitted for both the C-app and Bash versions of IO500, so that the results can be compared to ensure continuity with previous submissions.

abrekhov commented 4 years ago

Thanks for the reply! So here's one more problem with ior_hard in the C version, from slurm.out:

[RESULT]       ior-hard-write        3.486921 GiB/s  : time 315.272 seconds
ior ERROR: open64("/lustre/iospb/io500-2020spring/datafiles/2020.06.22-20.07.39-app/mdtest-hard/test-dir.0-0/mdtest_tree.0/file.mdtest.332.22164", 66, 0664) failed, errno 28, No space left on device (aiori-POSIX.c:413)
Abort(-1) on node 332 (rank 332 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 332

But lfs df -h and lfs df -i show that there's plenty of space:

[root@n06p001 ~]# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
iospb-MDT0000_UUID        204.0G        4.6G      181.9G   3% /lustre/iospb[MDT:0]
iospb-MDT0001_UUID        204.0G        1.2G      185.3G   1% /lustre/iospb[MDT:1]
iospb-MDT0002_UUID        204.0G        1.2G      185.3G   1% /lustre/iospb[MDT:2]
iospb-MDT0003_UUID        204.0G        1.2G      185.3G   1% /lustre/iospb[MDT:3]
iospb-OST0000_UUID          1.8T      814.6G      936.0G  47% /lustre/iospb[OST:0]
iospb-OST0001_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:1]
iospb-OST0002_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:2]
iospb-OST0003_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:3]
iospb-OST0004_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:4]
iospb-OST0005_UUID          1.8T      814.6G      936.0G  47% /lustre/iospb[OST:5]
iospb-OST0006_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:6]
iospb-OST0007_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:7]
iospb-OST0008_UUID          1.8T      814.6G      936.0G  47% /lustre/iospb[OST:8]
iospb-OST0009_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:9]
iospb-OST000a_UUID          1.8T      862.3G      888.3G  50% /lustre/iospb[OST:10]
iospb-OST000b_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:11]
iospb-OST000c_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:12]
iospb-OST000d_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:13]
iospb-OST000e_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:14]
iospb-OST000f_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:15]
iospb-OST0010_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:16]
iospb-OST0011_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:17]
iospb-OST0012_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:18]
iospb-OST0013_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:19]
iospb-OST0014_UUID          1.8T      814.6G      936.0G  47% /lustre/iospb[OST:20]
iospb-OST0015_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:21]
iospb-OST0016_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:22]
iospb-OST0017_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:23]
iospb-OST0018_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:24]
iospb-OST0019_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:25]
iospb-OST001a_UUID          1.8T      814.6G      936.0G  47% /lustre/iospb[OST:26]
iospb-OST001b_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:27]
iospb-OST001c_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:28]
iospb-OST001d_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:29]
iospb-OST001e_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:30]
iospb-OST001f_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:31]
iospb-OST0020_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:32]
iospb-OST0021_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:33]
iospb-OST0022_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:34]
iospb-OST0023_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:35]
iospb-OST0024_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:36]
iospb-OST0025_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:37]
iospb-OST0026_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:38]
iospb-OST0027_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:39]
iospb-OST0028_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:40]
iospb-OST0029_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:41]
iospb-OST002a_UUID          1.8T      814.6G      936.0G  47% /lustre/iospb[OST:42]
iospb-OST002c_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:44]
iospb-OST002d_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:45]
iospb-OST002e_UUID          1.8T      910.5G      840.1G  53% /lustre/iospb[OST:46]
iospb-OST002f_UUID          1.8T      862.9G      887.7G  50% /lustre/iospb[OST:47]

filesystem_summary:        84.6T       41.1T       39.2T  52% /lustre/iospb

[root@n06p001 ~]# lfs df -i
UUID                      Inodes       IUsed       IFree IUse% Mounted on
iospb-MDT0000_UUID     146499696    41226803   105272893  29% /lustre/iospb[MDT:0]
iospb-MDT0001_UUID     146499696         355   146499341   1% /lustre/iospb[MDT:1]
iospb-MDT0002_UUID     146499696         355   146499341   1% /lustre/iospb[MDT:2]
iospb-MDT0003_UUID     146499696         355   146499341   1% /lustre/iospb[MDT:3]
iospb-OST0000_UUID       7631360      887313     6744047  12% /lustre/iospb[OST:0]
iospb-OST0001_UUID       7631360      878865     6752495  12% /lustre/iospb[OST:1]
iospb-OST0002_UUID       7631360      880401     6750959  12% /lustre/iospb[OST:2]
iospb-OST0003_UUID       7631360      879633     6751727  12% /lustre/iospb[OST:3]
iospb-OST0004_UUID       7631360      879761     6751599  12% /lustre/iospb[OST:4]
iospb-OST0005_UUID       7631360      879505     6751855  12% /lustre/iospb[OST:5]
iospb-OST0006_UUID       7631360      883281     6748079  12% /lustre/iospb[OST:6]
iospb-OST0007_UUID       7631360      879633     6751727  12% /lustre/iospb[OST:7]
iospb-OST0008_UUID       7631360      880081     6751279  12% /lustre/iospb[OST:8]
iospb-OST0009_UUID       7631360      880401     6750959  12% /lustre/iospb[OST:9]
iospb-OST000a_UUID       7631360      885329     6746031  12% /lustre/iospb[OST:10]
iospb-OST000b_UUID       7631360      884625     6746735  12% /lustre/iospb[OST:11]
iospb-OST000c_UUID       7631360      881361     6749999  12% /lustre/iospb[OST:12]
iospb-OST000d_UUID       7631360      878929     6752431  12% /lustre/iospb[OST:13]
iospb-OST000e_UUID       7631360      883985     6747375  12% /lustre/iospb[OST:14]
iospb-OST000f_UUID       7631360      879761     6751599  12% /lustre/iospb[OST:15]
iospb-OST0010_UUID       7631360      879825     6751535  12% /lustre/iospb[OST:16]
iospb-OST0011_UUID       7631360      888177     6743183  12% /lustre/iospb[OST:17]
iospb-OST0012_UUID       7631360      877425     6753935  12% /lustre/iospb[OST:18]
iospb-OST0013_UUID       7631360      878353     6753007  12% /lustre/iospb[OST:19]
iospb-OST0014_UUID       7631360      882193     6749167  12% /lustre/iospb[OST:20]
iospb-OST0015_UUID       7631360      880017     6751343  12% /lustre/iospb[OST:21]
iospb-OST0016_UUID       7631360      879953     6751407  12% /lustre/iospb[OST:22]
iospb-OST0017_UUID       7631360      878449     6752911  12% /lustre/iospb[OST:23]
iospb-OST0018_UUID       7631360      877009     6754351  12% /lustre/iospb[OST:24]
iospb-OST0019_UUID       7631360      882289     6749071  12% /lustre/iospb[OST:25]
iospb-OST001a_UUID       7631360      887985     6743375  12% /lustre/iospb[OST:26]
iospb-OST001b_UUID       7631360      879505     6751855  12% /lustre/iospb[OST:27]
iospb-OST001c_UUID       7631360      883280     6748080  12% /lustre/iospb[OST:28]
iospb-OST001d_UUID       7631360      880305     6751055  12% /lustre/iospb[OST:29]
iospb-OST001e_UUID       7631360      880017     6751343  12% /lustre/iospb[OST:30]
iospb-OST001f_UUID       7631360      887729     6743631  12% /lustre/iospb[OST:31]
iospb-OST0020_UUID       7631360      879569     6751791  12% /lustre/iospb[OST:32]
iospb-OST0021_UUID       7631360      882993     6748367  12% /lustre/iospb[OST:33]
iospb-OST0022_UUID       7631360      878161     6753199  12% /lustre/iospb[OST:34]
iospb-OST0023_UUID       7631360      878545     6752815  12% /lustre/iospb[OST:35]
iospb-OST0024_UUID       7631360      881713     6749647  12% /lustre/iospb[OST:36]
iospb-OST0025_UUID       7631360      881521     6749839  12% /lustre/iospb[OST:37]
iospb-OST0026_UUID       7631360      879889     6751471  12% /lustre/iospb[OST:38]
iospb-OST0027_UUID       7631360      879793     6751567  12% /lustre/iospb[OST:39]
iospb-OST0028_UUID       7631360      877457     6753903  12% /lustre/iospb[OST:40]
iospb-OST0029_UUID       7631360      878417     6752943  12% /lustre/iospb[OST:41]
iospb-OST002a_UUID       7631360      888017     6743343  12% /lustre/iospb[OST:42]
iospb-OST002c_UUID       7631360      888273     6743087  12% /lustre/iospb[OST:44]
iospb-OST002d_UUID       7631360      879889     6751471  12% /lustre/iospb[OST:45]
iospb-OST002e_UUID       7631360      879345     6752015  12% /lustre/iospb[OST:46]
iospb-OST002f_UUID       7631360      879825     6751535  12% /lustre/iospb[OST:47]

filesystem_summary:    358483006    41227868   317255138  12% /lustre/iospb

Any ideas how to solve this issue? Thanks!

UPD: It seems the last phase was mdtest_hard_write, but the FS has a lot of space left anyway. mdtest-hard-write.txt:

-- started at 06/22/2020 21:56:44 --

mdtest-3.3.0+dev was launched with 416 total task(s) on 26 node(s)
Command line used: ./mdtest '-n' '1000000' '-t' '-w' '3901' '-e' '3901' '-P' '-N' '1' '-F' '-d' '/lustre/iospb/io500-2020spring/datafiles/2020.06.22-20.07.39-app/mdtest-hard' '-x' '/lustre/iospb/io500-2020spring/datafiles/2020.06.22-20.07.39-app/mdtest-hard-stonewall' '-C' '-Y' '-W' '300' '-a' 'POSIX'
Path: /lustre/iospb/io500-2020spring/datafiles/2020.06.22-20.07.39-app
FS: 84.6 TiB   Used FS: 48.6%   Inodes: 341.9 Mi   Used Inodes: 9.1%

Nodemap: 11111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
V-0: Rank   0 Line  2166 Shifting ranks by 16 for each phase.
416 tasks, 416000000 files

adilger commented 4 years ago

I would suspect that you are running into the single-directory size limits of ldiskfs, which is about 10-12M files per directory (depends on filename length). It is possible to increase the directory size limit by enabling the large_dir feature on the MDTs via tune2fs -O large_dir /dev/<MDT> (this is enabled by default on newer versions of Lustre).
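
For illustration (to be run on the MDS hosts; /dev/<MDT> is a placeholder for the actual MDT block device), the feature can be checked and enabled roughly like this:

# check whether large_dir is already listed among the MDT's features
dumpe2fs -h /dev/<MDT> 2>/dev/null | grep -i 'features'
# enable it if it is missing
tune2fs -O large_dir /dev/<MDT>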

However, it is also visible from the lfs df -i output that you are only using a single MDT for your testing.

# lfs df -i
UUID                      Inodes       IUsed       IFree IUse% Mounted on
iospb-MDT0000_UUID     146499696    41226803   105272893  29% /lustre/iospb[MDT:0]
iospb-MDT0001_UUID     146499696         355   146499341   1% /lustre/iospb[MDT:1]
iospb-MDT0002_UUID     146499696         355   146499341   1% /lustre/iospb[MDT:2]
iospb-MDT0003_UUID     146499696         355   146499341   1% /lustre/iospb[MDT:3]

The better way to increase the directory size limit is to stripe the directories over all of the MDTs. This should also improve the mdtest performance that you are seeing. You can add a line into io500.sh::setup_directories() after the top-level mkdir command to create the mdtest-hard directories with DNE striping over all MDTs (unfortunately, the output directories are named differently between the C-app and Bash versions):

mkdir -p $workdir-{scr,app} $resultdir-{scr,app}
# stripe mdtest-hard directories over all MDTs for best performance/scale
lfs mkdir -c -1 $workdir-app/mdtest-hard
lfs setdirstripe -D -c -1 $workdir-app/mdtest-hard
lfs mkdir -c -1 $workdir-scr/mdt_hard
lfs setdirstripe -D -c -1 $workdir-scr/mdt_hard

The lfs mkdir command creates the new directory striped across all MDTs (-c -1) and the lfs setdirstripe command sets the default layout (-D) for new subdirectories created beneath that one to also be striped across all MDTs. That isn't recommended for all directories due to added overhead, but since we know mdtest-hard is going to be creating millions of files in a single directory, that gives the best performance in this case. I don't think this is needed for the mdtest-easy runs, because each thread has its own directory.
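
To sanity-check the resulting layout, lfs getdirstripe can be used (a small illustrative sketch, reusing the directory names from above):

# stripe count of the mdtest-hard directory itself
lfs getdirstripe $workdir-app/mdtest-hard
# default layout that new subdirectories will inherit
lfs getdirstripe -D $workdir-app/mdtest-hard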

abrekhov commented 4 years ago

Well, thanks for the advice! However, I already have these tunings:

  lfs setstripe -L mdt -E 1M $workdir-scr/mdt_hard
  lfs setstripe -L mdt -E 1M $workdir-app/mdt-hard
  lfs setdirstripe -c -1 -D $workdir-scr/mdt_hard
  lfs setdirstripe -c -1 -D $workdir-app/mdt-hard

And I thought that was the same as yours. I'm also using Lustre 2.13. But I'll try your parameters as-is and reply later with the results!

abrekhov commented 4 years ago

I've launched the test one more time. The C version started first, and after all phases the cleanup phase was launched. Right now the bash version has started. I can't see any cleanup option for the bash version in config-full.ini; is that right?

I suspect that this may be the cause of the previous failure...

JulianKunkel commented 4 years ago

I'm not quite sure I understand you correctly. Most of the files/directories are purged after the run. Some files may remain with the script version.

abrekhov commented 4 years ago

If I'm getting the flow right, after the benchmark starts, the first part (C or bash version of the app) is chosen randomly. Our monitoring tool detected that, after the bash part in the last test, files and directories were not purged. After that, the C version started and failed with the "no space left" error.

I've started the benchmark one more time. The C version ran first this time, and the monitoring tool detected that files and directories were purged. The bash version is now running; I hope it will finish correctly without errors.

I suppose the cleanup process after the bash version does not run by default and must be enabled in config.ini. But this is just a guess...

abrekhov commented 4 years ago

Both parts of the benchmark have finished. The bash version ran later, and its directories were not purged. I think my guess was right.

JulianKunkel commented 4 years ago

Yes, the bash version may retain some IOR files. Glad it worked now for you!

abrekhov commented 4 years ago

Thanks for help!

A few questions remain:

1. Both the C and Bash versions have finished, but there was no system-information.txt. Is it OK to create the results tarball manually?
2. Are you sure the cleanup phases of the C and Bash versions are the same? Space usage on the Lustre FS was not purged as it was in the case of the C version.


JulianKunkel commented 4 years ago

1. Both C and Bash versions have finished but there was not any system-information.txt. Is it OK to make the tar with results manually?

Yes, that is fine. In the submission dialog we explicitly ask for the system-information.txt file, so you can create it manually and add it during the submission.
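
As a purely illustrative example (the archive and directory names below are assumptions, not anything prescribed by the tooling), packaging the results by hand could look like:

tar czf io500-submission.tgz <result-directory> system-information.txt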

2. Are you sure that the cleanup phases of the C and Bash versions are the same? Space usage on the Lustre FS was not purged as it was in the case of the C version.

Yes, the versions are the same, except that the bash scripts won't remove some files from IOR, while the app attempts to clean up everything.

abrekhov commented 4 years ago

Julian, thanks for the help! I still think there is no cleanup phase in the bash version, judging by the attached picture from our monitoring. I'll check it again in the next launches. (image attached) I'm closing the issue.

abrekhov commented 4 years ago

Is it possible to change this io500.sh line:

(( app_first >= 50 )) && $io500_mpirun $io500_mpiargs $PWD/io500 $io500_ini --timestamp $timestamp || app_rc=$?

to run the C version first? Because only part of the generated files are deleted during ior_easy.
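
For illustration only: assuming app_first is assigned from $RANDOM earlier in io500.sh (a guess at the script's internals, not something confirmed here), pinning it before that line would force the C app to run first:

# hypothetical override in io500.sh, placed after app_first is normally assigned
app_first=100   # always satisfies (( app_first >= 50 )), so the C app runs first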

JulianKunkel commented 4 years ago

If storage capacity is an issue: would it resolve the situation if you add rm -rf $workdir-scr to delete the data of the script version?

abrekhov commented 4 years ago


After the ior_easy phase there are a lot of files, and manually deleting them with rm -rf takes a lot of time. I've read articles like this one: https://unix.stackexchange.com/questions/289281/removing-dirs-with-lots-of-tiny-files-on-lustre/542870. If I'm not mistaken, of course.
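
For what it's worth, the approach in that article boils down to parallelising the unlinks instead of running a single serial rm -rf; a rough sketch (the directory path and the parallelism level are assumptions for illustration) would be:

# list files with lfs find (cheaper than GNU find on Lustre) and delete them in parallel;
# plain newline-separated output is fine here because the benchmark filenames contain no whitespace
lfs find $workdir-scr/ior_easy -type f | xargs -P 16 -n 1000 rm -f
# finally remove the emptied directory tree
rm -rf $workdir-scr/ior_easy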

JulianKunkel commented 4 years ago

Basically, you recreate the file system after the benchmarks are run to speed up the execution? I just want to know. It would be best if you send an email to board@io500.org with that statement; that helps most!
