CAST can enhance the system management of cluster-wide resources. It consists of the open source tools: cluster system management (CSM) and burst buffer.
Eclipse Public License 1.0

Prefer xcat name if available over hostname #988

Closed meahoibm closed 3 years ago

meahoibm commented 3 years ago

Unit tested on f5n05 and ln01. The code change affects how bbcmd runs on the compute node.

The previous "no data" errors no longer occur.

Materials from a bsub run are in: /ESS/gpfst/meaho/workdir/30925

```
$ ls -l
total 70
-rw-r--r-- 1 root  root  6563 Nov 12 20:07 30925-cat.txt
-rw------- 1 meaho meaho 7446 Nov 12 20:04 30925.env
-rw-rw-r-- 1 meaho meaho  167 Nov 12 20:06 30925.err
-rw------- 1 meaho meaho 1207 Nov 12 20:05 30925.log
-rw-rw-r-- 1 meaho meaho 2065 Nov 12 20:06 30925.out
drwxr-xr-x 2 root  root  4096 Nov 13 09:04 bbcmd-fn05
drwxr-xr-x 2 root  root  4096 Nov 13 08:45 bbcmd-ln01
-rw-r--r-- 1 root  root  5024 Nov 13 08:54 bb_stagein-30925.log
-rw-r--r-- 1 root  root  3793 Nov 13 08:54 bb_stageout-30925.log
-rw-rw-r-- 1 meaho meaho 5008 Nov 13 09:07 bhist-l-30925.txt
```

meahoibm commented 3 years ago

Looking at some new errors:

Job 30928: cmd: /opt/ibm//bb/bin/bbcmd --jobstepid=1 --target=0- getfileinfo
Job 30928: json: {"id":"1","rc":"-2","0":{"id":"1","rc":"-2","in":{"apicall":"Coral_SetVar","misc":{"uid":"0","gid":"0"}},"f5n05_pvt_pok_stglabs_ibm_com:bb_api559588":{"breadcrumbs":{"bbproxy":{"msgin_setvar":{"exit":{"count":"1","ts":"1605276518.062115"}}}}},"error":{"text":"Connection closed waiting for the reply","func":"BB_GetFileInfo","line":"1499","sourcefile":"\/home\/build\/bb\/src\/"}},"goodcount":"0","failcount":"1","voidcount":"0","error":{"firstFailRank":"0","firstFailNode":"f5n05","command":"getfileinfo^--bbid^30928^--envs^BBPATH=\/mnt\/bb_4a8b735e2c3112caf29a1f41ee65cfad^--jobstepid=1^--csmcommand=f5n05:0","text":"Connection closed waiting for the reply"}}
Job 30928:
Job 30928: rc = -2
Job 30928: Command failure. rc=-2
Job 30928:
Job 30928: cmd: /opt/ibm//bb/bin/bbcmd --jobstepid=0 --target=0 gettransfers --numhandles=0 --match=BBNOTSTARTED,BBINPROGRESS,BBPARTIALSUCCESS
Job 30928: json: {"id":"1","rc":"-1","error":{"csm_stderrgrabrc":"-11","csm_hostlist":"f5n05","csm_rc":"-1","command":"gettransfers^--bbid^30928^--envs^BBPATH=\/mnt\/bb_4a8b735e2c3112caf29a1f41ee65cfad^--jobstepid=0^--matchstatus=BBNOTSTARTED,BBINPROGRESS,BBPARTIALSUCCESS^--numhandles=0^--csmcommand=f5n05:0","text":"no data from node"},"goodcount":"0","failcount":"0","voidcount":"1"}
Job 30928:
Job 30928: rc = -1
Job 30928: Command failure. rc=-1
Job 30928: cmd: /ESS/gpfst/IST_LSF/ -d 'BB: Stage-in admin script completed' -i 120 30928
Job 30928: command rc: 0
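
For triage, the JSON replies in the log above can be reduced to their return code, counters, and top-level error text. The following is a minimal sketch, assuming every reply has the shape seen in these two samples (string-valued `rc`, `goodcount`, `failcount`, `voidcount`, and an optional top-level `error` object). The helper name `summarize_bbcmd_result` is hypothetical and not part of CAST, and the sample string is an abbreviated copy of the `gettransfers` reply.

```python
import json


def summarize_bbcmd_result(raw: str) -> dict:
    """Reduce one bbcmd JSON reply to the fields useful for triage.

    Hypothetical helper: field names are taken from the log samples,
    not from any documented bbcmd schema.
    """
    reply = json.loads(raw)
    error = reply.get("error", {})
    return {
        # Counters and rc arrive as strings in the samples, so convert them.
        "rc": int(reply.get("rc", 0)),
        "goodcount": int(reply.get("goodcount", 0)),
        "failcount": int(reply.get("failcount", 0)),
        "voidcount": int(reply.get("voidcount", 0)),
        # firstFailNode is only present in some replies.
        "first_fail_node": error.get("firstFailNode"),
        "text": error.get("text"),
    }


# Abbreviated copy of the gettransfers reply for job 30928.
sample = (
    '{"id":"1","rc":"-1",'
    '"error":{"csm_rc":"-1","csm_hostlist":"f5n05","text":"no data from node"},'
    '"goodcount":"0","failcount":"0","voidcount":"1"}'
)

summary = summarize_bbcmd_result(sample)
print(summary)
```

In this sample the node is counted under `voidcount` rather than `failcount`, which is how the "no data from node" case differs from the `getfileinfo` failure, where `failcount` is 1.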