IBM / CAST

CAST can enhance the system management of cluster-wide resources. It consists of two open-source tools: Cluster System Management (CSM) and Burst Buffer.
Eclipse Public License 1.0

Can't connect to bbproxy when calling BB_InitLibrary() in post-stage #1002

Closed tonyhutter closed 3 years ago

tonyhutter commented 3 years ago

Describe the bug
I want to check the BB transfer status via the BB API in a 2nd-phase post-stage script; basically, I want to verify that the BB transfers I launched completed successfully. When I call BB_InitLibrary() on the post-stage node, I get this error:

{"id":"1","rc":"-2","env":{"configfile":"\/etc\/ibm\/bb.cfg"},"error":{"text":"Unable to create bb.proxy connection","func":"BB_InitLibrary","line":"238","sourcefile":"\/u\/tgooding\/workspace\/castbuild\/bb\/src\/bbapi.cc"}}

I tried using the same contribId I used when I started the transfer, but got the error. I also tried contribId=0 and contribId=999999999 (UNDEFINED_CONTRIBID), but got the same error each time.

I see that I can run bbcmd gettransfers --target=0 --matchstatus=BBALL in post-stage and get the transfers. However, that code path sets bb.api.noproxyinit in the config, and maybe that makes a difference. I also see that bbcmd sets its contribId to UNDEFINED_CONTRIBID, which isn't exported in the BB API.

So my question is: how do I use the BB API to get the transfer statuses in post-stage?

To Reproduce
Call BB_InitLibrary() on the post-stage node.

Expected behavior
I expect BB_InitLibrary() to connect to the BB server in post-stage.


Environment:

$ rpm -q --whatprovides `which bbcmd`
ibm-burstbuffer-1.7.2-3094.ppc64le

tgooding commented 3 years ago

The C API is only available directly on the compute nodes (e.g., as part of the MPI binary). During post-stage (and pre-stage), CSM will prevent access to those nodes as they might be used by a different job (e.g., as we're staging out, another job can be using the CPU/GPUs).

tonyhutter commented 3 years ago

Ok, thanks for the answer. Ultimately what we want to do is start a checkpoint transfer at the end of a user's allocated time, and then have the poststage script verify the transfers were done successfully. It sounds like we can't do that using the BB C API for the reasons you mentioned.

As an alternative, could we simply check the final size of the destination file, and if it's what we expected, then we know the transfer was good? Or would there ever be a case where the destination filesize was correct, but the BB API hadn't finished transferring all the bits yet (because maybe it was preallocating the file or something), or hit some error such that some of the bytes were wrong? Or does the BB API only update the file size after it has correctly received and written the bytes?

tgooding commented 3 years ago

You should be able to initiate a transfer in the post-stage script (1st phase) through bbcmd. The filelist can be pre-populated by the application on the SSD such that you can blindly initiate a final transfer and check success/fail in the 2nd phase.

I would not rely on the destination file size as seen by GPFS; there is strong potential for a data-integrity problem, either from the file pre-existing or from assumptions about how bbServer writes data. (And bbServer today can certainly write blocks out of order, so it's possible the final block was written relatively early.)

tonyhutter commented 3 years ago

> You should be able to initiate a transfer in the post-stage script (1st phase) through bbcmd. The filelist can be pre-populated by the application on the SSD such that you can blindly initiate a final transfer and check success/fail in the 2nd phase.

I don't think launching the transfer is the issue; we've been able to do that successfully from the BB C API. It's checking afterwards, in 2nd-phase post-stage, that the transfer finished correctly that's the issue. Since bbcmd can check the transfer status in 2nd-phase post-stage, and it uses the BB C API, it would seem that our program (https://github.com/ECP-VeloC/AXL/blob/master/test/axl_cp.c), which also calls the BB C API, should be able to do so too. Basically, I don't understand how bbcmd is able to get the transfer status in 2nd-phase post-stage when "The C API is only available directly on the compute nodes".

> I would not rely on destination file size as seen by GPFS. Strong potential for a data integrity problem. Either by the file pre-existing or making assumptions about how bbServer writes data. (and certainly bbServer today can write blocks out-of-order, so its possible that the final block was written relatively early)

Doesn't the 2nd stage only get called after the transfer completes/aborts? So by definition, bbServer should not be writing the file at the time the 2nd post-stage is called? Or is there a chance it could change the contents of the file during/after 2nd post-stage? I ask, because I wonder if we could just do a checksum of the destination file in 2nd post-stage to verify it's correct?

adammoody commented 3 years ago

Let's not try to checksum the files as a "transfer complete" check. That will be performance suicide at scale.

tgooding commented 3 years ago

bbcmd uses the CSM API (csm_bb_cmd()) to start an authorized executable on the compute node and retrieve its output. The CSM design is that only the running job on the CPUs has permission to start arbitrary executables (via jsrun/mpirun/ssh) on the compute nodes. Nodes are cleaned up between jobs to ensure there are no straggler processes, files, etc. And any access outside that phase requires authorized processes.

So the flow would be: on the launch node, bbcmd processes inputs, gathers a bit of info on allocation IDs, bundles them, and then calls the CSM API (or, in non-CSM environments, uses ssh). This results in the /opt/ibm/bb/bin/bbcmd executable getting started on the compute-node side, which can then use the C API. The thinking (pre-AXL) was that users would be writing their staging scripts much like scheduler run scripts, so already using perl/python/bash, which have good JSON-processing libraries. I don't think that precludes using popen() within AXL to accomplish the same goal.

> Doesn't the 2nd stage only get called after the transfer completes/aborts?

Correct. I thought you meant using file size to determine whether you needed to transfer or not. I think I see what you were intending now. At 2nd stage, querying the transfer status to determine the final transfer status would be sufficient.

tonyhutter commented 3 years ago

@tgooding thanks for the info. It sounds like we'll have AXL spawn bbcmd and scrape the output. That answers my questions; closing the bug.