flux-framework / dyad

DYAD: DYnamic and Asynchronous Data Streamliner
GNU Lesser General Public License v3.0
7 stars 5 forks source link

Adds a Data Transport Layer to DYAD to support different ways of transferring data #24

Closed ilumsden closed 12 months ago

ilumsden commented 1 year ago

This PR adds a new Data Transport Layer (DTL) to the DYAD module and DYAD Core library to allow us to support different ways to transfer data (i.e., transfer backends) between producer and consumer. Through this DTL, this PR also adds a new UCX-based transfer backend.

To set the backends, users will set the new DYAD_DTL_MODE environment variable. This variable must be set to the same value for both the clients (i.e., user applications that use the C or C++ APIs) and the DYAD module. Valid values are:

The code for the DTL can be found in src/core/dtl (for DYAD Core's side) and src/modules/dtl (for the DYAD module).

This PR also adds a couple of QOL improvements. Most notably, it adds a new dyad_init_env function to the Core library. This function will initialize the DYAD context (dyad_ctx_t) using the environment variables defined in src/core/dyad_env.h. By providing this function in Core, we can define a baseline environment variable-based initialization for all APIs.

ilumsden commented 1 year ago

@JaeseungYeom I've removed the last bit of excess logging and retested. So, this PR is now ready-for-review.

ilumsden commented 1 year ago

Thanks to some discussion with @grondo, we identified that a lot of symbols are being exported from DYAD's libraries that shouldn't be exported. I am in the process of correcting this.

One question related to symbol exporting @JaeseungYeom: what do we want to do about dyad_sync_directory? That function is conditionally compiled, and, as far as I know, there's no way to tell libtool to conditionally export a symbol.

JaeseungYeom commented 1 year ago

I think you can treat dyad_sync_directory similarly to other DAYD internal functions.

ilumsden commented 1 year ago

Notes from review:

JaeseungYeom commented 1 year ago
  1. Send ACK RPC response to validate that RPC correctly reached UCX send
  2. Add timeout to wait for UCX send/recv
  3. Check how timeouts work in UCX. If I can't find anything, at least add a TODO

If 3 is available and 2 is working, do we still need 1?

hariharan-devarajan commented 1 year ago
  1. Send ACK RPC response to validate that RPC correctly reached UCX send
  2. Add timeout to wait for UCX send/recv
  3. Check how timeouts work in UCX. If I can't find anything, at least add a TODO

If 3 is available and 2 is working, do we still need 1?

I believe they give different errors:

  1. identifies whether we were able to setup ucx endpoint for now doing data transfer.
  2. and 3. to me are ways to know if ucx transfer failed. I think if ucx has timeouts we should use that (that is 3) if not we should build some simple version of that (that is 2)
JaeseungYeom commented 1 year ago

I agree with that trying to see if option 3 is available first and then do option 2. For 1, it sounds to me it does not add any other functionality if either 2 or 3 works. If endpoint setup fails, would the entire DYAD shutdown or do we expect to retry? If the latter, I see it could be useful.

hariharan-devarajan commented 1 year ago

The only benefit is detection. Hanging errors are hard to detect. And this would lead to one such case.

JaeseungYeom commented 1 year ago

Don't forget to run clang formatter