Adds a Data Transport Layer to DYAD to support different ways of transferring data

ilumsden commented 1 year ago

This PR adds a new Data Transport Layer (DTL) to the DYAD module and DYAD Core library to allow us to support different ways to transfer data (i.e., transfer backends) between producer and consumer. Through this DTL, this PR also adds a new UCX-based transfer backend.

To set the backends, users will set the new DYAD_DTL_MODE environment variable. This variable must be set to the same value for both the clients (i.e., user applications that use the C or C++ APIs) and the DYAD module. Valid values are:

FLUX_RPC (~default~): uses Flux's RPC framework to transfer data. This is the same as before
UCX (default): uses UCX to transfer data

The code for the DTL can be found in src/core/dtl (for DYAD Core's side) and src/modules/dtl (for the DYAD module).

This PR also adds a couple of QOL improvements. Most notably, it adds a new dyad_init_env function to the Core library. This function will initialize the DYAD context (dyad_ctx_t) using the environment variables defined in src/core/dyad_env.h. By providing this function in Core, we can define a baseline environment variable-based initialization for all APIs.

ilumsden commented 1 year ago

@JaeseungYeom I've removed the last bit of excess logging and retested. So, this PR is now ready-for-review.

ilumsden commented 1 year ago

Thanks to some discussion with @grondo, we identified that a lot of symbols are being exported from DYAD's libraries that shouldn't be exported. I am in the process of correcting this.

One question related to symbol exporting @JaeseungYeom: what do we want to do about dyad_sync_directory? That function is conditionally compiled, and, as far as I know, there's no way to tell libtool to conditionally export a symbol.

JaeseungYeom commented 1 year ago

I think you can treat dyad_sync_directory similarly to other DAYD internal functions.

ilumsden commented 1 year ago

Notes from review:

[ ] Refactor DTL interfaces to use struct + factory method
[ ] Merge two DTL libraries into one
[ ] Send ACK RPC response to validate that RPC correctly reached UCX send
[ ] Add timeout to wait for UCX send/recv
[ ] Add delay of 10 us (i.e., usleep) if ucp_worker_progress takes less than 10 us
[ ] Check how timeouts work in UCX. If I can't find anything, at least add a TODO
[ ] Cache endpoints w/ LRU replacement (new PR for this)
[ ] Move tag creation to broker, and make tag (broker_rank << 32) | counter (counter is unique ID in broker)
[ ] Add parens around UCX_CHECK and arg to UCX_CHECK

JaeseungYeom commented 1 year ago

Send ACK RPC response to validate that RPC correctly reached UCX send
Add timeout to wait for UCX send/recv
Check how timeouts work in UCX. If I can't find anything, at least add a TODO

If 3 is available and 2 is working, do we still need 1?

hariharan-devarajan commented 1 year ago

Send ACK RPC response to validate that RPC correctly reached UCX send

Add timeout to wait for UCX send/recv

Check how timeouts work in UCX. If I can't find anything, at least add a TODO

If 3 is available and 2 is working, do we still need 1?

I believe they give different errors:

identifies whether we were able to setup ucx endpoint for now doing data transfer.
and 3. to me are ways to know if ucx transfer failed. I think if ucx has timeouts we should use that (that is 3) if not we should build some simple version of that (that is 2)

JaeseungYeom commented 1 year ago

I agree with that trying to see if option 3 is available first and then do option 2. For 1, it sounds to me it does not add any other functionality if either 2 or 3 works. If endpoint setup fails, would the entire DYAD shutdown or do we expect to retry? If the latter, I see it could be useful.

hariharan-devarajan commented 1 year ago

The only benefit is detection. Hanging errors are hard to detect. And this would lead to one such case.

JaeseungYeom commented 1 year ago

Don't forget to run clang formatter

flux-framework / dyad

Adds a Data Transport Layer to DYAD to support different ways of transferring data #24