flux-framework / dyad

DYAD: DYnamic and Asynchronous Data Streamliner
GNU Lesser General Public License v3.0

Add delay to ucp_worker_progress to prevent issues #32

Open ilumsden opened 1 year ago

ilumsden commented 1 year ago

As @hariharan-devarajan mentioned, ucp_worker_progress may cause issues if it runs too fast. To prevent any possible issues, we should (1) check how long ucp_worker_progress runs and (2) add a usleep of ~10us if it runs too fast.
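A rough sketch of the two steps above, in case it helps the discussion. It times one progress call with `clock_gettime` and sleeps only when the call returned faster than a threshold. Everything named here is hypothetical: `throttled_progress`, `progress_fn`, `min_ns`, and the `fake_progress` stub are illustrative and not part of DYAD or UCX. The threshold for "too fast" is an assumption, since the issue does not define one. The sketch also uses `nanosleep` rather than `usleep`, because `usleep` was removed from POSIX.1-2008. A real caller would pass a small wrapper around `ucp_worker_progress(worker)`, which likewise returns an `unsigned`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime and nanosleep */
#endif

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback type with the same shape as a progress call:
 * returns non-zero if any work was progressed. */
typedef unsigned (*progress_fn)(void *ctx);

/* Nanoseconds between two CLOCK_MONOTONIC readings. */
static int64_t elapsed_ns(const struct timespec *start,
                          const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
           + (int64_t)(end->tv_nsec - start->tv_nsec);
}

/* Run one progress call. If it returned in less than min_ns nanoseconds
 * (assumed meaning of "too fast"), sleep for sleep_us microseconds.
 * If slept is non-NULL, it is set to 1 when a sleep happened, else 0. */
static unsigned throttled_progress(progress_fn progress, void *ctx,
                                   int64_t min_ns, long sleep_us, int *slept)
{
    struct timespec start, end;
    unsigned made_progress;
    int did_sleep = 0;

    assert(progress != NULL);

    clock_gettime(CLOCK_MONOTONIC, &start);
    made_progress = progress(ctx);
    clock_gettime(CLOCK_MONOTONIC, &end);

    if (elapsed_ns(&start, &end) < min_ns) {
        struct timespec delay;
        delay.tv_sec = sleep_us / 1000000L;
        delay.tv_nsec = (sleep_us % 1000000L) * 1000L;
        nanosleep(&delay, NULL);
        did_sleep = 1;
    }
    if (slept != NULL) {
        *slept = did_sleep;
    }
    return made_progress;
}

/* Stand-in for the real progress call: counts invocations and
 * reports that no work was done. */
static unsigned fake_progress(void *ctx)
{
    int *calls = ctx;
    (*calls)++;
    return 0;
}
```

Measuring first and sleeping only below the threshold keeps the delay off the path where the progress call is already doing real work.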

hariharan-devarajan commented 1 year ago

I found how they are doing timeouts. Look for get_deadline in file

Also, from a quick search of the repo, I see they do have a timeout error, and some structures use a timeout. It's very weirdly documented, like you said.

  1. endpoint https://github.com/openucx/ucx/blob/1004670c50c55c1c526a58fd2586853e4a21c779/src/ucs/type/status.c#L76
  2. on event API https://github.com/openucx/ucx/blob/1004670c50c55c1c526a58fd2586853e4a21c779/src/ucs/sys/event_set.h#L124C14-L124C32
  3. on uct_rdmacm_cm_config https://github.com/openucx/ucx/blob/1004670c50c55c1c526a58fd2586853e4a21c779/src/uct/ib/rdmacm/rdmacm_cm.h#L53C16-L53C36
  4. RPC send test https://github.com/openucx/ucx/blob/1004670c50c55c1c526a58fd2586853e4a21c779/src/tools/perf/perftest_mad.c#L93
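For reference, here is a generic deadline-style wait loop of the kind the name `get_deadline` suggests. This is not the UCX code and does not use any UCX API. `deadline_after_ms`, `deadline_passed`, `wait_with_deadline`, `poll_fn`, and the `countdown_poll` stub are all hypothetical names, and the 10 us pause between polls is an assumption borrowed from the delay proposed in this issue.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime and nanosleep */
#endif

#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical poll callback: returns non-zero once the operation is done. */
typedef int (*poll_fn)(void *ctx);

/* Absolute CLOCK_MONOTONIC time that lies timeout_ms from now. */
static struct timespec deadline_after_ms(long timeout_ms)
{
    struct timespec d;
    clock_gettime(CLOCK_MONOTONIC, &d);
    d.tv_sec += timeout_ms / 1000;
    d.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (d.tv_nsec >= 1000000000L) { /* carry into seconds */
        d.tv_sec += 1;
        d.tv_nsec -= 1000000000L;
    }
    return d;
}

/* Non-zero if the current time is at or past the deadline. */
static int deadline_passed(const struct timespec *d)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return now.tv_sec > d->tv_sec
           || (now.tv_sec == d->tv_sec && now.tv_nsec >= d->tv_nsec);
}

/* Poll until done or until timeout_ms has elapsed.
 * Returns 0 on completion and -1 on timeout. */
static int wait_with_deadline(poll_fn poll, void *ctx, long timeout_ms)
{
    struct timespec deadline;
    struct timespec pause;

    assert(poll != NULL);
    deadline = deadline_after_ms(timeout_ms);
    pause.tv_sec = 0;
    pause.tv_nsec = 10000L; /* 10 us between polls (assumed value) */

    for (;;) {
        if (poll(ctx)) {
            return 0;
        }
        if (deadline_passed(&deadline)) {
            return -1;
        }
        nanosleep(&pause, NULL);
    }
}

/* Stand-in poll: reports "done" after *ctx further calls. */
static int countdown_poll(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return 0;
    }
    return 1;
}
```

Computing the deadline once as an absolute monotonic time means the loop is unaffected by wall-clock adjustments, and the timeout does not drift with the number of polls.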