NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[BUG] UCX issue for Multi-GPU criteo/DLRM #358

Open PerkzZheng opened 3 years ago

PerkzZheng commented 3 years ago

Describe the bug

UCX errors occur when trying to reproduce the multi-GPU Criteo/DLRM script in NVTabular. The script is /nvtabular/examples/dask-nvtabular-criteo-benchmark.py. What could be causing these errors?

Steps/Code to reproduce bug

python3 /nvtabular/examples/dask-nvtabular-criteo-benchmark.py --data-path '/workdir/NVT-Dataset-parquet' --out-path '/workdir/NVT-dask/' --freq-limit 6 --device-pool-frac 0.9 --out-files-per-proc 8 --devices "0,1,2,3,4,5,6,7" -p "ucx"

The data-input directory has three parquet files (day_0.parquet, day_1.parquet, day_2.parquet)

Environment details (please complete the following information):

stderr output


opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cuda-0+untagged.1.g712364e-py3.7.egg/dask_cuda/local_cuda_cluster.py:185: UserWarning: When using NVLink we recommend setting a `rmm_pool_size`. Please see: https://dask-cuda.readthedocs.io/en/latest/ucx.html#important-notes for more details

Dask-NVTabular DLRM/Criteo benchmark
--------------------------------------
partition size     | 2118189056
protocol           | ucx
device(s)          | 0,1,2,3,4,5,6,7
rmm-pool-frac      | 0.8
out-files-per-proc | 8
shuffle            | PER_PARTITION
cats-on-device     | False
======================================
Runtime[s]         | 13.352912664413452
======================================

[1603176178.854886] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor 
[1603176178.855061] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176178.855236] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176178.855367] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor 
[1603176178.855483] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176178.855593] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176178.855701] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176178.855808] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176179.863101] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor 
[1603176179.863250] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176179.863396] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176179.863477] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176179.863554] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176179.863634] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176179.863712] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
[1603176179.863790] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor 
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status
ucp.exceptions.UCXError: <[Send #352] ep: 0x7f12e58ae1f8, tag: 0x1753416861fb0ee9, nbytes: 16, type: <class 'bytes'>>: Input/output error
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status
ucp.exceptions.UCXError: <[Send #397] ep: 0x7f12e58ae168, tag: 0x7f4b6d5e5e6699d6, nbytes: 16, type: <class 'bytes'>>: Input/output error
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status
ucp.exceptions.UCXError: <[Send #385] ep: 0x7f12e58ae0d8, tag: 0x5802896346af26fa, nbytes: 16, type: <class 'bytes'>>: Input/output error
tornado.application - ERROR - Multiple exceptions in yield list 
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status
ucp.exceptions.UCXError: <[Send #376] ep: 0x7f12e58ae240, tag: 0x4427a49ae3d32785, nbytes: 16, type: <class 'bytes'>>: Input/output error
tornado.application - ERROR - Multiple exceptions in yield list 
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status
ucp.exceptions.UCXError: <[Send #379] ep: 0x7f12e58ae288, tag: 0x14bc813b3438e64, nbytes: 16, type: <class 'bytes'>>: Input/output error
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status
ucp.exceptions.UCXError: <[Send #382] ep: 0x7f12e58ae2d0, tag: 0x27859edb8eb0f690, nbytes: 16, type: <class 'bytes'>>: Input/output error
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status
ucp.exceptions.UCXError: <[Send #367] ep: 0x7f12e58ae1b0, tag: 0x73a9df43c93c1268, nbytes: 16, type: <class 'bytes'>>: Input/output error
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/deploy/spec.py", line 641, in close_clusters
    cluster.close(timeout=10)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 92, in close
    return self.sync(self._close, callback_timeout=timeout)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 171, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 339, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 323, in f
    result[0] = yield future
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/deploy/spec.py", line 411, in _close
    await self.scheduler.close()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/scheduler.py", line 1583, in close
    await super(Scheduler, self).close()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 631, in close
    yield [comm.close() for comm in list(self._comms)]  # then forcefully close
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/ucx.py", line 299, in close
    await self.ep.send(struct.pack("?Q", True, 0))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/endpoint_reuse.py", line 102, in send
    await self.handle.ep.send(buffer, nbytes=nbytes, tag=self.tag)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/core.py", line 583, in send
    return await comm.tag_send(self._ep, buffer, nbytes, tag, name=log)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 44, in tag_send
    event_loop, ucx_api.tag_send_nb, ep, buffer, nbytes, tag, name=name
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/ucp/comm.py", line 28, in _call_ucx_api
    req = func(*args, **kwargs)
  File "ucp/_libs/ucx_api.pyx", line 738, in ucp._libs.ucx_api.tag_send_nb
  File "ucp/_libs/ucx_api.pyx", line 620, in ucp._libs.ucx_api._handle_status 
ucp.exceptions.UCXError: <[Send #397] ep: 0x7f12e58ae120, tag: 0x1998467c13590b94, nbytes: 16, type: <class 'bytes'>>: Input/output error
[1603176179.878076] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f6b7f83c0 was not returned to mpool ucp_am_bufs
[1603176179.878089] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f6cff8540 was not returned to mpool ucp_am_bufs
[1603176179.878109] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f6dff8640 was not returned to mpool ucp_am_bufs
[1603176179.878112] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f6e7f86c0 was not returned to mpool ucp_am_bufs
[1603176179.878116] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f727f8ac0 was not returned to mpool ucp_am_bufs
[1603176179.878316] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f8affb340 was not returned to mpool ucp_am_bufs
[1603176179.878342] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f8b7fb3c0 was not returned to mpool ucp_am_bufs
[1603176179.878347] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f8bffb440 was not returned to mpool ucp_am_bufs
[1603176179.878369] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f8c7fb4c0 was not returned to mpool ucp_am_bufs
[1603176179.878373] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f8cffb540 was not returned to mpool ucp_am_bufs
[1603176179.878379] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f8effb740 was not returned to mpool ucp_am_bufs
[1603176179.878385] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f9f7fc7c0 was not returned to mpool ucp_am_bufs
[1603176179.878390] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0fa17fc9c0 was not returned to mpool ucp_am_bufs
[1603176179.878399] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0fa8ffd140 was not returned to mpool ucp_am_bufs
[1603176179.878410] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0fa97fd1c0 was not returned to mpool ucp_am_bufs
[1603176179.878425] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0fb2ffdb40 was not returned to mpool ucp_am_bufs
[1603176179.878665] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f10447fbcc0 was not returned to mpool ucp_am_bufs
[1603176179.878679] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f10457fbdc0 was not returned to mpool ucp_am_bufs
[1603176179.878691] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1046ffbf40 was not returned to mpool ucp_am_bufs
[1603176179.878706] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f10487fc0c0 was not returned to mpool ucp_am_bufs
[1603176179.878720] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1048ffc140 was not returned to mpool ucp_am_bufs
[1603176179.878727] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1049ffc240 was not returned to mpool ucp_am_bufs
[1603176179.878735] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f104c7fc4c0 was not returned to mpool ucp_am_bufs
[1603176179.878742] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f104cffc540 was not returned to mpool ucp_am_bufs
[1603176179.878747] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f104d7fc5c0 was not returned to mpool ucp_am_bufs
[1603176179.878754] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f104dffc640 was not returned to mpool ucp_am_bufs
[1603176179.878760] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f104effc740 was not returned to mpool ucp_am_bufs
[1603176179.878768] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f104f7fc7c0 was not returned to mpool ucp_am_bufs
[1603176179.878775] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f104fffc840 was not returned to mpool ucp_am_bufs
[1603176179.878782] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1051ffca40 was not returned to mpool ucp_am_bufs
[1603176179.878789] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f10527fcac0 was not returned to mpool ucp_am_bufs
[1603176179.878796] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1052ffcb40 was not returned to mpool ucp_am_bufs
[1603176179.878804] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f10537fcbc0 was not returned to mpool ucp_am_bufs
[1603176179.878810] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1053ffcc40 was not returned to mpool ucp_am_bufs
[1603176179.878817] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f10547fccc0 was not returned to mpool ucp_am_bufs
[1603176179.878824] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1055ffce40 was not returned to mpool ucp_am_bufs
[1603176179.878832] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f1056ffcf40 was not returned to mpool ucp_am_bufs
[1603176179.878840] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f10577fcfc0 was not returned to mpool ucp_am_bufs
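
A note on reading the tracebacks above: every failing call is `self.ep.send(struct.pack("?Q", True, 0))` inside `close` in distributed/comm/ucx.py, and every `UCXError` reports `nbytes: 16`. The sketch below uses only the Python standard library to show why that payload comes out at 16 bytes on a typical 64-bit Linux build. This suggests the failing sends are the small messages issued while comms are being closed, not benchmark data. The variable names are illustrative, and treating the message as a shutdown signal is an inference from the traceback, not a confirmed diagnosis.

```python
import struct

# Same format string and values as the call shown in the traceback:
# a bool flag followed by an unsigned 64-bit integer, using native
# size and alignment (no byte-order prefix in the format string).
close_msg = struct.pack("?Q", True, 0)

# Native alignment pads the 1-byte bool so the 8-byte integer starts
# on its natural boundary (typically giving 16 bytes on 64-bit Linux).
native_size = struct.calcsize("?Q")

# With the "=" prefix (standard sizes, no alignment) there is no padding.
unpadded_size = struct.calcsize("=?Q")

print("payload bytes:", len(close_msg))
print("native size:  ", native_size)
print("unpadded size:", unpadded_size)
```

If the payload size printed here matches the `nbytes` value in the errors, the failures are confined to the teardown path, which is consistent with the benchmark having already printed its runtime before the first UCX error appears.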

PerkzZheng commented 3 years ago

I have also tried the original DLRM Criteo multi-GPU script in /examples, and it still leads to UCX errors when setting -p "ucx".

benfred commented 2 years ago

@PerkzZheng sorry we missed you on this one - can you test on the latest version, and if this is still an issue we'll dig in?

viswa-nvidia commented 2 years ago

@benfred should we still be tracking this?
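
For anyone retesting on a newer version, a long stderr dump like the one in this issue is easier to compare across runs once the repeated UCX lines are grouped. The following is a minimal sketch, assuming the UCX log lines keep the `[timestamp] [host:pid:tid] file:line UCX LEVEL message` layout seen above. The function name `summarize_ucx_log` and the regular expressions are hypothetical helpers and not part of UCX, UCX-Py, or NVTabular.

```python
import re
from collections import Counter

# Matches UCX log lines of the form seen in this issue, for example:
# [1603176178.854886] [host:16939:0]   sock.c:344  UCX  ERROR send(fd=-1) failed: ...
UCX_LINE = re.compile(
    r"^\[(?P<ts>\d+\.\d+)\]\s+\[(?P<proc>[^\]]+)\]\s+"
    r"(?P<src>\S+:\d+)\s+UCX\s+(?P<level>[A-Z]+)\s+(?P<msg>.*\S)\s*$"
)
# Pointer values differ on every line, so mask them before grouping.
HEX_ADDR = re.compile(r"0x[0-9a-fA-F]+")


def summarize_ucx_log(lines):
    """Count UCX log lines grouped by (level, source location, message)."""
    counts = Counter()
    for line in lines:
        match = UCX_LINE.match(line)
        if match is None:
            continue  # skip Python tracebacks and other non-UCX output
        msg = HEX_ADDR.sub("0x...", match.group("msg"))
        counts[(match.group("level"), match.group("src"), msg)] += 1
    return counts


# A few lines copied from the stderr output in this issue.
sample = [
    "[1603176178.854886] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor ",
    "[1603176178.855061] [9eae0e187b50:16939:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor",
    "[1603176179.878076] [9eae0e187b50:16939:1]          mpool.c:43   UCX  WARN  object 0x7f0f6b7f83c0 was not returned to mpool ucp_am_bufs",
    "tornado.application - ERROR - Multiple exceptions in yield list",
]

counts = summarize_ucx_log(sample)
for (level, src, msg), n in counts.most_common():
    print(f"{n:4d}  {level:5s}  {src:12s}  {msg}")
```

To run it over a full log, pass an open file object instead of `sample`, since the function only needs an iterable of lines.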