facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License

Use torchcompat to work on other devices #384

Delaunay opened this issue 3 months ago

Delaunay commented 3 months ago

The idea is to replace all mentions of torch.cuda with torchcompat.core, which mirrors torch.cuda across multiple device backends (CUDA, XPU, HPU).
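A minimal sketch of how such a swap could work, assuming torchcompat.core really is a drop-in mirror of torch.cuda (the `load_accelerator` helper and its fallback order are hypothetical, not part of dlrm or torchcompat):

```python
import importlib


def load_accelerator(candidates=("torchcompat.core", "torch.cuda")):
    """Return the first importable torch.cuda-like module, else None.

    Call sites that previously used torch.cuda directly can keep the same
    attribute calls (e.g. device_count, is_available) on the returned module,
    assuming the backend module mirrors the torch.cuda interface.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # backend not installed; try the next candidate
    return None
```

With this pattern, the benchmark code would reference the returned module instead of hard-coding torch.cuda, so the same script can run on CUDA, XPU, or HPU installs.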

Delaunay commented 3 months ago

It seems some primitives are not implemented for HPUs:

dlrm.0 AttributeError: module 'torch._C' has no attribute '_broadcast_coalesced'
dlrm.0 [stderr] Traceback (most recent call last):
dlrm.0 [stderr]   File "/home/sdp/results/venv/torch/bin/voir", line 8, in <module>
dlrm.0 [stderr]     sys.exit(main())
dlrm.0 [stderr]   File "/home/sdp/voir/voir/cli.py", line 124, in main
dlrm.0 [stderr]     ov(sys.argv[1:] if argv is None else argv)
dlrm.0 [stderr]   File "/home/sdp/voir/voir/phase.py", line 331, in __call__
dlrm.0 [stderr]     self._run(*args, **kwargs)
dlrm.0 [stderr]   File "/home/sdp/voir/voir/overseer.py", line 242, in _run
dlrm.0 [stderr]     set_value(func())
dlrm.0 [stderr]   File "/home/sdp/voir/voir/scriptutils.py", line 37, in <lambda>
dlrm.0 [stderr]     return lambda: exec(mainsection, glb, glb)
dlrm.0 [stderr]   File "/home/sdp/milabench/benchmarks/dlrm/dlrm/dlrm_s_pytorch.py", line 1911, in <module>
dlrm.0 [stderr]     run()
dlrm.0 [stderr]   File "/home/sdp/milabench/benchmarks/dlrm/dlrm/dlrm_s_pytorch.py", line 1579, in run
dlrm.0 [stderr]     Z = dlrm_wrap(
dlrm.0 [stderr]   File "/home/sdp/milabench/benchmarks/dlrm/dlrm/dlrm_s_pytorch.py", line 146, in dlrm_wrap
dlrm.0 [stderr]     return dlrm(X.to(device), lS_o, lS_i)
dlrm.0 [stderr]   File "/home/sdp/results/venv/torch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
dlrm.0 [stderr]     return self._call_impl(*args, **kwargs)
dlrm.0 [stderr]   File "/home/sdp/results/venv/torch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1523, in _call_impl
dlrm.0 [stderr]     return forward_call(*args, **kwargs)
dlrm.0 [stderr]   File "/home/sdp/milabench/benchmarks/dlrm/dlrm/dlrm_s_pytorch.py", line 530, in forward
dlrm.0 [stderr]     return self.parallel_forward(dense_x, lS_o, lS_i)
dlrm.0 [stderr]   File "/home/sdp/milabench/benchmarks/dlrm/dlrm/dlrm_s_pytorch.py", line 631, in parallel_forward
dlrm.0 [stderr]     self.bot_l_replicas = replicate(self.bot_l, device_ids)
dlrm.0 [stderr]   File "/home/sdp/results/venv/torch/lib/python3.10/site-packages/torch/nn/parallel/replicate.py", line 110, in replicate
dlrm.0 [stderr]     param_copies = _broadcast_coalesced_reshape(params, devices, detach)
dlrm.0 [stderr]   File "/home/sdp/results/venv/torch/lib/python3.10/site-packages/torch/nn/parallel/replicate.py", line 83, in _broadcast_coalesced_reshape
dlrm.0 [stderr]     tensor_copies = Broadcast.apply(devices, *tensors)
dlrm.0 [stderr]   File "/home/sdp/results/venv/torch/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
dlrm.0 [stderr]     return super().apply(*args, **kwargs)  # type: ignore[misc]
dlrm.0 [stderr]   File "/home/sdp/results/venv/torch/lib/python3.10/site-packages/torch/nn/parallel/_functions.py", line 23, in forward
dlrm.0 [stderr]     outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
dlrm.0 [stderr]   File "/home/sdp/results/venv/torch/lib/python3.10/site-packages/torch/nn/parallel/comm.py", line 57, in broadcast_coalesced
dlrm.0 [stderr]     return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
dlrm.0 [stderr] AttributeError: module 'torch._C' has no attribute '_broadcast_coalesced'
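The traceback shows that nn.parallel.replicate() bottoms out in torch._C._broadcast_coalesced, which this HPU build does not expose. One possible mitigation, sketched below (the `has_broadcast_coalesced` helper is hypothetical, not a tested fix), is to probe for the primitive and take a single-device forward path when it is missing:

```python
def has_broadcast_coalesced(torch_c_module):
    """Return True if the backend exposes the C primitive that
    nn.parallel.replicate() relies on for multi-device parameter copies.

    Pass in torch._C; on builds where the attribute is absent (as in the
    HPU traceback above), the caller can skip parallel_forward and run the
    plain single-device forward instead of raising AttributeError.
    """
    return hasattr(torch_c_module, "_broadcast_coalesced")
```

In dlrm_s_pytorch.py this check could gate the parallel_forward branch, so unsupported backends degrade gracefully rather than crash mid-run.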