facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License

Can't install torch rec on gcp #311

Closed zzh1024 closed 1 year ago

zzh1024 commented 1 year ago

Hi team, I saw that DLRM is benchmarked on GCP, and that benchmark relies on TorchRec. I have failed to install it with several different instructions, for example installing from the package or from source. I ran into problems like:

torchx 2022-12-10 02:43:51 INFO     Waiting for the app to finish...
test_installation/0 Traceback (most recent call last):
test_installation/0   File "/opt/conda/envs/py37/lib/python3.7/runpy.py", line 183, in _run_module_as_main
test_installation/0     mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
test_installation/0   File "/opt/conda/envs/py37/lib/python3.7/runpy.py", line 109, in _get_module_details
test_installation/0     __import__(pkg_name)
test_installation/0   File "/opt/conda/envs/py37/lib/python3.7/site-packages/torch/__init__.py", line 219, in <module>
test_installation/0     from torch._C import *  # noqa: F403
test_installation/0 ImportError: /opt/conda/envs/py37/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so: symbol cusparseSpSM_analysis version libcusparse.so.11 not defined in file libcusparse.so.11 with link time reference
torchx 2022-12-10 02:43:52 INFO     Job finished: 5
torchx 2022-12-10 02:43:52 ERROR    AppStatus:
  msg: <NONE>
  num_restarts: 0
  roles: []
  state: FAILED (5)
  structured_error_msg: <NONE>
  ui_url: file:///tmp/torchx_u_y9m5bi/torchx/test_installation-mwhd4nznnjs7f

Do we have instructions on how to install TorchRec on GCP?

When reserving a Google Cloud Vertex AI notebook, should I select the Python 3 notebook (CUDA Toolkit 11.0), or the Python-only one and install CUDA myself? (I ask because the CUDA Toolkit in TorchRec's install example is 11.6 or something else.)

Is there any preference for the OS, Debian or Ubuntu? Thanks in advance.

mnaumovfb commented 1 year ago

Hello @zzh1024, I was talking to @samiwilf regarding the ImportError, and his thought was that this error is related to some sort of mismatch between the PyTorch, TorchRec, and fbgemm versions. Let us know if you are using these requirements.txt or Dockerfile.
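To help narrow down a version mismatch like the one suspected above, here is a minimal sketch that lists the installed versions of the three packages so they can be pasted into the issue. It assumes Python 3.8+ (for `importlib.metadata`; the traceback above shows a Python 3.7 environment, where the `importlib_metadata` backport would be needed instead) and assumes the distributions are named `torch`, `torchrec`, and `fbgemm-gpu`. Nightly builds may be published under different names. The helper `installed_versions` is a hypothetical name, not part of any of these libraries.

```python
from importlib import metadata  # standard library in Python 3.8+

# Assumed distribution names; nightly builds may use different ones.
PACKAGES = ("torch", "torchrec", "fbgemm-gpu")


def installed_versions(packages=PACKAGES):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

This reads package metadata only and does not import `torch`, so it still works when `import torch` itself fails with a shared-library error.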

We don’t really have instructions specific to GCP or have an OS preference, other than that some of the testing has been done on Ubuntu.

colin2328 commented 1 year ago

I would recommend first installing PyTorch with CUDA enabled, and then installing fbgemm-gpu. Once fbgemm-gpu is installed and working (which you can test by running https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu#running-fbgemm_gpu), TorchRec and dlrm should work for you.
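The install order suggested above can be checked with a small sketch like the following. It tries to import each module in turn and stops at the first failure, then reports which CUDA version PyTorch was built against. The module names `torch`, `fbgemm_gpu`, and `torchrec` are assumed import names, and `check_stack` and `cuda_summary` are hypothetical helpers written for this illustration. It is not a substitute for the fbgemm_gpu test linked above.

```python
import importlib


def check_stack(modules=("torch", "fbgemm_gpu", "torchrec")):
    """Import each module in order and stop at the first one that fails."""
    results = []
    for name in modules:
        try:
            importlib.import_module(name)
        except Exception as exc:  # e.g. ImportError from a shared-library symbol mismatch
            results.append((name, f"FAILED: {exc}"))
            break
        results.append((name, "ok"))
    return results


def cuda_summary():
    """Report PyTorch's CUDA build info, or None if torch cannot be imported."""
    try:
        import torch
    except Exception:
        return None
    return {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for name, status in check_stack():
        print(f"{name}: {status}")
    print(cuda_summary())
```

If `torch` is the first entry to fail, the problem lies in the PyTorch/CUDA installation itself rather than in fbgemm-gpu or TorchRec.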