facebookresearch / torchbeast

A PyTorch Platform for Distributed RL
Apache License 2.0

Issues with Docker #3

Open cjlovering opened 4 years ago

cjlovering commented 4 years ago

After installing docker (on MacOS), the build failed. I am on the latest commit in master.

I get the following message:

Traceback (most recent call last):
  File "setup.py", line 759, in <module>
    build_deps()
  File "setup.py", line 311, in build_deps
    cmake=cmake)
  File "/src/pytorch/tools/build_pytorch_libs.py", line 59, in build_caffe2
    cmake.build(my_env)
  File "/src/pytorch/tools/setup_helpers/cmake.py", line 334, in build
    self.run(build_args, my_env)
  File "/src/pytorch/tools/setup_helpers/cmake.py", line 142, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/root/miniconda3/envs/torchbeast/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '4']' returned non-zero exit status 1.

heiner commented 4 years ago

Hey Charles, thanks for reporting this!

I just pushed https://github.com/facebookresearch/torchbeast/commit/e233fbc3019abe693eb3e30ea560c2445377eb88 which should resolve this issue. Please let us know if you run into further problems.

BTW, please note that we have had limited success building the Docker image on macOS, as the build seems to stall while compiling PyTorch. This may be due to a resource constraint.

cjlovering commented 4 years ago

Hello Heinrich,

Thanks for the help. Unfortunately, this did not fix the issue for me; a similar error occurred in the end. (I have included a few more messages from the build below.) I will try building on a Linux machine and see if I can get it to work there.

Best, Charles

[...]
[1363/2619] Building CXX object caffe2/CMakeFiles/net_async_tracing_test.dir/core/net_async_tracing_test.cc.o
[1364/2619] Building CXX object caffe2/CMakeFiles/kernel_stackbased_test.dir/__/aten/src/ATen/core/op_registration/kernel_stackbased_test.cpp.o
[1365/2619] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_dlpack.cc.o
[1366/2619] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_registry.cc.o
[1367/2619] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o
FAILED: caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o 
/usr/bin/c++  -DAT_PARALLEL_OPENMP=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTH_BLAS_MKL -D_FILE_OFFSET_BITS=64 -D_THP_CORE -Dcaffe2_pybind11_state_EXPORTS -I../aten/src -I. -I../ -I../cmake/../third_party/benchmark/include -Icaffe2/contrib/aten -I../third_party/onnx -Ithird_party/onnx -I../third_party/foxi -Ithird_party/foxi -Icaffe2/aten/src/TH -I../aten/src/TH -Icaffe2/aten/src -Iaten/src -I../aten/../third_party/catch/single_include -I../aten/src/ATen/.. -Icaffe2/aten/src/ATen -I../third_party/miniz-2.0.8 -I../caffe2/core/nomnigraph/include -I../caffe2/../torch/csrc/api -I../caffe2/../torch/csrc/api/include -I../c10/.. -Ithird_party/ideep/mkl-dnn/include -I../third_party/ideep/mkl-dnn/src/../include -isystem third_party/gloo -isystem ../cmake/../third_party/gloo -isystem ../cmake/../third_party/googletest/googlemock/include -isystem ../cmake/../third_party/googletest/googletest/include -isystem ../third_party/protobuf/src -isystem /root/miniconda3/envs/torchbeast/include -isystem ../third_party/gemmlowp -isystem ../third_party/neon2sse -isystem ../third_party -isystem ../cmake/../third_party/eigen -isystem /root/miniconda3/envs/torchbeast/include/python3.7m -isystem /root/miniconda3/envs/torchbeast/lib/python3.7/site-packages/numpy/core/include -isystem ../cmake/../third_party/pybind11/include -isystem /opt/rocm/hip/include -isystem /include -isystem ../third_party/ideep/mkl-dnn/include -isystem ../third_party/ideep/include -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls 
-Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3  -fPIC   -fvisibility=hidden -DCAFFE2_USE_GLOO -DHAVE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD -Wall -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wno-write-strings -Wno-unknown-pragmas -Wno-missing-braces -fopenmp -std=gnu++11 -MD -MT caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o -MF caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o.d -o caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o -c ../caffe2/python/pybind_state.cc
c++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
[1368/2619] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_int8.cc.o
[1369/2619] Building CXX object caffe2/CMakeFiles/kernel_functor_test.dir/__/aten/src/ATen/core/op_registration/kernel_functor_test.cpp.o
[1370/2619] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_nomni.cc.o
ninja: build stopped: subcommand failed.
Building wheel torch-1.2.0a0+54a63e0
-- Building version 1.2.0a0+54a63e0
cmake -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/src/pytorch/torch -DCMAKE_PREFIX_PATH=/root/miniconda3/envs/torchbeast -DNUMPY_INCLUDE_DIR=/root/miniconda3/envs/torchbeast/lib/python3.7/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/root/miniconda3/envs/torchbeast/bin/python -DPYTHON_INCLUDE_DIR=/root/miniconda3/envs/torchbeast/include/python3.7m -DPYTHON_LIBRARY=/root/miniconda3/envs/torchbeast/lib/libpython3.7m.so.1.0 -DTORCH_BUILD_VERSION=1.2.0a0+54a63e0 -DUSE_CUDA=False -DUSE_DISTRIBUTED=True -DUSE_NUMPY=True -DUSE_SYSTEM_EIGEN_INSTALL=OFF /src/pytorch
cmake --build . --target install --config Release -- -j 4
Traceback (most recent call last):
  File "setup.py", line 756, in <module>
    build_deps()
  File "setup.py", line 325, in build_deps
    cmake=cmake)
  File "/src/pytorch/tools/build_pytorch_libs.py", line 64, in build_caffe2
    cmake.build(my_env)
  File "/src/pytorch/tools/setup_helpers/cmake.py", line 321, in build
    self.run(build_args, my_env)
  File "/src/pytorch/tools/setup_helpers/cmake.py", line 133, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/root/miniconda3/envs/torchbeast/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '4']' returned non-zero exit status 1.
The command '/bin/bash -c python setup.py install' returned a non-zero code: 1

edran commented 4 years ago

Hey @cjlovering, did you end up making progress on this?

cjlovering commented 4 years ago

Hello @edran, so far no. I tried again with fresh installs and updated source on macOS, and it did not work. I have not had a chance to do so on a Linux machine (as I don't have immediate access to one with sufficient permissions). I will be coming back to this issue soon.

bottler commented 4 years ago

Docker Desktop on Mac has its own resource limits (set under Preferences -> Advanced). I wonder whether the internal compiler error comes from a failed memory allocation under those limits.
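
One way to check this hypothesis from inside the container is to compare the memory the container can see against the number of parallel compile jobs (the log above shows `-j 4`). The sketch below reads `MemTotal` from `/proc/meminfo` and suggests a job count. The helper names and the 2 GiB-per-job figure are assumptions made for illustration, not measured values. PyTorch's source build reads a `MAX_JOBS` environment variable to cap parallel jobs; whether that applies unchanged to the exact commit built here is also an assumption.

```python
import os


def read_mem_total_gib(meminfo_path="/proc/meminfo"):
    """Return MemTotal from a Linux meminfo file in GiB, or None if unavailable."""
    try:
        with open(meminfo_path) as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    kib = int(line.split()[1])  # the value is reported in kB
                    return kib / (1024 * 1024)
    except OSError:
        return None
    return None


def suggest_max_jobs(mem_gib, cpu_count, gib_per_job=2.0):
    """Pick a job count bounded by both CPU count and visible memory.

    gib_per_job is a rough, hypothetical budget per compiler process.
    """
    if mem_gib is None:
        return 1  # be conservative when memory cannot be determined
    by_memory = int(mem_gib // gib_per_job)
    return max(1, min(cpu_count, by_memory))


if __name__ == "__main__":
    mem = read_mem_total_gib()
    jobs = suggest_max_jobs(mem, os.cpu_count() or 1)
    print(f"MemTotal: {mem} GiB, suggested MAX_JOBS={jobs}")
```

If the suggested value is lower than 4, either raising the memory limit in Docker Desktop or setting `MAX_JOBS` to that value before `python setup.py install` would be worth trying.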

cjlovering commented 4 years ago

Hello! I've stopped trying to get this to work on Mac, and am now trying on Google Cloud.

I have gotten Docker and polybeast to run, but I have not been able to use GPUs with them. Do you have a recommended approach for using Docker with GPUs?

edran commented 4 years ago

@cjlovering you most likely want to use nvidia-docker, and modify our image to:

  1. either have cuda installed before pytorch;
  2. or simply replace https://github.com/facebookresearch/torchbeast/blob/master/Dockerfile#L2 with an image from NVIDIA's hub: https://hub.docker.com/r/nvidia/cuda/ (the 18.04 cudnn one should work in theory).

cjlovering commented 4 years ago

@edran Thank you! (I followed the second option and used nvidia/cuda:10.1-base-ubuntu18.04).

I think the GPUs are available and CUDA is installed. For instance, I was able to run nvidia-smi and see the GPU status (by adding another CMD to the Dockerfile). However, when the polybeast script runs, it does not find CUDA available.

Is there something I should update in the pytorch installation or something along those lines?

edran commented 4 years ago

I don't know whether the base image is enough to pull both cuda and cudnn, and it's possible that both might be required for pytorch to be compiled with cuda support. Try using 10.1-cudnn7-runtime-ubuntu18.04.

Also, if you share your dockerfile I can give it a go locally to see whether I spot issues.
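
To narrow down where the problem sits, it may help to separate "PyTorch was built without CUDA" from "PyTorch was built with CUDA but cannot see a GPU". The cmake line in the earlier build log shows `-DUSE_CUDA=False`, so the first case is at least plausible if the build still configures itself that way on the new base image. The sketch below uses `torch.version.cuda` (which is `None` on CPU-only builds) and `torch.cuda.is_available()`; the `cuda_report` helper name and the hint strings are made up for illustration.

```python
def cuda_report(torch_module):
    """Summarize whether a given torch build can use CUDA.

    torch_module is the imported torch package; it is passed in as an
    argument so the function can be exercised without a GPU.
    """
    built_with = getattr(torch_module.version, "cuda", None)
    available = torch_module.cuda.is_available()
    if built_with is None:
        hint = "CPU-only build: PyTorch needs to be rebuilt with CUDA support"
    elif not available:
        hint = "CUDA build, but no usable GPU or driver is visible in this container"
    else:
        hint = "CUDA build and a GPU is visible"
    return {"built_with_cuda": built_with, "cuda_available": available, "hint": hint}


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print(cuda_report(torch))
```

If this reports a CPU-only build, the image choice matters at build time: as far as I know, only the `devel` variants of the nvidia/cuda images include the compiler toolchain needed to compile against CUDA, so a `base` or `runtime` image may not be enough for a source build.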

cjlovering commented 4 years ago

Thanks! I tried updating the image with that and it didn't seem to work for me.

Here's the file (with the updated image): https://github.com/cjlovering/torchbeast/blob/master/Dockerfile