Open cjlovering opened 4 years ago
Hey Charles, thanks for reporting this!
I just pushed https://github.com/facebookresearch/torchbeast/commit/e233fbc3019abe693eb3e30ea560c2445377eb88 which should resolve this issue. Please let us know if you run into further problems.
BTW please note that we had limited success building the Docker image on MacOS as it seems to stall while compiling PyTorch. This may be a resource constraint.
Hello Heinrich,
Thanks for the help. Unfortunately, this did not end up fixing the issue for me; in the end a similar error occurred. (I included a few more messages from the build.) I will try building on a Linux machine and see if I can get it to work there.
Best, Charles
[...]
[1363/2619] Building CXX object caffe2/CMakeFiles/net_async_tracing_test.dir/core/net_async_tracing_test.cc.o
[1364/2619] Building CXX object caffe2/CMakeFiles/kernel_stackbased_test.dir/__/aten/src/ATen/core/op_registration/kernel_stackbased_test.cpp.o
[1365/2619] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_dlpack.cc.o
[1366/2619] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_registry.cc.o
[1367/2619] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o
FAILED: caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o
/usr/bin/c++ -DAT_PARALLEL_OPENMP=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTH_BLAS_MKL -D_FILE_OFFSET_BITS=64 -D_THP_CORE -Dcaffe2_pybind11_state_EXPORTS -I../aten/src -I. -I../ -I../cmake/../third_party/benchmark/include -Icaffe2/contrib/aten -I../third_party/onnx -Ithird_party/onnx -I../third_party/foxi -Ithird_party/foxi -Icaffe2/aten/src/TH -I../aten/src/TH -Icaffe2/aten/src -Iaten/src -I../aten/../third_party/catch/single_include -I../aten/src/ATen/.. -Icaffe2/aten/src/ATen -I../third_party/miniz-2.0.8 -I../caffe2/core/nomnigraph/include -I../caffe2/../torch/csrc/api -I../caffe2/../torch/csrc/api/include -I../c10/.. -Ithird_party/ideep/mkl-dnn/include -I../third_party/ideep/mkl-dnn/src/../include -isystem third_party/gloo -isystem ../cmake/../third_party/gloo -isystem ../cmake/../third_party/googletest/googlemock/include -isystem ../cmake/../third_party/googletest/googletest/include -isystem ../third_party/protobuf/src -isystem /root/miniconda3/envs/torchbeast/include -isystem ../third_party/gemmlowp -isystem ../third_party/neon2sse -isystem ../third_party -isystem ../cmake/../third_party/eigen -isystem /root/miniconda3/envs/torchbeast/include/python3.7m -isystem /root/miniconda3/envs/torchbeast/lib/python3.7/site-packages/numpy/core/include -isystem ../cmake/../third_party/pybind11/include -isystem /opt/rocm/hip/include -isystem /include -isystem ../third_party/ideep/mkl-dnn/include -isystem ../third_party/ideep/include -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls 
-Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -fPIC -fvisibility=hidden -DCAFFE2_USE_GLOO -DHAVE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD -Wall -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wno-write-strings -Wno-unknown-pragmas -Wno-missing-braces -fopenmp -std=gnu++11 -MD -MT caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o -MF caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o.d -o caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o -c ../caffe2/python/pybind_state.cc
c++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
[1368/2619] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_int8.cc.o
[1369/2619] Building CXX object caffe2/CMakeFiles/kernel_functor_test.dir/__/aten/src/ATen/core/op_registration/kernel_functor_test.cpp.o
[1370/2619] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_nomni.cc.o
ninja: build stopped: subcommand failed.
Building wheel torch-1.2.0a0+54a63e0
-- Building version 1.2.0a0+54a63e0
cmake -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/src/pytorch/torch -DCMAKE_PREFIX_PATH=/root/miniconda3/envs/torchbeast -DNUMPY_INCLUDE_DIR=/root/miniconda3/envs/torchbeast/lib/python3.7/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/root/miniconda3/envs/torchbeast/bin/python -DPYTHON_INCLUDE_DIR=/root/miniconda3/envs/torchbeast/include/python3.7m -DPYTHON_LIBRARY=/root/miniconda3/envs/torchbeast/lib/libpython3.7m.so.1.0 -DTORCH_BUILD_VERSION=1.2.0a0+54a63e0 -DUSE_CUDA=False -DUSE_DISTRIBUTED=True -DUSE_NUMPY=True -DUSE_SYSTEM_EIGEN_INSTALL=OFF /src/pytorch
cmake --build . --target install --config Release -- -j 4
Traceback (most recent call last):
File "setup.py", line 756, in <module>
build_deps()
File "setup.py", line 325, in build_deps
cmake=cmake)
File "/src/pytorch/tools/build_pytorch_libs.py", line 64, in build_caffe2
cmake.build(my_env)
File "/src/pytorch/tools/setup_helpers/cmake.py", line 321, in build
self.run(build_args, my_env)
File "/src/pytorch/tools/setup_helpers/cmake.py", line 133, in run
check_call(command, cwd=self.build_dir, env=env)
File "/root/miniconda3/envs/torchbeast/lib/python3.7/subprocess.py", line 347, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '4']' returned non-zero exit status 1.
The command '/bin/bash -c python setup.py install' returned a non-zero code: 1
Hey @cjlovering, did you end up making progress on this?
Hello @edran, so far no. I tried again with new installs and updated source on MacOS and it did not work. I have not had a chance to do so on a Linux machine (as I don't have immediate access to one with sufficient permissions). I will be coming back to this issue again soon.
Docker Desktop on mac has its own set of resource constraints (controlled in Preferences -> Advanced). I wonder if the internal compiler error is caused by failing to allocate memory due to the constraint.
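One way to check this: "Killed (program cc1plus)" is typically what GCC prints when the kernel's out-of-memory killer terminates the compiler, and the log above shows the build running with -j 4. PyTorch's build reads the MAX_JOBS environment variable to cap the number of parallel compile jobs. Below is a rough sketch for picking a value from the memory visible inside the container. The helper names and the 2 GiB-per-job budget are assumptions for illustration, not measured figures.

```python
import os


def read_mem_total_kib(meminfo_text):
    """Return the MemTotal value (in KiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def suggest_max_jobs(mem_total_kib, cpu_count, gib_per_job=2.0):
    """Cap parallel jobs by memory as well as by CPU count (never below 1).

    gib_per_job is an assumed budget per compiler process, not a measurement.
    """
    mem_gib = mem_total_kib / (1024 * 1024)
    by_memory = int(mem_gib // gib_per_job)
    return max(1, min(cpu_count, by_memory))


if __name__ == "__main__":
    # Inside Docker Desktop this reflects the VM's memory setting.
    with open("/proc/meminfo") as f:
        mem_kib = read_mem_total_kib(f.read())
    jobs = suggest_max_jobs(mem_kib, os.cpu_count() or 1)
    print(f"export MAX_JOBS={jobs}")
```

If the suggested value is well below 4, either raising the Docker Desktop memory limit or setting MAX_JOBS before running python setup.py install would be worth trying. Note that /proc/meminfo does not reflect per-container cgroup limits, so this is only a first approximation.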
Hello! I've stopped trying to get this to work on Mac, and am now trying on Google Cloud.
I have gotten Docker and polybeast to run, but I have not been able to use GPUs with it. Do you have a recommended approach for using Docker with GPUs?
@cjlovering you most likely want to use nvidia-docker, and modify our image to:
@edran Thank you! (I followed the second option and used nvidia/cuda:10.1-base-ubuntu18.04.)
I think the GPUs are available and CUDA is installed. For instance, I was able to run nvidia-smi and see the GPU status (by adding another CMD to the Dockerfile). However, when the polybeast script is run, it does not find CUDA available.
Is there something I should update in the pytorch installation or something along those lines?
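A quick way to narrow this down is to ask the installed PyTorch what it was built with, since nvidia-smi only confirms that the driver and GPU are visible to the container. The sketch below is a diagnostic only; the function name cuda_report is hypothetical, and it uses only standard torch attributes.

```python
def cuda_report():
    """Collect what the installed PyTorch build reports about CUDA."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None here means this PyTorch binary was built without CUDA support.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
        "cudnn_available": torch.backends.cudnn.is_available(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If built_with_cuda comes back as None, the problem is in how PyTorch was compiled rather than in the GPU setup. The earlier build log in this thread shows cmake being invoked with -DUSE_CUDA=False; if the build on the GPU image prints the same flag, the resulting binary will not see CUDA regardless of what nvidia-smi reports.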
I don't know whether the base image is enough to pull in both CUDA and cuDNN, and it's possible that both are required for PyTorch to be compiled with CUDA support. Try using 10.1-cudnn7-runtime-ubuntu18.04.
Also, if you share your Dockerfile I can give it a go locally to see whether I can spot any issues.
Thanks! I tried updating the image with that and it didn't seem to work for me.
Here's the file (with the updated image): https://github.com/cjlovering/torchbeast/blob/master/Dockerfile
After installing docker (on MacOS), the build failed. I am on the latest commit in master.
I get the following message: