darglein / ADOP

MIT License

build failures, undefined references when building, Docker #62

Open mureva opened 2 years ago

mureva commented 2 years ago

I'm trying to build ADOP without Conda so I can run it on a remote machine, the only machine I have access to with a powerful enough GPU, where I have to run it inside a Docker container.

I have managed to build on my local machine, but no matter what settings I use on my trivial test dataset it fails to allocate memory on that machine's "meagre" 8GB 1070.

Following the same procedure that worked on my local machine, I believe I've installed all relevant dependencies. The base container is a CUDA-enabled image based on Ubuntu 20.04, and I've installed CUDA, cuDNN 8, pre-compiled libTorch with the modern ABI (building torch has too many headaches itself), MKL, libjpeg, libpng, protobuf, protobuf-compiler, python3-dev, ninja-build, and cmake 3.19.5. I've also enabled the headless build.

When I use CUDA 11.3 (which would match the current libtorch release), ADOP fails to build; or rather, compilation of PointRenderer.cu stalls and stays on that step for more than 24 hours.

When I use CUDA 11.2 or 11.4 I can get all the way through compilation, but the linking stage produces undefined references to functions in your Saiga library, despite the Saiga libraries being included on the compile command.

I've attached a file with the first linker error, and also my Dockerfile in case it helps. Since I do have one machine where the build succeeded, I suspect I'm just missing a dependency or have the wrong version of one, but I'm stuck on which it is, so any help would be greatly appreciated.

ADOP-link-error.txt Dockerfile-ADOP.txt

Gatsby23 commented 2 years ago

Hey, have you solved this problem? I'm running into the same issue.

mureva commented 2 years ago

I've managed to get the build to complete using a Dockerfile posted by another user in a comment, with a couple of small adjustments. I've not yet tested that the resulting build actually runs, but at least it compiles. See here for the original. I made two small changes: first, I changed the 'FROM' line to FROM nvidia/cuda:11.4.2-devel-ubuntu20.04, and then later on I changed RUN ./install_pytorch.sh to RUN ./install_pytorch_source.sh
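
As a rough sketch, the two changes look like this in the Dockerfile. Only the FROM line and the script name come from my edits; everything in between is the other user's original and is elided here rather than reproduced, so this fragment will not build on its own.

```dockerfile
# Change 1: base image switched to the CUDA 11.4.2 devel image on Ubuntu 20.04.
FROM nvidia/cuda:11.4.2-devel-ubuntu20.04

# (remaining steps of the other user's Dockerfile, unchanged and omitted here)

# Change 2: this line was previously "RUN ./install_pytorch.sh".
RUN ./install_pytorch_source.sh
```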