amazon-archives / amazon-dsstne

Deep Scalable Sparse Tensor Network Engine (DSSTNE) is an Amazon developed library for building Deep Learning (DL) machine learning (ML) models
Apache License 2.0

Build issues #221

Open miketempleman opened 5 years ago

miketempleman commented 5 years ago

I have encountered two separate issues when trying to build the Docker version of dsstne. I am using the DSSTNE CUDA 9.1 (ami-fe173884) ami on a g2.8xlarge instance in us-east-1.

First, I cannot run the driver information app. Whenever I try to run:

nvidia-docker run --rm nvidia/cuda nvidia-smi

The response is:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown.

I know the nvidia-smi app is there:

whereis nvidia-smi
nvidia-smi: /usr/bin/nvidia-smi /usr/share/man/man1/nvidia-smi.1.gz

And if I simply run the nvidia-smi app from bash I see that the driver is installed:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 00000000:00:03.0 Off |                  N/A |
| N/A   28C    P8    17W / 125W |     11MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But somehow the $PATH for nvidia-docker does not point to it. Do I need to build nvidia-docker to resolve this $PATH problem? Or use another ami? I see that there are other dsstne amis in us-east-1.
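One thing worth checking here: `whereis` on the host only describes the host filesystem, while the "not found in $PATH" error comes from inside the container, which has its own filesystem and its own $PATH. A minimal sketch for checking where (or whether) an executable is visible on a given $PATH follows. `find_in_path` is a hypothetical helper written for this illustration, not part of nvidia-docker or dsstne.

```shell
#!/bin/sh
# find_in_path NAME PATH_STRING
# Hypothetical helper: prints the first executable called NAME found in the
# colon-separated PATH_STRING and returns 0; returns 1 if no entry has it.
find_in_path() {
    name=$1
    path_string=$2
    old_ifs=$IFS
    IFS=:
    for dir in $path_string; do
        # Accept only executable regular files, not directories.
        if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Usage on the host:
#   find_in_path nvidia-smi "$PATH" || echo "nvidia-smi is not on the host PATH"
# The same question asked inside the container (assumes the image ships sh):
#   nvidia-docker run --rm nvidia/cuda sh -c 'echo "$PATH"; command -v nvidia-smi'
```

If the second command prints a $PATH but no location for nvidia-smi, the binary is simply not present in the container, which would point at the nvidia-docker setup rather than at the host driver install.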

Second, if I go ahead and try to build dsstne from the repository using latest, I see a warning from the makefile:

Step 13/15 : RUN cd /opt/amazon/dsstne/src/amazon/dsstne && make install
 ---> Running in 8545a8860f50
Makefile:6: **************************************************************************************
Makefile:7: ****************** USE OF DEPRECATED MAKEFILE ******************
Makefile:8: ****************** PLEASE USE THE ONE AT THE ROOT OF THE REPOSITORY ******************
Makefile:9: **************************************************************************************

And the make fails with the error:

mkdir -p /opt/amazon/dsstne/src/amazon/../../amazon-dsstne
cp -rfp /opt/amazon/dsstne/src/amazon/../../../build/lib /opt/amazon/dsstne/src/amazon/../../amazon-dsstne/lib
cp: cannot stat '/opt/amazon/dsstne/src/amazon/../../../build/lib': No such file or directory
make: *** [install] Error 1
Makefile:26: recipe for target 'install' failed
The command '/bin/sh -c cd /opt/amazon/dsstne/src/amazon/dsstne && make install' returned a non-zero code: 2

When I change the Dockerfile to use the Makefile at the root of the repo, the build fails with:

In file included from src/main/native/com_amazon_dsstne_Dsstne.cpp:20:0:
src/main/native/jni_util.h:21:17: fatal error: jni.h: No such file or directory
compilation terminated.
make[1]: *** [target/native/build/com_amazon_dsstne_Dsstne.o] Error 1

At this point I am at an impasse. I did try following the setup instructions using the community ami Amazon DSSTNE (nvidia-docker) - ami-25c0eb32 but encountered the same error.

My next step is to try to rebuild nvidia-docker and then continue to grind through the dsstne docker build. But I hope that someone can let me know what I am doing wrong before I spend another day on this task instead of working with dsstne.

Mike Templeman

mmwillet commented 5 years ago

@miketempleman I encountered those same issues. We addressed the deprecated Makefile failure by switching to the Makefile at the root of the repository (which you did). Specifically, I changed the run command in the Dockerfile to:

RUN cd /opt/amazon/dsstne && \
    make install

The jni.h failure seems to come from the image missing "jni.h" and "jni_md.h". We worked around this (probably improperly) by adding the following before the make command in the Dockerfile:

RUN apt-get --yes install openjdk-8-jdk

Finally, the predict script that the documentation suggests using is not on the container's $PATH and has to be called with its absolute path, like so:

$ nvidia-docker run --rm -it amazon-dsstne /opt/amazon/dsstne/build/bin/predict

The documentation in this repository should probably be changed and the Dockerfile should probably be fixed.
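To confirm that a JDK install actually provides the two headers mentioned above before re-running make, a small check along these lines may help. This is only a sketch: `has_jni_headers` is a hypothetical helper, and the JDK directory in the usage comment is the usual location for the openjdk-8-jdk package on amd64 Debian/Ubuntu images, which is an assumption about the base image rather than something taken from the dsstne Dockerfile.

```shell
#!/bin/sh
# has_jni_headers JDK_DIR
# Hypothetical helper: returns 0 when JDK_DIR/include contains jni.h and a
# jni_md.h (Linux JDKs keep the latter in a platform subdirectory such as
# include/linux); returns 1 otherwise.
has_jni_headers() {
    include_dir=$1/include
    [ -f "$include_dir/jni.h" ] || return 1
    for candidate in "$include_dir"/*/jni_md.h "$include_dir"/jni_md.h; do
        # An unmatched glob stays a literal pattern and fails the -f test.
        [ -f "$candidate" ] && return 0
    done
    return 1
}

# Usage (the JDK path is an assumption, adjust it to the image in use):
#   has_jni_headers /usr/lib/jvm/java-8-openjdk-amd64 \
#       && echo "JNI headers found" \
#       || echo "JNI headers missing"
```

Whether the build then picks the headers up still depends on how the root Makefile locates the JDK, so this only rules out the headers being absent.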