SJTU-IPADS / PhoenixOS

Fast OS-level support for GPU checkpoint and restore
Apache License 2.0

Problems encountered during building from scratch #10

Open 913887524gsd opened 2 days ago

913887524gsd commented 2 days ago

Nice project!

This issue records the obstacles I ran into while building the project from scratch, along with the solutions I found. I hope the maintainers can update the build script accordingly to make the build process smoother.

Environment

Docker: 27.1.0
Image: nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
Setup command:

sudo docker run -dit --gpus all                                         \
            -v.:/root                                                   \
            --privileged --network=host --ipc=host                      \
            --name phos nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

Waiting for user input

I used the command from the README to build:

./build.sh -3 -i

The build hung while installing software-properties-common: the installation prompts for time zone information, but the script offers no way to enter it.

Solution: Install software-properties-common manually before running the script, or set the TZ and DEBIAN_FRONTEND environment variables.
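A minimal sketch of this workaround, assuming a root shell inside the container. The helper name preinstall_deps and the Etc/UTC value are illustrative only and are not part of the PhoenixOS scripts.

```shell
# Sketch only: assumes a root shell inside the Ubuntu 20.04 container.
export DEBIAN_FRONTEND=noninteractive   # debconf stops asking interactive questions
export TZ=Etc/UTC                       # example value; substitute your own time zone

# Hypothetical helper: install the package that triggered the prompt ahead of time.
preinstall_deps() {
    apt-get update &&
    apt-get install -y software-properties-common
}
```

With both variables exported, running preinstall_deps followed by ./build.sh -3 -i in the same shell should get past the prompt.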

Missing ~/.cargo/env

After completing the first stage of the installation, the script prompted me to source ~/.bashrc. However, after sourcing it, I found that ~/.cargo/env was missing.

Solution: Install the Rust toolchain with rustup, which creates ~/.cargo/env:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
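For an unattended build, the same installer can be run without its confirmation prompt. The sketch below assumes that rustup's -y flag is acceptable for your setup; the helper name ensure_cargo_env is hypothetical.

```shell
# Hypothetical helper: install Rust only when ~/.cargo/env is absent, then load it.
ensure_cargo_env() {
    if [ ! -f "$HOME/.cargo/env" ]; then
        # -y accepts the installer defaults instead of prompting
        curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
    fi
    # Put cargo and rustc on PATH for the current shell
    . "$HOME/.cargo/env"
}
```

Calling ensure_cargo_env before re-running the build avoids depending on ~/.bashrc having been sourced.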

Missing header files

When building the Autogen and Remoting components, the process failed, and the log indicated that some header files were missing (see build_log/build_PhOS-Autogen.log and build_log/build_PhOS-Remoting.log for details):

../../pos/cuda_impl/utils/fatbin.h:26:10: fatal error: libelf.h: No such file or directory
   26 | #include <libelf.h>
      |          ^~~~~~~~~~
cpu-utils.c:9:10: fatal error: openssl/md5.h: No such file or directory
    9 | #include <openssl/md5.h>
      |          ^~~~~~~~~~~~~~~
cpu-client-driver.c:7:10: fatal error: vdpau/vdpau.h: No such file or directory
    7 | #include <vdpau/vdpau.h>
      |          ^~~~~~~~~~~~~~~

Solution: Install header files:

apt-get install -y libelf-dev libgl1-mesa-dev libssl-dev libvdpau-dev
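To confirm the headers are in place before rebuilding, a quick check along these lines may help. It assumes the packages install their headers under /usr/include, as they do on Ubuntu 20.04; missing_headers is a hypothetical helper name.

```shell
# Hypothetical helper: print each header that is absent under the given include root.
missing_headers() {
    root=$1
    shift
    for h in "$@"; do
        [ -f "$root/$h" ] || echo "$h"
    done
}

# The three headers named in the build logs; no output means all were found.
missing_headers /usr/include libelf.h openssl/md5.h vdpau/vdpau.h
```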

Missing dynamic library

After completing the installation, I tried to launch the hijack library with LD_PRELOAD, but it failed because libtirpc.so.3 was missing. I could only find /usr/lib/x86_64-linux-gnu/libtirpc.so.

Solution: Run the ldconfig command to generate libtirpc.so.3.
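A small sketch of this fix with a verification step, assuming ldconfig is on PATH and that refreshing the cache requires root. The helper name in_ld_cache is hypothetical.

```shell
# Hypothetical helper: succeed if the dynamic linker cache lists the given soname.
in_ld_cache() {
    ldconfig -p 2>/dev/null | grep -F "$1" >/dev/null
}

# Refresh the cache only when running as root and ldconfig is available.
if [ "$(id -u)" -eq 0 ] && command -v ldconfig >/dev/null 2>&1; then
    ldconfig
fi

in_ld_cache libtirpc.so.3 || echo "libtirpc.so.3 is not in the linker cache"
```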

Hijacking failed

I tested the hijack with a hello-world CUDA program, but no runtime APIs were hijacked. Running ldd on the binary showed no dependency on the CUDA runtime library. It seems that nvcc links the CUDA runtime statically into the user binary by default, so LD_PRELOAD has nothing to intercept.

Solution: Pass --cudart=shared to nvcc so that the user program links the CUDA runtime dynamically.
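A sketch of how the result can be checked, assuming nvcc is on PATH. The file name hello.cu is a placeholder for your own test program, and links_shared_cudart is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the binary lists libcudart among its dynamic dependencies.
links_shared_cudart() {
    ldd "$1" 2>/dev/null | grep 'libcudart\.so' >/dev/null
}

# hello.cu is a placeholder; the step is skipped when nvcc or the file is missing.
if command -v nvcc >/dev/null 2>&1 && [ -f hello.cu ]; then
    nvcc --cudart=shared hello.cu -o hello
    links_shared_cudart ./hello && echo "CUDA runtime is dynamically linked"
fi
```

If the check fails for an existing binary, it was most likely built without --cudart=shared and needs to be recompiled before the preloaded library can intercept runtime calls.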

wxdwfc commented 1 day ago

Thank you so much for your troubleshooting! We will check that and revise the doc accordingly :)