Fahey-McLay / xalt


Changes to support injecting XALT inside Singularity containers #42

Closed samcmill closed 6 years ago

samcmill commented 6 years ago

You can inject XALT inside Singularity containers by doing the following:

$ SINGULARITYENV_LD_PRELOAD=/host/path/to/xalt/lib64/libxalt_init.so SINGULARITY_BINDPATH="/host/path/to/xalt" SINGULARITY_CONTAINLIBS="/usr/lib64/libdcgm.so.1" singularity shell --nv docker://ubuntu:16.04

If you are not tracking GPU usage, then SINGULARITY_CONTAINLIBS and --nv can be omitted.

These environment variables can be added to the XALT module file to automatically enable XALT tracking of workloads inside Singularity containers.
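As a rough sketch of what such a module file would need to set, the same variables are shown below as plain shell exports rather than in module-file syntax. `XALT_DIR` is a hypothetical variable name introduced only for this illustration, and `/host/path/to/xalt` is the placeholder path from the command above, not a real location.

```shell
#!/bin/sh
# XALT_DIR is a hypothetical name; replace the placeholder with the real
# host path to the XALT installation.
XALT_DIR="/host/path/to/xalt"

# Preload the XALT init library for processes started inside the container.
export SINGULARITYENV_LD_PRELOAD="${XALT_DIR}/lib64/libxalt_init.so"

# Bind-mount the host XALT installation into the container.
export SINGULARITY_BINDPATH="${XALT_DIR}"

# Only needed when tracking GPU usage (used together with --nv).
export SINGULARITY_CONTAINLIBS="/usr/lib64/libdcgm.so.1"
```

Note that this sketch overwrites any existing SINGULARITY_BINDPATH value; a site that already sets bind paths would presumably want to append to it instead.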

However, this exposes a few issues, which this pull request addresses:

  1. XALT is linked against libcrypto.so, but the path and version of this library vary between Linux distributions. When the host and container operating systems differ (e.g., a CentOS host and an Ubuntu container image), the mismatch can cause the user's job to fail because libxalt_init.so cannot be loaded. Add a configure option, --with-staticLibs, that statically links libraries into XALT. Currently, this only triggers static linking of libcrypto and NVIDIA DCGM. You must configure with --with-staticLibs to use XALT reliably with Singularity containers.

  2. The container image may not include the file utility; the ubuntu:16.04 container image is an example. XALT will segfault because it assumes that running the file command produces output. Add a check to ensure there is output before trying to access it.

  3. Related to item 2, if the file utility is not present, then an error message is printed to stderr. Modify capture() to swallow anything written to stderr so the user does not see this.
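The behavior described in items 2 and 3 can be illustrated with a small shell analogue. This is not the XALT source (the actual changes are in XALT's own capture() code); `capture_quiet` and `describe_exec` are hypothetical helper names used only to show the two ideas: discard stderr, and check for empty output before using it.

```shell
#!/bin/sh
# capture_quiet (hypothetical): run a command, pass through its stdout,
# and discard anything it writes to stderr, including the shell's
# "command not found" message.
capture_quiet() {
  "$@" 2>/dev/null
}

# describe_exec (hypothetical): print the `file` description of the
# given path, or "unknown" when `file` is missing or prints nothing.
describe_exec() {
  out=$(capture_quiet file "$1") || out=""
  if [ -z "$out" ]; then
    # No output to parse: fall back instead of reading an empty result.
    echo "unknown"
  else
    echo "$out"
  fi
}
```

For example, `describe_exec /bin/sh` prints a description on a system where file is installed and `unknown` in an image such as ubuntu:16.04 where it is not, with nothing written to stderr in either case.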

With these changes:

$ SINGULARITYENV_XALT_TRACING=yes SINGULARITYENV_LD_PRELOAD=/tmp/usr/local/xalt/xalt/lib64/libxalt_init.so SINGULARITY_BINDPATH="/tmp/usr/local/xalt/xalt" SINGULARITY_CONTAINLIBS="/usr/lib64/libdcgm.so.1" singularity exec --nv ~/ubuntu1604.simg ~/peer
...
---------------------------------------------
 Date:          Thu Oct 11 14:12:38 2018
 XALT Version:  XALT 2.3.12
 Nodename:      ivb125
 System:        Linux
 Release:       3.10.0-862.9.1.el7.x86_64
 O.S. Version:  #1 SMP Mon Jul 16 16:29:36 UTC 2018
 Machine:       x86_64
 Syshost:       psg
---------------------------------------------

myinit(LD_PRELOAD,/home/smcmillan/peer){
  Test for __XALT_INITIAL_STATE__: "(NULL)", STATE: "LD_PRELOAD"
  Test for XALT_EXECUTABLE_TRACKING: yes
  Test for rank == 0, rank: 0
  GPU tracing
    -> XALT is build to track all programs, Current program is a scalar program -> Not producing a start record
}

max error: 1.192093e-07
max error: 1.192093e-07

myfini(LD_PRELOAD){
  GPU tracing
  4 GPUs detected
  GPU 0: num compute pids 1
  GPU 1: num compute pids 1
  GPU 2: num compute pids 0
  GPU 3: num compute pids 0
  2 of 4 GPUs were used
    -> Scalar Sampling program run_time: 0.480286: (my_rand: 0.376318 <= prob: 1) for program: /home/smcmillan/peer
  len: 32, b64_cmd: WyIvaG9tZS9zbWNtaWxsYW4vcGVlciJd
  Recording State at end of scalar user program:
    LD_LIBRARY_PATH= PATH=/usr/bin:/bin /tmp/usr/local/xalt/xalt//libexec/xalt_run_submission --interfaceV 4 --ppid 24242 --syshost "psg" --start "1539267158.9848" --end "1539267159.4651" --exec "/home/smcmillan/peer" --ntasks 1 --uuid "b668d8ad-35cc-4039-8580-3e30338715cf" --prob 1 --ngpus 2 --path "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" --ld_libpath "/.singularity.d/libs" -- ["/home/smcmillan/peer"]
}

xalt_run_submission(zzz) {
  Built envT
  Extracted recordT from executable
  Built userT, userDT
  Filter envT
  Parsed LDD
  Using XALT_TRANSMISSION_STYLE: file
  Built json string
  Wrote json run file : /home/smcmillan/.xalt.d/run.psg.2018_10_11_14_12_38_9848.zzz.b668d8ad-35cc-4039-8580-3e30338715cf.json
}

samcmill commented 6 years ago

On the mailing list, @treydock verified that this change (--with-staticLibs) has the side effect of resolving #36.