calab-ntu / gpu-cluster

Eureka and Spock GPU clusters

Installation steps and stress test #38

Open xuanweishan opened 2 years ago

xuanweishan commented 2 years ago

Installation

Installation of hardware

  1. Component list:
    • CPU: AMD Threadripper Pro 5975wx
    • MB: ASUS Pro WS WRX80 SAGE SE WIFI (Temporary)
    • RAM: G.skill F4-3200C16-32GVK * 8
    • GPU: MSI RTX 3080 Ti OC
    • SSD: 500G
    • PSU: 1200W
  2. Install components and check the VGA ctl switch (it should be turned off)
  3. Setup BIOS
    1. Update BIOS version to 1003.
    2. Update firmware to 1.17.0
  4. Open O.H.C.P

  5. Install the CUDA driver
    sh /work1/xuanshan/cuda_11.7.1_515.65.01_linux.run --silent --driver
  6. Install the InfiniBand driver
    ./mlnxofedinstall --force

Temperature performance

CPU stress test

  1. Install IPMI driver and tool
    yum -y install OpenIPMI ipmitool
  2. Enable the ipmi service
    systemctl enable ipmi.service
  3. Check CPU temperature
    ipmitool sensor get "CPU Temp." 
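The temperature reading can be parsed out of the `ipmitool` output for periodic logging during the stress test; a minimal sketch (the sample line below is illustrative, not captured output — the real line comes from `ipmitool sensor get "CPU Temp."`):

```shell
# Parse the numeric reading from an ipmitool "Sensor Reading" line.
# The sample string is an illustrative stand-in for real ipmitool output.
sample='Sensor Reading        : 62 (+/- 0) degrees C'
temp=$(printf '%s\n' "$sample" | awk -F' : ' '{print $2}' | awk '{print $1}')
echo "CPU temperature: ${temp} C"   # CPU temperature: 62 C
```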

GPU stress test

xuanweishan commented 1 year ago

Login node installation. (CentOS 7)

Install the system on RAID 1. Ref: https://www.thegeekdiary.com/how-to-install-centos-rhel-7-on-raid-1-partition/

  1. Installation system
  2. Installation destination
  3. Choose both disks on which to build the RAID.
  4. Choose I will configure partitioning
  5. Click create them automatically.
  6. Choose /home and click - to delete it.
  7. Select Device Type to be RAID, choose RAID LEVEL to RAID1 and write your personal label, such as md0.
  8. Similarly, set the Device Type and labels for the swap and /boot partitions.
  9. Installation summary:
    • [x] Additional Development
    • [x] Compatibility Libraries
    • [x] Development Tools
    • [x] Email Server
    • [x] Emacs
    • [x] File and Storage Server
    • [x] Hardware Monitoring Utilities
    • [x] Identity Management Server
    • [x] Infiniband Support
    • [x] KDE
    • [x] Legacy X Window System Compatibility
    • [x] Network File System Client
    • [x] Platform Development
    • [x] Python
    • [x] Technical Writing
    • [x] Security Tools
    • [x] System Administration Tools
  10. Set up network
    Edit /etc/sysconfig/network-scripts/ifcfg-XXX, where XXX is usually "enpXXX"
    BOOTPROTO=static 
    ONBOOT=yes 
    IPADDR=192.168.0.200 
    GATEWAY=192.168.0.1 
    NETMASK=255.255.255.0 
    DNS1=140.112.254.4
  11. NFS client
    • Mount all remote folders
      Do not mount /software (the login node serves it)
      mount /home; mount /work1; mount /project; mount /projectX; mount /projectY; mount /projectZ
  12. Install the CUDA driver
  13. Copy initialization scripts
    
    cp /work1/shared/spock/init_script/*.sh  /etc/profile.d/
    cp /work1/shared/spock/init_script/*.csh /etc/profile.d/
    cp /work1/shared/spock/etc/rc.local /etc/rc.d/
    chmod +x /etc/rc.d/rc.local
    Edit `/etc/rc.d/rc.local` as follows
    1. Comment out the line `/usr/bin/nvidia-persistenced --verbose`
    2. Comment out the line `nvidia-cuda-mps-control -d`
    3. Replace `nvidia-smi -i 0 -c EXCLUSIVE_PROCESS` by `nvidia-smi -i 0 -c PROHIBITED`
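The three edits above can be applied with sed; below is a sketch that dry-runs them on a scratch copy (point `RC` at `/etc/rc.d/rc.local` to apply them for real — the sample lines are assumed to match the real file):

```shell
# Dry-run of the rc.local edits on a scratch copy.
RC=$(mktemp)
cat > "$RC" <<'EOF'
/usr/bin/nvidia-persistenced --verbose
nvidia-cuda-mps-control -d
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
EOF
sed -i \
    -e 's|^/usr/bin/nvidia-persistenced --verbose|#&|' \
    -e 's|^nvidia-cuda-mps-control -d|#&|' \
    -e 's|EXCLUSIVE_PROCESS|PROHIBITED|' "$RC"
cat "$RC"
```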
  1. TORQUE

    cd /work1/shared/spock/package/torque/src/torque-3.0.6

    Login node only: edit Fish_Install.sh to enable --enable-server

    • WARNING: do NOT run Fish_Install.sh in parallel (i.e., install one node at a time)
      sh Fish_Install.sh >& log.spockXX
      cd ../../etc
      cp pbs /etc/init.d/

      Edit pbs.conf to set start_server=1 and start_mom=0

      cp pbs.conf /etc/
      cp nodes_spock /var/spool/TORQUE/server_priv/nodes
      systemctl enable pbs
      source /etc/profile.d/torque.sh
      cd ../src/torque-3.0.6/
      ./torque.setup root
      killall pbs_server
      systemctl start pbs # This error message is fine: "LOG_ERROR::No such file or directory (2) in read_config, fstat: config"
      systemctl status pbs
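The pbs.conf edit above can be scripted; a sketch on a scratch copy (point `PBSCONF` at `/etc/pbs.conf`, after the `cp` above, to apply it for real — the original values are assumed):

```shell
# Login node: serve jobs but do not run a MOM daemon.
PBSCONF=$(mktemp)
printf 'start_server=0\nstart_mom=1\n' > "$PBSCONF"
sed -i -e 's/^start_server=.*/start_server=1/' \
       -e 's/^start_mom=.*/start_mom=0/' "$PBSCONF"
cat "$PBSCONF"
```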
  2. [Optional] Create the SSH key of root

    ssh-keygen -t rsa
    cd ~/.ssh
    cp id_rsa.pub authorized_keys
    cp id_rsa* authorized_keys /work1/shared/spock/ssh_root/
  3. NFS server

    systemctl enable nfs
    systemctl start nfs
    cp /work1/shared/spock/etc/exports /etc/
    exportfs -ra
    showmount -e spock00 # /software 192.168.0.0/24
    

    Comment out spock00:/software in /etc/fstab

  4. CUDA

    1. Driver

      It should have been installed when following [[System Installation: Computing Node]]

    2. Libraries and samples

      mkdir /software/cuda /software/cuda/12.1
      ln -s /software/cuda/12.1 /software/cuda/default

      Install 12.1

      cd /work1/shared/spock/package/cuda/
      sh cuda_12.1.0_530.30.12_linux.run  --silent --toolkit --installpath=/software/cuda/12.1

      WIP

    3. [optional] cuDNN

      • Ref: https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#install-linux

        [Optional] Download the latest version
        https://developer.nvidia.com/cudnn --> Download cuDNN

        1. Installation (the following example adopts CUDA 10.1)
          CUDNN_TMP=/work1/fish/Code_Backup/GPU_Cluster_Setup/eureka/package/cudnn/cudnn-10.1-linux-x64-v7.6.5.32
          CUDA_TMP=/software/cuda/10.1
          cp ${CUDNN_TMP}/include/cudnn.h ${CUDA_TMP}/include
          cp ${CUDNN_TMP}/lib64/libcudnn* ${CUDA_TMP}/lib64
          chmod a+r ${CUDA_TMP}/include/cudnn.h ${CUDA_TMP}/lib64/libcudnn*
        2. Test. Log in to a computing node first. Do NOT use a GNU compiler later than 8
          cd /tmp
          export CUDA_PATH=/software/cuda/10.1
          export PATH=$CUDA_PATH/bin:$PATH
          export LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH
          cp -r /work1/fish/Code_Backup/GPU_Cluster_Setup/eureka/package/cudnn/cudnn_samples_v7 .
          cd cudnn_samples_v7/mnistCUDNN
          make clean && make -j 16
          ./mnistCUDNN # Test passed!
          cd ../../
          rm -rf cudnn_samples_v7
    4. Test

      cd /software/cuda/12.0/NVIDIA_CUDA-10.2_Samples
      
      ./1_Utilities/deviceQuery/deviceQuery

      Verify that

      1. 1st line: "Detected 1 CUDA Capable device(s)"
      2. 2nd line: Device 0: "GeForce RTX 3080 Ti"
      3. Last line: "Result = PASS"
        ./1_Utilities/bandwidthTest/bandwidthTest

        Verify that

      4. Host <-> Device bandwidth ~13 GB/s.
      5. "Result = PASS"
    5. [Optional] Reset the default mode to graphical.target (a.k.a. runlevel 5)

      systemctl set-default graphical.target
      systemctl get-default # graphical.target
  5. Intel compiler

    1. Create directory and link for intel compiler.

      mkdir /software/intel
      ln -s /software/intel /opt

      1. oneAPI Base & HPC toolkits

      1. Install Basic toolkit

        cd /software/intel
        cp /work1/shared/spock/package/intel/l_BaseKit_p_2023.0.0.25537.sh ./
        
        sudo bash l_BaseKit_p_2023.0.0.25537.sh
      2. Install HPC toolkit

        cp /work1/shared/spock/package/intel/l_HPCKit_p_2023.0.0.25400.sh ./
        
        sudo bash l_HPCKit_p_2023.0.0.25400.sh
    2. After installation

      mv /opt/intel/licenses /opt/intel/.pset /software/intel
      rm -rf /opt/intel
      cd /opt
      ln -s /software/intel
      cd /software/intel
      ln -s /work1/shared/spock/package/intel src

      Check /etc/profile.d/intel.sh and, if necessary, replace opt by software
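That replacement can be done with sed; a sketch on a scratch copy (the source line below is hypothetical — check the contents of the real intel.sh first):

```shell
# Dry-run of the opt -> software replacement on a scratch stand-in
# for /etc/profile.d/intel.sh (the original line is a hypothetical example).
SH=$(mktemp)
echo 'source /opt/intel/oneapi/setvars.sh > /dev/null' > "$SH"
sed -i 's|/opt/intel|/software/intel|g' "$SH"
cat "$SH"   # source /software/intel/oneapi/setvars.sh > /dev/null
```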

  6. Valgrind

    mkdir -p /software/valgrind
    cd /work1/shared/spock/package/valgrind/valgrind-3.15.0
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.spock

    After installation

    cd /software/valgrind
    ln -s /work1/shared/spock/package/valgrind src
    ln -s 3.15.0 default

    Check /etc/profile.d/valgrind.sh

  7. OpenMPI

    mkdir -p /software/openmpi
    cd /software/openmpi/src/openmpi-4.1.1
    # [Optional] Edit Fish_Install_with_UCX.sh (remember to un-comment the configuration flags)
    sh Fish_Install_with_UCX.sh >& log.spock-intel

    After installation

    1. check linking to UCX libraries
      cd /software/openmpi/4.1.1-intel-oneapi/bin
      objdump -p mpicxx | grep PATH    # see whether /software/openucx/ucx-1.12.0/lib is in RPATH
      ldd mpicxx | grep ucx            # see whether dynamic linker can find UCX libraries
    2. make soft links
      cd /software/openmpi
      ln -s /work1/shared/spock/package/openmpi src
      unlink default                   # optional, if default already existed
      ln -s /software/openmpi/4.1.1-intel-oneapi default
    3. Check /etc/profile.d/openmpi.sh
    4. Check debugger
      ompi_info | grep "MCA memchecker" # MCA memchecker: valgrind (MCA v2.1.0, API v2.0.0, Component v3.1.5)
    5. Set the MCA parameters [Optional]
      1. Edit the configuration file /software/openmpi/4.1.1-intel-oneapi/etc/openmpi-mca-params.conf as root (to query the path of the configuration file, run: ompi_info --params mca all --level 9 | grep mca_param_files). Add the lines below (2021/07/24):
        pml=ucx                                       # include only ucx for pml
        osc=ucx                                       # include only ucx for osc
        btl=^openib                                   # exclude openib from btl

        which works for OpenMPI 4.1.1 and UCX 1.12.0, without giving warning message:

        [eureka01:45901] common_ucx.c:364 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
      2. Reference for excluding openib if UCX library is installed.
      3. Reference for setting MCA parameters by configure file (10. How do I set the value of MCA parameters?, 4. Files)
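Appending the three parameters can be scripted; a sketch using a scratch stand-in for openmpi-mca-params.conf:

```shell
# Append the MCA parameters to a scratch stand-in for
# /software/openmpi/4.1.1-intel-oneapi/etc/openmpi-mca-params.conf.
CONF=$(mktemp)
cat >> "$CONF" <<'EOF'
pml=ucx
osc=ucx
btl=^openib
EOF
grep -c '=' "$CONF"   # 3 parameter lines appended
```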
  8. Maui

    cd /work1/fish/Code_Backup/GPU_Cluster_Setup/spock/package/maui/maui-3.3.1
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.spock
    cd etc/
    cp maui.d /etc/init.d/
    cp maui.sh maui.csh /etc/profile.d/
    systemctl enable maui.d
    
    cd /usr/local/maui

    Edit maui.cfg as follows (an example is put at maui-3.3.1/maui.cfg.eureka)

         RMPOLLINTERVAL 00:00:15
         #BACKFILLPOLICY         FIRSTFIT
         #RESERVATIONPOLICY      CURRENTHIGHEST
         #NODEALLOCATIONPOLICY   MINRESOURCE
    
         # <==== Add by Nelson ====>
         JOBAGGREGATIONTIME      00:00:04
    
         # Backfill
    
         BACKFILLPOLICY          FIRSTFIT
         RESERVATIONPOLICY       NEVER
    
         # Node Allocation
    
         NODEALLOCATIONPOLICY    FIRSTAVAILABLE
    
         # Set Job Flags
         JOBACTIONONNODEFAILURE  CANCEL
         JOBNODEMATCHPOLICY      EXACTNODE
    
      systemctl start maui.d
      source /etc/profile.d/maui.sh
  9. Other packages

    1. screen, pdsh

      yum -y install screen
      yum -y install pdsh
    2. FFTW

      mkdir -p /software/fftw
      cd /work1/shared/spock/package/fftw/fftw-2.1.5-revised
      # [Optional] Edit Fish_Install.sh
      sh Fish_Install.sh >& log.spock-intel

      After installation

      cd /software/fftw
      ln -s /work1/shared/spock/package/fftw src
      ln -s 2.1.5-intel default
    3. HDF5

      mkdir -p /software/hdf5
      cd /work1/shared/spock/package/hdf5/hdf5-1.10.6
      # [Optional] Edit Fish_Install.sh
      sh Fish_Install.sh >& log.spock

      After installation

      cd /software/hdf5
      ln -s /work1/shared/spock/package/hdf5 src
      ln -s 1.10.6 default
    4. GSL

      mkdir -p /software/gsl
      cd /work1/shared/spock/package/gsl/gsl-2.6
      # [Optional] Edit Fish_Install.sh
      sh Fish_Install.sh >& log.spock

      After installation

      cd /software/gsl
      ln -s /work1/shared/spock/package/gsl src
      ln -s 2.6 default
    5. gnuplot

      mkdir -p /software/gnuplot
      cd /work1/shared/spock/package/gnuplot/gnuplot-5.2.8
      # [Optional] Edit Fish_Install.sh
      sh Fish_Install.sh >& log.spock

      After installation

      cd /software/gnuplot
      ln -s /work1/shared/spock/package/gnuplot src
      ln -s 5.2.8 default
    6. Latest GNU compiler

    7. UCX library

  10. Miscellaneous

    1. Language
      Add the following lines to /etc/environment

      LANG=en_US.utf-8
      LC_ALL=en_US.utf-8 
    2. Check

      1. After the installation finishes, check in Applications > Utilities > Disks: it should show two disks with the same partition settings and 4 RAID-1 arrays under them.
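The same check can be done from the command line with `cat /proc/mdstat`, `lsblk`, or `mdadm --detail /dev/md0`; a sketch of what to look for (the mdstat line below is an illustrative sample, not captured output):

```shell
# A healthy two-disk RAID-1 shows both members up: "[2/2] [UU]".
# Sample stand-in for a line from /proc/mdstat on the installed system:
sample='md0 : active raid1 sda3[1] sdb3[0]
      467763200 blocks super 1.2 [2/2] [UU]'
printf '%s\n' "$sample" | grep -q '\[UU\]' && echo "RAID-1: both members up"
```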
xuanweishan commented 1 year ago

Computing node installation. (CentOS 7)

0. Check switch settings on the MB

  1. VGA switch -> off

  2. IPMI switch -> left (default)

  3. PSU (PHANTEKS) hybrid -> press down

1. Set up BIOS

  4. Update BIOS

    1. Check BIOS version : Main -> BIOS Information -> Version

      If Version = 1003 x64, skip the steps in `Update BIOS`

    2. Plug in the USB disk with "BIOS" label to USB socket labeled with "BIOS".
    3. Press power switch to boot up.
    4. Keep pressing Delete or F2 during boot to get into the BIOS.
    5. Tool -> ASUS EZ Flash 3 Utility
    6. Find the folder PRO_WS_WRX80E-SAGE_SE_WIFI-ASUS-1003(1)
    7. Find the file PRO-WS-WRX80E-SAGE-SE-WIFI-ASUS-1003.CAP
    8. Yes
    9. Reboot with Save Changes and Exit or press F10.
    10. Check: Main -> BIOS Information -> Version 1003 x64
  5. Set up RAM frequency

    1. Ai Tweaker -> Choose D.O.C.P
    2. D.O.C.P -> Choose D.O.C.P DDR4-3200 16-18-18-38-1.35V
    3. Reboot.
    4. Check: Main -> Total Memory : 262144 MB -> Speed : 3200 MHz
  6. Enable NUMA

    1. Advanced -> AMD CBS -> DF Common Option -> Memory Addressing
    2. NUMA nodes per socket -> Choose NPS2
  7. Set up boot disk

    1. Plug in the USB with
    2. Boot > Choose USB to boot
    3. Reboot.

2. Install CentOS 7

  Ref. https://github.com/calab-ntu/eureka/wiki/System-Installation%3A-Computing-Node#2-install-centos-7

  8. Boot up > Test this media & install CentOS 7

  9. Installation summary:

    1. LANGUAGE: English (United States)
    2. DATE & TIME: Asia, Taipei
    3. SOFTWARE SELECTION: Development and Creative Workstation
      • [x] Additional Development
      • [x] Compatibility Libraries
      • [x] Development Tools
      • [x] Emacs
      • [x] File and Storage Server
      • [x] Hardware Monitoring Utilities
      • [x] Infiniband Support
      • [x] Legacy X Window System Compatibility
      • [x] Network File System Client
      • [x] Platform Development
      • [x] Python
      • [x] Technical Writing
    4. INSTALLATION DESTINATION: Select the INTEL SSD -> I will configure partitioning -> Done -> Click here to create them automatically -> Verify that the total space is about 500 GB (**** GiB in my test)
      • /home: click - to remove it, since we will mount a remote home
      • /: 448 GiB
      • swap: 16 GiB (576.99 M left)
      Then Done -> Accept Changes
    5. NETWORK & HOST NAME: Ethernet -> ON; Host name (bottom left corner): spockXX -> Apply
    6. After finishing (a)-(e) above, press Begin Installation (ignore the warning about SMT)
    7. Set root password

      The installation will take ~10 min

    8. Finish installation.
      1. Reboot -> Accept the EULA agreement -> FINISH CONFIGURATION
      2. Location Services: OFF; Time Zone: Taipei, Taiwan; Connect Your Online Accounts: Skip
         About You (this account will be removed later): Full Name: tmp_account, Username: tmp_account, Password: ****
    9. Unplug the USB stick
    10. Validate kernel (switch to root)
      uname -r # 3.10.0-1062.el7.x86_64
    11. Set up the network: edit /etc/sysconfig/network-scripts/ifcfg-XXX, where XXX is usually "enpXXX"
      BOOTPROTO=static
      ONBOOT=yes
      IPADDR=192.168.0.2XX    # Replace XX by the node number (node number = 00 for the login node)
      GATEWAY=192.168.0.1
      NETMASK=255.255.255.0
      DNS1=140.112.254.4
    12. Disable the network interface virbr0 --> It prevents MPI from working properly --> Ref: https://www.thegeekdiary.com/centos-rhel-67-how-to-disable-or-delete-virbr0-interface/
      virsh net-autostart default --disable
    13. Disable SELinux and firewall (necessary for torque)

      setenforce 0
      sed -i 's/^SELINUX=.*/SELINUX=disabled/g' /etc/selinux/config # or just edit "/etc/selinux/config" directly
      systemctl stop firewalld
      systemctl disable firewalld
      
      sestatus # Check the SELinux status
                    # --> Before reboot, it should show
                    #     SELinux status:                 enabled
                    #     Current mode:                 permissive
                    # --> After reboot, it should show
                    #     SELinux status:                 disabled
    14. Shut down the system, move it to the computer room, and plug in the Ethernet cable
    15. Boot -> validate the network: the Ethernet port LED is on; the switch LED is on (orange for 10 Gb/s)
      ping 192.168.0.200
      ifconfig: No virbr0 interface

3. System setup

  10. Remove the temporary user

         userdel -r tmp_account
  11. NFS client

    1. Auto mount
       scp 192.168.0.200:/work1/shared/pluto/etc/hosts /etc/hosts
       ssh pluto00 cat /work1/shared/pluto/etc/fstab >> /etc/fstab
    2. Create folders
       mkdir /software /work1 /project /projectX /projectY /projectZ
    3. Check the accessibility of the target NFS servers:

       showmount -e pluto00    # /software 192.168.0.0/24

       showmount -e tumaz      # /home 192.168.0.0/24

       showmount -e ironman    # /volume1/gpucluster1 192.168.0.0/24
                               # /volume2/gpucluster2 192.168.0.0/24
                               # /volume3/gpucluster3 192.168.0.0/24

       showmount -e eater      # /volume1/gpucluster3 192.168.0.0/24
                               # /volume2/gpucluster4 192.168.0.0/24
    4. Mount all remote folders

       mount /software; mount /home; mount /work1; mount /project; mount /projectX; mount /projectY; mount /projectZ
      
       df # Check if all folders have been mounted
  12. CUDA

    1. Update system

      yum -y update
    2. Set the text mode as default (since the NVIDIA driver cannot be installed while X window is running)

       systemctl set-default multi-user.target

      Check it with

       systemctl get-default   # It should show "multi-user.target" instead of "graphical.target"
    3. Reboot

    4. Install EPEL and ELRepo

       # [Optional] yum may fail due to incorrect local system time (which causes failure of certification)
       #            --> Solution 1: reset system time directly with, for example, 'date -s "2020-03-22 19:06:26"'
       #            --> Solution 2: enable NTP by following step 3-(5) below
      
       yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
       yum -y install https://www.elrepo.org/elrepo-release-7.0-4.el7.elrepo.noarch.rpm
    5. Install dkms

       yum -y install dkms
    6. Disable the display driver nouveau

      1. Add the following line to the end of GRUB_CMDLINE_LINUX in /etc/default/grub

        rd.driver.blacklist=nouveau nouveau.modeset=0

        Alternatively, one can simply copy it from

        cp /work1/shared/pluto/etc/grub /etc/default/grub
      2. Execute the following

        grub2-mkconfig -o /boot/grub2/grub.cfg
        
        echo "blacklist nouveau"         >> /etc/modprobe.d/blacklist.conf
        echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist.conf
        
        mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
        dracut /boot/initramfs-$(uname -r).img $(uname -r)
        grub2-mkconfig -o /boot/grub2/grub.cfg
      3. Reboot

      4. Verify that the nouveau driver is not loaded

        lsmod | grep nouveau # It should print nothing
    7. Install the CUDA driver

      1. [Optional] Disable persistence and MPS daemons if you are upgrading an existing driver

        nvidia-smi -pm 0
        echo quit | nvidia-cuda-mps-control
      2. Install with

        sh /work1/shared/pluto/package/cuda/cuda_12.1.0_530.30.02_linux.run --silent --driver
      3. Validate with

        cat /proc/driver/nvidia/version # NVIDIA driver: 530.30.02 gcc: 4.8.5
      4. Copy initialization scripts

        cp /work1/shared/pluto/init_script/*.sh  /etc/profile.d/
        cp /work1/shared/pluto/init_script/*.csh /etc/profile.d/
        
        cp /work1/shared/eureka/etc/rc.local /etc/rc.d/
        chmod +x /etc/rc.d/rc.local
      5. [Optional] Comment out the line nvidia-cuda-mps-control -d in /etc/rc.d/rc.local to disable MPS on nodes running tensorflow on pre-Volta GPUs (e.g., eureka32 and eureka33). It is because stream callbacks are not supported on pre-Volta MPS clients; see https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf.

      6. [Optional] On dual-GPU nodes, edit /etc/rc.d/rc.local so that CUDA identifies both GPUs

        export CUDA_VISIBLE_DEVICES=0,1
        nvidia-smi -i 0,1 -c EXCLUSIVE_PROCESS
      7. Reboot

      8. Validate the following

        nvidia-smi -q | grep Persistence      # Enabled
        nvidia-smi -q | grep "Driver Version" # 525.85.12
        nvidia-smi -q | grep "CUDA Version"   # 12.0
        nvidia-smi -q | grep "Compute Mode"   # Exclusive_Process
        
        ps -ef | grep mps # nvidia-cuda-mps-control -d
  13. NIS client

    1. Install

      yum -y install ypbind
    2. Configure

      setup # [Authentication configuration] -> [Use NIS]
      Domain: tumaz.gpucluster.calab
      IP: tumaz
      systemctl is-enabled ypbind # enabled
    3. Do not create the perl5 directory by setting PERL_HOMEDIR=0 in

      /etc/profile.d/perl-homedir.sh
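A sketch of that edit done with sed on a scratch copy (the original `PERL_HOMEDIR=1` line is assumed — check the real file; point `PH` at `/etc/profile.d/perl-homedir.sh` to apply for real):

```shell
# Dry-run of the PERL_HOMEDIR edit on a scratch stand-in.
PH=$(mktemp)
echo 'PERL_HOMEDIR=1' > "$PH"    # assumed original setting
sed -i 's/^PERL_HOMEDIR=.*/PERL_HOMEDIR=0/' "$PH"
grep PERL_HOMEDIR "$PH"          # PERL_HOMEDIR=0
```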
    4. Reboot

    5. Validate the following

      yptest         # "1 tests failed"
                     # "Test 9: yp_all" should list all accounts
                     # "Test 3" may fail with the message "WARNING: No such key in map (Map passwd.byname, key nobody)"
      ypwhich        # tumaz.gpucluster.calab
      
      ls -l /home    # Show correct user and group names instead of just UID and GID
      
      # Log in using an existing user account
  14. NTP client

    yum -y install ntp
    systemctl start ntpd
    systemctl enable ntpd
    systemctl is-enabled ntpd # enabled
  15. TORQUE

    1. Install the required packages

      yum -y install tcl-devel tk-devel
    2. Install torque

      cd /work1/shared/pluto/package/torque/src/torque-3.0.6
      # WARNING: do NOT run "Fish_Install.sh" in parallel (i.e., install one node at a time)
      sh Fish_Install.sh >& log.plutoXX
      cd ../../etc
      cp pbs /etc/init.d/
      # Login node only: edit "pbs.conf" to set "start_server=1" and "start_mom=0"
      cp pbs.conf /etc/
      cp nodes_pluto /var/spool/TORQUE/server_priv/nodes
      systemctl enable pbs
      source /etc/profile.d/torque.sh
      cd ../src/torque-3.0.6/
      ./torque.setup root
      killall pbs_server
      systemctl start pbs # This error message is fine: "LOG_ERROR::No such file or directory (2) in read_config, fstat: config"
      systemctl status pbs
    3. Check

      cat /var/spool/TORQUE/pbs_environment # LANG=en_US.utf-8
  16. InfiniBand

    1. Ref. https://docs.nvidia.com/networking/display/MLNXOFEDv571020
    2. Check hardware

      lspci -v | grep Mellanox
      
      # 09:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
      # Subsystem: Mellanox Technologies Device 0003
    3. Uninstall conflicting Intel Omni-Path packages first
      yum -y remove opa-libopamgt opa-address-resolution
      cd /work1/shared/pluto/package/ib/adaptor/driver/MLNX_OFED_LINUX-5.7-1.0.2.0-rhel7.9-x86_64
      ./mlnxofedinstall -h # Print usage
    4. Try one of the following commands
      ./mlnxofedinstall                          # update firmware automatically [use this until new firmware release]
      ./mlnxofedinstall --fw-image-dir /tmp/my_fw_bin_files # specify the firmware (see above for the currently used version)
      ./mlnxofedinstall --without-fw-update                           # no firmware update
    5. [Optional] If the installation fails, try the following commands.
      ./mlnxofedinstall --add-kernel-support
      dracut -f
    6. Reboot.
  17. Miscellaneous setup

    1. IPMI driver and tool

      yum -y install OpenIPMI ipmitool
    2. Enable the ipmi service

      systemctl enable ipmi.service
    3. Check CPU temperature

      ipmitool sensor get "CPU Temp." 
    4. gnuplot, htop

       yum -y install gnuplot htop
    5. ssh without password for the root

       cd /work1/shared/pluto/ssh_root/
       cp authorized_keys id_rsa* /root/.ssh/
      
       # Verification
       ssh pluto00   # "yes" to "continue connecting"
       ssh plutoXX   # "yes" to "continue connecting"
       exit
       exit
    6. Intel compiler

       cd /opt
       ln -s /software/intel
       rm -rf ./rh # not related to Intel, actually
    7. ffmpeg

       rpm --import http://li.nux.ro/download/nux/RPM-GPG-KEY-nux.ro
       rpm -Uvh http://li.nux.ro/download/nux/dextop/el7/x86_64/nux-dextop-release-0-5.el7.nux.noarch.rpm
       yum -y install ffmpeg ffmpeg-devel
    8. Time stamp of command history

      1. su
      2. Add export HISTTIMEFORMAT='%d/%m/%y %T ' to the end of file /etc/profile
      3. source /etc/profile
      4. Check by history
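Steps 2-3 can be sketched as follows on a scratch copy of /etc/profile (the `date` line merely previews the timestamp format `history` will prepend):

```shell
# Append the HISTTIMEFORMAT export to a scratch stand-in for /etc/profile.
PROFILE=$(mktemp)
echo "export HISTTIMEFORMAT='%d/%m/%y %T '" >> "$PROFILE"
grep -c HISTTIMEFORMAT "$PROFILE"   # 1
date +'%d/%m/%y %T'                 # e.g. 24/05/23 14:02:31
```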
  18. Python

    1. Python2

      • [For current system] sh /work1/shared/pluto/package/python2/install-python-packages.sh

      • [Optional] If the installation of mpi4py fails (which somehow freezes the screen and causes failure of yt after reboot),

         # follow the steps below:
         pip uninstall mpi4py
         cd /usr/lib64/python2.7/site-packages
         rm -rf yt yt-3.5.1.dist-info
         pip install yt
         pip install mpi4py
        
         # check dependency
         pip check ipython numpy cython jupyter h5py scipy astropy matplotlib yt mpi4py # No broken requirements found.
        
         # check MPI
         su YOUR_ACCOUNT
         mpirun -n 16 python -m mpi4py.bench helloworld # Hello, World! I am process ? of 16 on plutoXX.
        
         # check yt
         cd /tmp
         cp -r /work1/shared/eureka/test/yt ./test-yt
         cd test-yt
         sh test.sh # It should generate two PNG images Data_000000_Projection_z_density.png & Data_000001_Projection_z_density.png
         cd ..
         rm -rf test-yt
        
         # check HDF5
         python -c 'import h5py; print( "%s"%h5py.version.hdf5_version )' # 1.10.6
    2. Python3
      1. Install python3 with source code:
        1. cd /work1/shared/pluto/python3/Python-3.9.10
        2. ./configure --enable-optimizations --enable-shared
        3. make altinstall
        4. cp libpython3.9.so.1.0 /usr/lib64/
        5. Check python3.9
        6. Link python3 to python3.9:
          rm /usr/bin/python3
          ln -s /usr/local/bin/python3.9 /usr/bin/python3
      2. Upgrade pip for python3: pip3 install -U pip
      3. Install necessary packages:
        pip3 install --ignore-installed pyparsing
        pip3 install --upgrade dnspython
        pip3 install --upgrade pyudev
        pip3 install --ignore-installed --upgrade python-ldap
        pip3 install --upgrade ipython
        pip3 install --upgrade numpy
        pip3 install --upgrade cython
        pip3 install --upgrade six
        pip3 install --upgrade urllib3
        pip3 install idna==2.10
        pip3 install --upgrade certifi
        pip3 install chardet==4.0.0
        pip3 install --upgrade jupyter
        export HDF5_DIR=/software/hdf5/default
        pip3 install --no-binary=h5py h5py
        pip3 install --upgrade scipy
        pip3 install --upgrade astropy
        pip3 install --upgrade matplotlib
        pip3 install --upgrade gitpython
        pip3 install --upgrade pandas
      4. Install yt in python3 from source code.
        1. cp -r /work1/shared/pluto/package/python3/yt /usr/local/lib/python3.9/
        2. cd /usr/local/lib/python3.9
        3. pip3 install -e .
        4. Check:
          1. LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/software/gcc/default/lib64
          2. python3 -c "import yt"
        5. check yt version and update yt.
          1. python3 -m pip install -U yt
          2. yt version
      5. Install other python3 packages
        pip3 install mpi4py
        pip3 install dulwich
        pip3 install girder-client
  19. Check

    1. NUMA

       lscpu | grep NUMA # It should show **2** NUMA nodes
    2. Memory frequency [Need to be confirmed]

       dmidecode -t memory | grep Speed # It should show **Speed: 3200 MT/s; Configured Memory Speed: 2666 MT/s**
    3. [Optional] Queuing system and GAMER performance

       # log in to spock00
       su your_account
       qsub -I -lnodes=spockXX:ppn=16
       cd /tmp
       cp -r /work1/shared/eureka/test/gamer/bin/template/ tmp-gamer
       cd tmp-gamer
       mpirun -map-by ppr:4:socket:pe=4 --report-bindings ./gamer
       awk 'NR>1 {print $7}' Record__Performance # Performance should be ~4.2e7
       cd ..
       rm -rf ./tmp-gamer/
       exit
    4. [Optional] CPU/GPU temperature

      1. CPU: stress-ng
      2. GPU: /work1/xuanshan/gpu-burn/run.sh
    5. Malware scanner ClamAV

      1. Install ClamAV: yum install -y clamav clamd clamav-update
      2. Download and update ClamAV’s official virus signature databases: freshclam
      3. Scan directories: clamscan /etc/* /opt/* /boot/* /usr/* /tmp/* /sys/* /run/* /root/* /var/*
      4. Check result:
        ----------- SCAN SUMMARY -----------
        Engine version: 0.103.5
        Scanned directories: 5
        Scanned files: 2774
        Infected files: 0
        Data scanned: 565.18 MB
        Data read: 668.04 MB (ratio 0.85:1) 
        Time: 50.516 sec (0 m 50 s)
        Start Date: 2022:05:12 10:21:31
        End Date:   2022:05:12 10:22:22

        If Infected files is 0, then pass.
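The pass criterion can also be checked in a script by parsing the "Infected files" line; a sketch (here run on a one-line sample — normally save the summary first with something like `clamscan ... | tee scan.log`):

```shell
# Parse the "Infected files" count from a clamscan summary line and
# decide pass/fail. The sample string stands in for the real summary.
summary='Infected files: 0'
infected=$(printf '%s\n' "$summary" | awk -F': ' '/^Infected files/ {print $2}')
[ "$infected" -eq 0 ] && echo "PASS" || echo "FAIL: $infected infected files"
```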

Ref:

xuanweishan commented 1 year ago

Install nodes (Ubuntu 22.04 server)

1. Check switch settings on the MB

  1. VGA switch -> off

  2. IPMI switch -> left (default)

  3. PSU (PHANTEKS) hybrid -> press down

  4. Change the cooling fan header from CPU_OPT to CHA_FAN1

  5. Unplug the micro-USB plug from the CPU pump.

2. Set up BIOS

  1. Boot up the machine with the BIOS flash disk plugged in.

    If the machine boots for the first time, it will ask whether you want to initialize the CPU configuration. Press Y to confirm.

  2. Check the BIOS version and update

    • Enter the BIOS by pressing Delete or F2 during boot.
      1. Check BIOS version : Main -> BIOS Information -> Version

        If Version = 1106 x64, skip the steps in Update BIOS

      2. Plug in the USB disk with "BIOS" label to USB socket labeled with "BIOS".
      3. Keep pressing Delete or F2 during boot to get into the BIOS.
      4. Tool -> ASUS EZ Flash 3 Utility
      5. Find the folder PRO_WS_WRX80E-SAGE_SE_WIFI-ASUS-1106
      6. Find the file PRO-WS-WRX80E-SAGE-SE-WIFI-ASUS-1106.CAP
      7. Yes
      8. Reboot with save changes and exit or press F10.
      9. Check again.
  3. DRAM overclock setting

    1. Ai Tweaker -> Ai overclock Tuner -> Choose D.O.C.P
    2. D.O.C.P -> Choose D.O.C.P DDR4-3200 16-18-18-38-1.35V
    3. F10 reboot.
    4. Check: Main -> Total Memory : 262144 MB -> Speed : 3200 MHz
  4. Enable NUMA

    1. Advanced -> AMD CBS -> DF Common Option -> Memory Addressing
    2. NUMA nodes per socket -> Choose NPS2
  5. F10 Reboot

3. Install ubuntu server 22.04

  1. Download Ubuntu 22.04 from https://www.ubuntu-tw.org/modules/tinyd0/ and make a bootable USB disk with Rufus.

  2. Set up boot disk in BIOS

    1. Boot with bootable USB disk plugged in.
    2. Enter the BIOS by pressing Delete or F2 during boot.
    3. Boot > Choose USB to boot.
    4. F10 to reboot.
  3. Install Ubuntu 22.04

    1. Choose Try or install ubuntu server
    2. Select language : English -> Done
    3. Keyboard configuration.
      • Layout : English (US)
      • Variant : English (US) -> Done
    4. Choose type of install
      • [x] Ubuntu Server
      • [x] Search for third party drivers -> Done
    5. Network connections -> Continue without network
    6. Configure proxy -> Done

      Leave the field empty.

    7. Configure Ubuntu archive mirror -> Done

      Don't change the url

    8. Guided storage configuration
      1. Custom storage layout -> Done
      2. Select disks and reformat all of them.
        • [Login node] Install in RAID1 (Redundancy)
          1. Select the Use As Boot Device on both disks
          2. /boot
          3. Choose free space and select Add GPT partition
            • Size : 1G
            • Format : Leave unformatted -> Create
          4. Create software RAID :
            • RAID name : md0
            • Raid type : Raid 1
            • Format : ext4
            • Mount : /boot -> Create
          5. / : Same steps as /boot with changes
            • Size : 914G
          6. swap : Choose the rest of free space and format them as swap
        • [Computing nodes] Normal install
          1. /boot
          2. Choose free space -> Add GPT Partition
            • Size : 1G
            • Format: ext4
            • Mount: /boot -> Create
          3. / : Same steps as /boot with changes
            • Size : 448G
            • Mount : /
          4. swap
            • Size :

              leave empty to get rest of volume

            • Format : swap -> Done -> Continue
    9. Profile setup
      • Your name: spock**

        ** is the number of node name

      • Your server's name: spock**
      • Pick a username: tmp_account
      • Choose a password: ****
      • Confirm your password: ***
    10. Upgrade to Ubuntu Pro -> Skip Ubuntu Pro setup for now -> Done
    11. SSH Setup -> Done

      Don't check the option.

    12. Third-party drivers
      • [x] Do not install third-party drivers now -> Done
    13. Reboot Now -> Unplug the install medium and press enter to reboot.
  4. Check.

    1. Kernel : uname -r -> 5.15.0-60-generic
    2. CPU : lscpu | grep Model -> AMD Ryzen Threadripper PRO 5975WX 32-Cores
    3. RAM : sudo dmidecode memory | grep Speed
      Configured Speed: 3200 MT/s
      Speed: 2667 MT/s
    4. NUMA : lscpu | grep NUMA
      NUMA node(s)           : 2
      NUMA node0 CPU(s) : 0-15, 32-47
      NUMA node1 CPU(s) : 16-31, 48-63
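As a quick sanity check of the NPS2 layout reported above, the lscpu ranges can be hard-coded into a small helper that maps a CPU id to its NUMA node (the helper name is hypothetical, for illustration only):

```shell
# Map a CPU id to its NUMA node, hard-coding the NPS2 layout shown above:
# node0: 0-15 and 32-47; node1: 16-31 and 48-63.
numa_node_of() {
  cpu="$1"
  if [ "$cpu" -le 15 ] || { [ "$cpu" -ge 32 ] && [ "$cpu" -le 47 ]; }; then
    echo 0
  else
    echo 1
  fi
}

numa_node_of 40   # -> 0
numa_node_of 20   # -> 1
```

This is useful when pinning MPI ranks to the node that owns their memory.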

4. Set up settings

  1. Network settings.

    1. Edit netplan : sudo vim /etc/netplan/00-installer-config.yaml
      # This is the network config written by 'subiquity'
      network:
        version: 2
        ethernets:
          enp*****0:
            dhcp4: true
          enp*****1:
            dhcp4: false
            addresses: [192.168.0.2**/22] # ** is replaced by the node number
            nameservers:
              addresses: [140.112.254.4]
            routes:
              - to: default
                via: 192.168.0.1
    2. Apply netplan: sudo netplan apply
    3. Poweroff the machine and move it to machine room.
      • Plug the ethernet cable to the upper ethernet port.
    4. Check
      1. ip settings : ip addr show dev enp*****1
        inet 192.168.0.2**/22
      2. DNS server : resolvectl status
        Link 3 (enp*****1)
            Current Scopes: DNS
                Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
        Current DNS Server: 140.112.254.4
               DNS Servers: 140.112.254.4
      3. ping 192.168.0.150
    5. Get the system network information.
      sudo -i
      scp [your_account]@192.168.0.150:/work1/shared/spock/etc/hosts /etc/hosts
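The addressing convention above (node spockXX gets 192.168.0.2XX/22) can be sketched as a small helper; the function name is hypothetical and shown only to make the scheme explicit:

```shell
# Hypothetical helper: derive a node's static IP from its hostname,
# assuming the spockXX -> 192.168.0.2XX/22 convention described above.
node_ip() {
  num="${1#spock}"             # strip the "spock" prefix, keep the node number
  echo "192.168.0.2${num}/22"
}

node_ip spock07   # -> 192.168.0.207/22
```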
  2. Update system

    1. Operate with root privileges : sudo -i
    2. apt update
    3. apt-get install -y linux-image-5.15.0-78-generic Press enter twice as kernel update UI appears.
    4. reboot
    5. sudo -i
    6. check : uname -r 5.15.0-78-generic # or above
    7. Change group name of ID 1000 : groupmod --new-name calab tmp_account
    8. Set root password : passwd
    9. Delete /home/tmp_account : rm -r /home/tmp_account
    10. Change sh link from dash to bash:
      sudo dpkg-reconfigure dash
      # The configuration UI will ask whether to use dash as the default sh.
      # Choose "No" so that sh points to bash.
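The "5.15.0-78-generic or above" check in step 6 can be automated with GNU sort's version ordering; a sketch with a hypothetical helper name:

```shell
# Return success if the current kernel version is at least the required one.
# sort -V orders version strings; if the smaller of the two is the required
# version, the current kernel is new enough.
kernel_at_least() {
  required="$1"; current="$2"
  [ "$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n1)" = "$required" ]
}

kernel_at_least "5.15.0-78" "$(uname -r)" && echo "kernel OK" || echo "kernel too old"
```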
  3. Time stamp of command history

    1. su
    2. Add export HISTTIMEFORMAT='%d/%m/%y %T ' to the end of file /etc/profile
    3. source /etc/profile
    4. Check by history
  4. Set timezone

    1. su
    2. timedatectl set-timezone Asia/Taipei
    3. Check timedatectl show
  5. NFS settings

    1. Client

      1. sudo -i
      2. Install NFS client. apt -y install nfs-common
      3. Get auto mount settings from work1.
        ssh [your_account]@eureka00 cat /work1/shared/spock/etc/fstab >> /etc/fstab

        [Login node only] Comment out the line starting with spock00:/software

      4. Create directories. mkdir /software /work1 /projectV /projectW /projectX /projectY /projectZ
      5. Check the accessibility of the target NFS servers
        showmount -e spock00    # /software 192.168.0.0/24  [Skip on login node]
        showmount -e tumaz      # /home 192.168.0.0/24
        showmount -e ironman    # /volume1/gpucluster1 192.168.0.0/24
                                # /volume3/gpucluster3 192.168.0.0/24
        showmount -e eater      # /volume1/gpucluster3 192.168.0.0/24
                                # /volume2/gpucluster4 192.168.0.0/24
                                # /volume3/gpucluster6 192.168.0.0/24
        showmount -e pacific    # /volume1/gpucluster1 192.168.0.0/24
      6. Mount all remote directories.
        mount /software; # Skip in process on login node 
        mount /home; mount /work1; mount /projectW; mount /projectX; mount /projectY; mount /projectZ; mount /projectV
      7. Check : df -h

        tumaz:/home                   208G   22G  176G  12% /home
        
        ironman:/volume1/gpucluster1   70T   47T   24T  67% /work1
        ironman:/volume3/gpucluster3   70T   70T  643G 100% /projectX
        
        eater:/volume1/gpucluster3     70T   67T  3.6T  95% /projectY
        eater:/volume2/gpucluster4     88T   77T   12T  88% /projectZ
        eater:/volume3/gpucluster6     88T   75T   13T  86% /projectW
        
        pacific:/volume1/gpucluster1  140T   20T  120T  15% /projectV
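For reference, a line in the shared fstab consistent with the mounts above presumably looks like the fragment below; the mount options are an assumption, and the authoritative copy remains /work1/shared/spock/etc/fstab:

```
ironman:/volume1/gpucluster1  /work1  nfs  defaults,_netdev  0  0
```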
    2. Server [Login node only]
      1. Install NFS server : sudo apt -y install nfs-kernel-server
      2. Create the directory to be exported if it does not exist : ll /software (should report that /software does not exist), then mkdir /software
      3. Copy the NFS export settings to /etc/exports : cp /work1/shared/spock/etc/exports /etc/exports
      4. Start and enable the NFS server :
        systemctl restart nfs-kernel-server.service
        systemctl enable nfs-kernel-server.service
      5. Check the NFS server status and result.
        systemctl status nfs-kernel-server.service
        # Active: active (exited)
        showmount -e spock00
        # /software 192.168.0.0/24
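The copied /etc/exports presumably contains a line like the following, matching the showmount output above; the export options are an assumption, and the authoritative copy remains /work1/shared/spock/etc/exports:

```
/software  192.168.0.0/24(rw,sync,no_subtree_check)
```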
  6. NIS settings

    1. Install NIS client. sudo apt -y install nis
    2. Configure as a NIS Client.
      1. vim /etc/yp.conf , add follow text at the end.
        domain tumaz.gpucluster.calab server tumaz
      2. vim /etc/nsswitch.conf
        passwd:         files systemd nis
        group:          files systemd nis
        shadow:         files nis
        hosts:          files dns nis 
      3. Set NIS domain name, vim /etc/defaultdomain
        tumaz.gpucluster.calab
      4. Start and enable nis.
        systemctl restart ypbind
        systemctl enable ypbind
    3. check :
      1. ll /home
      2. yptest : 1 test fail
      3. ypwhich : tumaz
    4. Logout and login with your own account, then su. Delete tmp_account : userdel --remove tmp_account. It's okay to receive these error messages:
      userdel: tmp_account mail spool (/var/mail/tmp_account) not found 
      userdel: tmp_account home directory (/home/tmp_account) not found
  7. Install GPU driver

    1. Set the text mode as default (since the NVIDIA driver cannot be installed while X window is running) systemctl set-default multi-user.target
    2. Reboot.
    3. su
    4. Install dkms : apt -y install dkms
    5. Disable nouveau : Create file /etc/modprobe.d/blacklist-nouveau.conf with content:
      blacklist nouveau
      options nouveau modeset=0
    6. Apply system changes update-initramfs -u
    7. Reboot.
    8. su
    9. Check nouveau is disabled : lsmod | grep nouveau

      This should print nothing.

    10. Install the NVIDIA driver

      1. Install :
        su
        sh /work1/shared/spock/package/cuda/cuda_12.1.0_530.30.02_linux.run --silent --driver
      2. Validate with cat /proc/driver/nvidia/version:
        NVRM version: NVIDIA UNIX x86_64 Kernel Module  530.30.02
        GCC version:  gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04)
      3. Copy the default profile files.

        cp /work1/shared/spock/init_script/*.sh  /etc/profile.d/
        cp /work1/shared/spock/init_script/*.csh /etc/profile.d/
        
        cp /work1/shared/spock/etc/rc.local /etc/
        chmod +x /etc/rc.local
      4. Change GPU settings [login node only] Edit /etc/rc.local as follows
        1. Comment out the line /usr/bin/nvidia-persistenced --verbose
        2. Comment out the line nvidia-cuda-mps-control -d
        3. Replace nvidia-smi -i 0 -c EXCLUSIVE_PROCESS by nvidia-smi -i 0 -c PROHIBITED
      5. Reboot
    11. Check nvidia-smi
      NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1
  8. NTP client

    1. su
    2. apt -y install ntp
    3. systemctl status ntp
    4. systemctl enable ntp
  9. TORQUE

    1. Install the required packages
      apt -y install libnuma-dev
      apt -y install tcl-dev tk-dev
      apt -y install libntirpc-dev
      sh /work1/shared/spock/package/torque/src/torque-3.0.6/spock_library_set.sh
    2. Compile and install from source code.

      cd /work1/shared/spock/package/torque/src/torque-3.0.6
      # WARNING: do NOT run "spock_Install.sh" in parallel (i.e., install one node at a time)
      # [Login node ] uncomment "--enable-server"
      # [Computing nodes] comment "--enable-server"
      sh spock_Install.sh >& log.spockXX
      cd ../../etc
      cp pbs_spock /etc/init.d/pbs
      ln -s /etc/init.d/pbs /etc/systemd/system/
      
      cp pbs.conf /etc/
      # [Login node only]: edit "pbs.conf" to set "start_server=1" and "start_mom=0"
      cp nodes_spock /var/spool/TORQUE/server_priv/nodes
      systemctl enable pbs 
      
      source /etc/profile.d/torque.sh
      cd ../src/torque-3.0.6/
      ./torque.setup root
      killall pbs_server
      systemctl start pbs # This error message is fine: "LOG_ERROR::No such file or directory (2) in read_config, fstat: config"
      systemctl status pbs
    3. Check cat /var/spool/TORQUE/pbs_environment : LANG=en_US.utf-8
    4. Set up the overcommit ratio and disable memory overcommit via crontab
      1. cp /work1/shared/spock/helper_script/disable_memory_overcommit.sh /root/
      2. Edit crontab with crontab -e and add a new line:
        @reboot /usr/bin/sh /root/disable_memory_overcommit.sh 1> /tmp/disable_memory_overcommit.log 2>&1
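The helper script's name suggests it disables memory overcommit; in sysctl terms that would look roughly like the fragment below. The exact values are an assumption — the authoritative logic is in disable_memory_overcommit.sh:

```
# Hypothetical /etc/sysctl.d equivalent of disable_memory_overcommit.sh:
vm.overcommit_memory = 2     # never overcommit: refuse allocations beyond the limit
vm.overcommit_ratio = 100    # commit limit = swap + 100% of RAM
```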
  10. [Optional] [Login node only] Create the SSH key of root [Testing]

    ssh-keygen -t rsa
    cd ~/.ssh
    cp id_rsa.pub authorized_keys
    cp id_rsa* authorized_keys /work1/shared/spock/ssh_root/
  11. InfiniBand

    ref. https://docs.nvidia.com/networking/display/MLNXOSv3105002/Getting+Started#heading-RerunningtheWizard

    1. Check hardware lspci | grep Mellanox
      01:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    2. Install necessary package
      apt -y install libsasl2-dev  libldap2-dev libssl-dev 
    3. Install driver
      1. su
      2. cd /work1/shared/spock/package/ib/adaptor/driver/MLNX_OFED_LINUX-5.9-0.5.6.0-ubuntu22.04-x86_64
      3. ./mlnxofedinstall
        Device #1:
        ----------
        Device Type:      ConnectX6
        Part Number:      MCX653105A-HDA_Ax
        Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
        PSID:             MT_0000000223
        PCI Device Name:  01:00.0
        Base GUID:        0c42a10300ef2a1a
        Versions:         Current        Available
        FW             20.34.1002     20.36.1010
        PXE            3.6.0700       3.6.0901
        UEFI           14.27.0014     14.29.0014
        Status:           Up to date
        ---------
      4. /etc/init.d/openibd restart
      5. reboot
    4. Check
      1. su
      2. ibstatus
        Infiniband device 'mlx5_0' port 1 status:
                default gid:     fe80:0000:0000:0000:0c42:a103:00ef:2a1a
                base lid:        0xffff
                sm lid:          0x0
                state:           4: ACTIVE
                phys state:      5: LinkUp
                rate:            200 Gb/sec (4X HDR)
                link_layer:      InfiniBand
      3. cat /etc/security/limits.conf
        * soft memlock unlimited
        * hard memlock unlimited
      4. systemctl status openibd
        Active: active (exited)
      5. systemctl is-enabled openibd
        enabled
      6. systemctl status opensmd
        Active: inactive (dead)
      7. systemctl is-enabled opensmd
        disabled
      8. hca_self_test.ofed
        ---- Performing Adapter Device Self Test ----
        Number of CAs Detected ................. 1
        PCI Device Check ....................... PASS
        Kernel Arch ............................ x86_64
        Host Driver Version .................... MLNX_OFED_LINUX-5.9-0.5.6.0 (OFED-5.9-0.5.6): 5.15.0-69-generic
        Host Driver RPM Check .................. PASS
        Firmware on CA #0 HCA .................. v20.36.1010
        Host Driver Initialization ............. PASS
        Number of CA Ports Active .............. 1
        Port State of Port #1 on CA #0 (HCA).....  UP 4X HDR (InfiniBand)
        Error Counter Check on CA #0 (HCA)...... PASS
        Kernel Syslog Check .................... PASS
        Node GUID on CA #0 (HCA) ............... 0c:42:a1:03:00:ef:2a:1a
        ------------------ DONE ---------------------
      9. ibdev2netdev -v | grep -i MCX
        0000:01:00.0 mlx5_0 (MT4123 - MCX653105A-HDAT) ConnectX-6 VPI adapter card, HDR IB (200Gb/s) and 200GbE, single-port QSFP56
                                                                                                                   fw 20.36.1010 port 1 (ACTIVE) ==> ibp1s0 (Down)
      10. IB connection and band width test.
        1. Computing nodes -> Login node. On spock00:
          ib_write_bw -aF

          On spockXX

          ib_write_bw -aF spock00
          ************************************
          * Waiting for client to connect... *
          ************************************
          ---------------------------------------------------------------------------------------
          RDMA_Write BW Test
          Dual-port       : OFF          Device         : mlx5_0
          Number of qps   : 1            Transport type : IB
          Connection type : RC           Using SRQ      : OFF
          PCIe relax order: ON
          ibv_wr* API     : ON
          CQ Moderation   : 100
          Mtu             : 4096[B]
          Link type       : IB
          Max inline data : 0[B]
          rdma_cm QPs     : OFF
          Data ex. method : Ethernet
          ---------------------------------------------------------------------------------------
          local address: LID 0x02 QPN 0x0027 PSN 0xcb8c4 RKey 0x1fffbe VAddr 0x007f9c96aaa000
          remote address: LID 0x01 QPN 0x0027 PSN 0x560b74 RKey 0x1fffbe VAddr 0x007f0894517000
          ---------------------------------------------------------------------------------------
          #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
          8388608    5000             23452.55            23452.55                  0.002932
          ---------------------------------------------------------------------------------------
        2. Computing nodes <- Login node. On spock00:
          ib_read_bw -aF

          On spockXX

          ib_read_bw -aF spock00
          ************************************
          * Waiting for client to connect... *
          ************************************
          ---------------------------------------------------------------------------------------
          RDMA_Read BW Test
          Dual-port       : OFF          Device         : mlx5_0
          Number of qps   : 1            Transport type : IB
          Connection type : RC           Using SRQ      : OFF
          PCIe relax order: ON
          ibv_wr* API     : ON
          CQ Moderation   : 100
          Mtu             : 4096[B]
          Link type       : IB
          Outstand reads  : 16
          rdma_cm QPs     : OFF
          Data ex. method : Ethernet
          ---------------------------------------------------------------------------------------
          local address: LID 0x02 QPN 0x0028 PSN 0x593c01 OUT 0x10 RKey 0x1fffbf VAddr 0x007efc3f67f000
          remote address: LID 0x01 QPN 0x0028 PSN 0xbaa0aa OUT 0x10 RKey 0x1fffbf VAddr 0x007f6fd2a85000
          ---------------------------------------------------------------------------------------
          #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
          8388608    1000             23517.75            23517.73                  0.002940
          ---------------------------------------------------------------------------------------
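To relate the numbers above to the 200 Gb/s HDR line rate, the reported bandwidth can be converted from MB/s to Gb/s. This assumes perftest's "MB" means 2^20 bytes; the helper name is hypothetical:

```shell
# Convert a perftest BW figure (MB/sec, with MB = 2^20 bytes) to Gb/s.
mb_to_gbps() { awk -v mb="$1" 'BEGIN { printf "%.1f", mb * 1048576 * 8 / 1e9 }'; }

mb_to_gbps 23452.55   # -> 196.7, i.e. ~98% of the 200 Gb/s line rate
```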
    5. Start mst so that the IB adapter can be monitored
      systemctl enable mst
      systemctl start mst
      mst status
  12. ssh without password for the root

    cd /work1/shared/spock/ssh_root/
    cp authorized_keys id_rsa* /root/.ssh/
    
    # Verification
    ssh spock00   # "yes" to "continue connecting"
    ssh spockXX   # "yes" to "continue connecting"
    exit
    exit

5. Install compilers [Login node only]

  1. Intel compiler

    • [Login node] Install
      1. su
      2. mkdir /software/intel
      3. ln -s /software/intel /opt
      4. cd /work1/shared/spock/package/intel
      5. sh l_BaseKit_p_2023.1.0.46401.sh -a --cli Follow and accept the installation process.
      6. sh l_HPCKit_p_2023.1.0.46346.sh -a --cli Follow and accept the installation process.
    • [Computing nodes] Link
      1. su
      2. cd /opt
      3. ln -s /software/intel
  2. gcc compiler [skip]

    • [Login node] Install latest version
      1. su
      2. mkdir /software/gcc
      3. cd /work1/shared/spock/package/gcc/gcc-12.2.0
      4. sh ./spock_Install.sh >& log.spock
      5. cd /software/gcc
      6. ln -s /work1/shared/spock/package/gcc ./src
      7. ln -s 12.2.0 default

6. Install packages

  1. [Login node only] CUDA

    1. cd /work1/shared/spock/package/cuda
    2. mkdir /software/cuda
    3. sh cuda_12.1.0_530.30.02_linux.run --silent --toolkit --installpath=/software/cuda/12.1
    4. Create default link: ln -s /software/cuda/12.1 /software/cuda/default
  2. [Login node only] Valgrind

    mkdir /software/valgrind
    cd /work1/shared/eureka/package/valgrind/valgrind-3.15.0
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.spock

    After installation

    cd /software/valgrind
    ln -s /work1/shared/spock/package/valgrind src
    ln -s 3.15.0 default
  3. [Login node only] UCX Library

    • Download latest version [optional]
      mkdir /software/openucx
      mkdir /software/src
      cd /software/openucx/src
      git clone https://github.com/openucx/ucx.git ucx
      1. Installation
        cd /software/openucx/src/ucx
        ./autogen.sh
        mkdir build
        cd build
        ../contrib/configure-release --prefix=/software/openucx/ucx-1.15.0_with_mt --enable-mt  #enable  MPI_THREAD_MULTIPLE
        make && make install
  4. [Login node only] OpenMPI

    source /etc/profile.d/intel.sh
    mkdir /software/openmpi
    ln -s /work1/shared/spock/package/openmpi /software/openmpi/src
    cd /software/openmpi/src/openmpi-4.1.5
    # [Optional] Edit spock_Install_with_UCX.sh (remember to un-comment the configuration flags)
    sh spock_Install_with_ucx.sh >& log.spock-all

    After installation

    1. Check ucx
      cd /software/openmpi/4.1.5-ucx_mt-intel-2023.1.0/bin
      objdump -p mpicxx | grep PATH    # see whether /software/openucx/ucx-1.15.0_with_mt/lib is in RPATH
      ldd mpicxx | grep ucx            # see whether dynamic linker can find UCX libraries
    2. make soft link
      cd /software/openmpi
      unlink default                   # optional, if default already existed
      ln -s /software/openmpi/4.1.5-ucx_mt-intel-2023.1.0 default
    3. Check /etc/profile.d/openmpi.sh
    4. Check debugger
      source /etc/profile.d/openmpi.sh
      ompi_info | grep "MCA memchecker" # MCA memchecker: valgrind (MCA v2.1.0, API v2.0.0, Component v3.1.5)
    5. Set the MCA parameters
      1. Edit the configuration file /software/openmpi/4.1.1-intel-oneapi/etc/openmpi-mca-params.conf as root (to query the path of the configuration file, use: ompi_info --params mca all --level 9 | grep mca_param_files). Add the lines below (2021/07/24):
        pml=ucx
        osc=ucx
        btl=^openib

        This includes only UCX for pml, includes only UCX for osc, and excludes openib from btl, which works for OpenMPI 4.1.1 and UCX 1.12.0 without triggering the warning message:

        [eureka01:45901] common_ucx.c:364 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
  5. [Login node only] Maui [testing] [problematic with the sed and gcc versions]

    Install sed 4.2.2

    cd /work1/shared/spock/package/sed/sed-4.2.2
    sh spock_Install.sh

    Install maui

    cd /work1/shared/spock/package/maui/maui-3.3.1/
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.spock
    cd etc/
    cp spock_maui.d /etc/init.d/maui.d
    cp maui.sh maui.csh /etc/profile.d/
    systemctl enable maui.d
    
    cd /usr/local/maui

    Edit maui.cfg as follows (an example is put at maui-3.3.1/maui.cfg.eureka)

         RMPOLLINTERVAL 00:00:15
         #BACKFILLPOLICY         FIRSTFIT
         #RESERVATIONPOLICY      CURRENTHIGHEST
         #NODEALLOCATIONPOLICY   MINRESOURCE
    
         # <==== Add by Nelson ====>
         JOBAGGREGATIONTIME      00:00:04
    
         # Backfill
    
         BACKFILLPOLICY          FIRSTFIT
         RESERVATIONPOLICY       NEVER
    
         # Node Allocation
    
         NODEALLOCATIONPOLICY    FIRSTAVAILABLE
    
         # Set Job Flags
         JOBACTIONONNODEFAILURE  CANCEL
         JOBNODEMATCHPOLICY      EXACTNODE
    systemctl start maui.d
    source /etc/profile.d/maui.sh
  6. [Login node only] FFTW

    • FFTW-2
      mkdir /software/fftw
      cd /work1/shared/spock/package/fftw/fftw-2.1.5-revised
      # [Optional] Edit Fish_Install.sh to install in intel or gcc
      sh Fish_Install.sh >& log.spock-intel

      After installation

      cd /software/fftw
      ln -s /work1/shared/spock/package/fftw src
      ln -s 2.1.5-intel-2023.1.0-openmpi-4.1.1-ucx_mt default
    • FFTW-3
      cd /work1/shared/eureka/package/fftw/fftw-3.3.10
      # [Optional] Edit spock_Install.sh
      sh spock_Install.sh >& log.spock-intel

      After installation

      cd /software/fftw
      ln -s 3.3.10-intel-2023.1.0-openmpi-4.1.1-ucx_mt default3
  7. [Login node only] HDF5

    mkdir -p /software/hdf5
    cd /work1/shared/spock/package/hdf5/hdf5-1.10.6
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.spock

    After installation

    cd /software/hdf5
    ln -s /work1/shared/spock/package/hdf5 src
    ln -s 1.10.6 default
  8. [Login node only] GSL

    mkdir -p /software/gsl
    cd /work1/shared/spock/package/gsl/gsl-2.6
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.spock

    After installation

    cd /software/gsl
    ln -s /work1/shared/spock/package/gsl src
    ln -s 2.6 default
  9. python2

    source /etc/profile.d/openmpi.sh; source /etc/profile.d/intel.sh; source /etc/profile.d/hdf5.sh
    apt -y install python2 python2-dev
    apt -y install python-tk
    cd /work1/shared/spock/package/python2
    python2 get-pip.py
    sh install-python-packages.sh
  10. python3

    apt -y install python3 python3-dev
    apt -y install python3-tk
    apt -y install python3-pip
    cd /work1/shared/spock/package/python3
    sh install-python-packages.sh

    Add /usr/local/bin to PATH by adding a line at the end of /etc/profile

    export PATH=/usr/local/bin:$PATH
  11. Module

    cd /work1/shared/spock/package/module/modules-5.1.1
    make clean
    ./configure
    make
    make install

    After installation

    cp init/profile.sh /etc/profile.d/10-modules.sh
    cp init/profile.csh /etc/profile.d/modules.csh
    source init/bash

    Add /software/intel/oneapi/modulefiles to default module directories by adding the line to the file /usr/local/Modules/etc/initrc

    module use /software/intel/oneapi/modulefiles

    Set up preload module

    ln -s /software/modulefiles/default_modules.sh /etc/profile.d/default_modules.sh

7. Miscellaneous setup

  1. IPMI tool

    1. Install IPMI driver and tool : apt -y install openipmi ipmitool
    2. Check : ipmitool sensor get "CPU Temp."
  2. ffmpeg apt -y install ffmpeg

  3. gnuplot apt -y install gnuplot-x11

  4. screen apt -y install screen

  5. pdsh apt -y install pdsh

  6. locate apt -y install plocate

  7. ClamAV

    apt -y install clamav clamav-daemon
    systemctl stop clamav-freshclam
    freshclam
    systemctl start clamav-freshclam
    systemctl enable clamav-freshclam
  8. X11 server

    apt -y install xorg openbox
  9. CPU usage monitor

    apt -y install sysstat
  10. Image display feh

    apt -y install feh
  11. Disable auto update.

    1. Edit the apt config file at /etc/apt/apt.conf.d/20auto-upgrades as follows.
      APT::Periodic::Update-Package-Lists "0";
      APT::Periodic::Unattended-Upgrade "0";
    2. Verify the config
      apt-config dump APT::Periodic::Update-Package-Lists
      apt-config dump APT::Periodic::Unattended-Upgrade

8. Check

  1. CPU burn-in test

    1. Install CPU test program
      apt -y install stress-ng
    2. Run CPU test
      stress-ng --cpu 0 --timeout 30m &
    3. Detect CPU temperature every minute during test
      for i in {1..40}; do ipmitool sensor | grep "CPU Temp."; sleep 1m; done

      The AMD Threadripper is rated for temperatures up to 95 °C, and the non-critical upper limit is 85 °C. For spock02 the highest observed temperature was 82 °C.
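The per-minute readings can be checked against the 85 °C limit automatically. A sketch that parses an "ipmitool sensor" line (the sample line and the helper name are illustrative):

```shell
# Extract the temperature from an "ipmitool sensor" line (fields are
# '|'-separated; the reading is field 2) and compare it with a limit in °C.
temp_check() {
  line="$1"; limit="$2"
  temp=$(echo "$line" | awk -F'|' '{ gsub(/ /, "", $2); print int($2) }')
  if [ "$temp" -gt "$limit" ]; then
    echo "WARN: ${temp} exceeds ${limit}"
  else
    echo "OK: ${temp}"
  fi
}

temp_check "CPU Temp.        | 82.000     | degrees C | ok" 85   # -> OK: 82
```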

  2. GPU burn-in test

    cd /work1/shared/spock/tests/gpu_burn-in/gpu-burn
    ./gpu_burn 1800 # run for 30 minutes

    During the test, watch the GPU temperature shown on screen. The RTX 3080 Ti is rated for temperatures up to 93 °C, and the non-critical upper limit is 90 °C. For spock02 the highest observed temperature was 81 °C.

  3. MPI suit test [Run as regular user]

    1. Download @spock00 : git clone https://github.com/open-mpi/mpi-test-suite.git
    2. Compile @spock00 : cd mpi-test-suite; ./autogen.sh; ./configure CC=mpicc; make
    3. Run tests : cp /work1/shared/tests/mpi_test_suite/run_test.sh ./; qsub -I -lnodes=spockXX:ppn=32; cd {directory of mpi_test_suite}; sh run_test.sh >& spockXX.log
    4. Check the test result : tail spockXX.log # Number of failed tests: 0
  4. gamer test

    1. Single node MPI
      cd /work1/shared/spock/tests/gamer/
    2. Multiple node MPI (with infiniband)
      cd /work1/shared/spock/tests/gamer/
xuanweishan commented 1 year ago

Infiniband Switch

Initialization

  1. Plug in both power cables and wait until all system status LEDs are solid green.

  2. Connect a host PC (e.g., spock00) to the console (RJ-45) port of the switch using the supplied RJ-45-to-DB9 cable + DB9-to-USB cable

  3. Login with the ubuntu PC

    1. Get the USB device name : ls /dev/ttyUSB*

      If only one USB device is plugged into the PC, it will show ttyUSB0

    2. Connect to the switch with root privileges : screen /dev/ttyUSB0 115200 and press Enter twice.
    3. Login:
      Username: admin
      Password: admin
  4. Configuration (the questions below are asked at the first connection)

    Do you want to use the wizard for initial configuration? yes
    Step 1: Hostname? [switch-d79b5a]
    Step 2: Use DHCP on mgmt0 interface? [yes] no
    Step 3: Use zeroconf on mgmt0 interface [no]
    Step 4: Primary IPv4 address and masklen? [0.0.0.0/0] 192.168.0.100/24
    Step 5: Default gateway? 192.168.0.1
    Step 6: Primary DNS server? 140.112.254.4
    Step 7: Domain name?
    Step 8: Enable IPv6? [yes]
    Step 9: Enable IPv6 autoconfig (SLAAC) on mgmt0 interface? [no]
    Step 10: Enable DHCPv6 on mgmt0 interface? [yes] no
    Step 11: Admin password (Must be typed)? #set it the same as spock
    Step 11: Confirm admin password?
    Step 12: Monitor password (Must be typed)? #same as admin password
    Step 12: Confirm monitor password?

    To rerun the configuration wizard : enable, configure terminal, configuration jump-start

  5. Check

    1. System version show version
      Product name:      MLNX-OS
      Product release:   3.8.2102
      Build ID:          #1-dev
      Build date:        2019-11-26 21:48:40
      Target arch:       x86_64
      Target hw:         x86_64
      Built by:          jenkins@c776fa44be2b
      Version summary:   X86_64 3.8.2102 2019-11-26 21:48:40 x86_64
      Product model:     x86onie
      Host ID:           043F72D79B5A
      System serial num: MT2039J30791
      System UUID:       f73a8370-1456-11eb-8000-043f72d00e66
      Uptime:            18h 12m 33.108s
      CPU load averages: 3.11 / 3.05 / 3.01
      Number of CPUs:    4
      System memory:     468 MB used / 7333 MB free / 7801 MB total
      Swap:              0 MB used / 0 MB free / 0 MB total       
    2. mgmt0 interface
      enable
      show interfaces mgmt0
      Interface mgmt0 status:
      Comment         :
      Admin up        : yes
      Link up         : yes
      DHCP running    : no
      IP address      : 192.168.0.100
      Netmask         : 255.255.255.0
      IPv6 enabled    : yes
      Autoconf enabled: no
      Autoconf route  : yes
      Autoconf privacy: no
      DHCPv6 running  : no
      IPv6 addresses  : 1
  6. Enable OpenSM

    1. enable
    2. configure terminal
    3. ib smnode switch-d79b5a enable
    4. show ib sm
      enable
    5. no configure
  7. Logout and exit with [CTRL + A] and [CTRL + K]

Rerun initialization

  1. Login to switch (w/ console port or ssh)
  2. enable
  3. configure terminal
  4. configuration jump-start

Enable OpenSM

  1. enable
  2. configure terminal
  3. ib smnode switch-d79b5a enable
  4. show ib sm
    enable
  5. no configure

SSH

The following errors appear when trying to ssh to the switch from Ubuntu 22.04:

  Unable to negotiate with 192.168.0.100 port 22: no matching key exchange method found. Their offer: diffie-hellman-group14-sha1
  Unable to negotiate with 192.168.0.100 port 22: no matching host key type found. Their offer: ssh-rsa

  1. Add these lines at the end of the file /etc/ssh/ssh_config
    KexAlgorithms=+diffie-hellman-group14-sha1
    HostKeyAlgorithms=+ssh-rsa
  2. Restart ssh service service ssh restart
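Rather than enabling the legacy algorithms globally, they can be scoped to the switch alone with a Host block in /etc/ssh/ssh_config (a sketch; the IP is the switch mgmt0 address configured above):

```
Host 192.168.0.100
    KexAlgorithms +diffie-hellman-group14-sha1
    HostKeyAlgorithms +ssh-rsa
```

This keeps the stronger defaults for every other host on the cluster.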
xuanweishan commented 1 year ago

Nas

  1. item1:
    1. 1123