Install the system on RAID 1. Ref. https://www.thegeekdiary.com/how-to-install-centos-rhel-7-on-raid-1-partition/
- Installation system
- Installation destination
  - Choose both disks on which we want to build the RAID.
  - Choose "I will configure partitioning".
  - Click "Click here to create them automatically".
  - Choose /home and click "-" to delete it.
  - Select "Device Type" to be RAID, choose "RAID LEVEL" to RAID1, and write your personal label, such as md0.
  - Similarly, set the Device Type and labels for the swap and /boot partitions.
- Installation summary:
- [x] Additional Development
- [x] Compatibility Libraries
- [x] Development Tools
- [x] Email Server
- [x] Emacs
- [x] File and Storage Server
- [x] Hardware Monitoring Utilities
- [x] Identity Management Server
- [x] Infiniband Support
- [x] KDE
- [x] Legacy X Window System Compatibility
- [x] Network File System Client
- [x] Platform Development
- [x] Python
- [x] Technical Writing
- [x] Security Tools
- [x] System Administration Tools
- Set up network
Edit /etc/sysconfig/network-scripts/ifcfg-XXX, where XXX is usually "enpXXX":
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.0.200
GATEWAY=192.168.0.1
NETMASK=255.255.255.0
DNS1=140.112.254.4
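To apply and verify the static configuration, a minimal sketch (assuming the legacy network service on CentOS 7 and an interface name of enpXXX):
```
systemctl restart network   # reload the ifcfg-* files
ip addr show dev enpXXX     # the interface should now carry 192.168.0.200/24
ping -c 3 192.168.0.1       # check gateway reachability
```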
- NFS client
- Mount all remote folders
Do not mount /software on the login node (it exports /software itself; see the NFS server section below):
mount /software; # skip on the login node
mount /home; mount /work1; mount /project; mount /projectX; mount /projectY; mount /projectZ
- Install the CUDA driver
- Copy initialization scripts
```
cp /work1/shared/spock/init_script/*.sh /etc/profile.d/
cp /work1/shared/spock/init_script/*.csh /etc/profile.d/
cp /work1/shared/spock/etc/rc.local /etc/rc.d/
chmod +x /etc/rc.d/rc.local
```
Edit `/etc/rc.d/rc.local` as follows
1. Comment out the line `/usr/bin/nvidia-persistenced --verbose`
2. Comment out the line `nvidia-cuda-mps-control -d`
3. Replace `nvidia-smi -i 0 -c EXCLUSIVE_PROCESS` by `nvidia-smi -i 0 -c PROHIBITED`
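These three edits can also be scripted; a minimal sketch using sed (paths as above, patterns assumed to match the copied rc.local exactly):
```
# comment out the persistence and MPS daemons
sed -i 's|^/usr/bin/nvidia-persistenced --verbose|#&|' /etc/rc.d/rc.local
sed -i 's|^nvidia-cuda-mps-control -d|#&|' /etc/rc.d/rc.local
# switch the compute mode from EXCLUSIVE_PROCESS to PROHIBITED
sed -i 's|EXCLUSIVE_PROCESS|PROHIBITED|' /etc/rc.d/rc.local
```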
TORQUE
cd /work1/shared/spock/package/torque/src/torque-3.0.6
# Login node only: edit Fish_Install.sh to enable --enable-server
# WARNING: do NOT run Fish_Install.sh in parallel (i.e., install one node at a time)
sh Fish_Install.sh >& log.spockXX
cd ../../etc
cp pbs /etc/init.d/
Edit pbs.conf to set start_server=1 and start_mom=0
cp pbs.conf /etc/
cp nodes_spock /var/spool/TORQUE/server_priv/nodes
systemctl enable pbs
source /etc/profile.d/torque.sh
cd ../src/torque-3.0.6/
./torque.setup root
killall pbs_server
systemctl start pbs # This error message is fine: "LOG_ERROR::No such file or directory (2) in read_config, fstat: config"
systemctl status pbs
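Once pbs is running, the standard TORQUE client tools can confirm the server came up correctly (a quick sanity check, not part of the original notes):
```
qmgr -c 'print server'   # dump the server configuration created by torque.setup
pbsnodes -a              # list the nodes registered in server_priv/nodes
qstat -q                 # list the available queues
```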
[Optional] Create the SSH key of root
ssh-keygen -t rsa
cd ~/.ssh
cp id_rsa.pub authorized_keys
cp id_rsa* authorized_keys /work1/shared/spock/ssh_root/
NFS server
systemctl enable nfs
systemctl start nfs
cp /work1/shared/spock/etc/exports /etc/
exportfs -ra
showmount -e spock00 # /software 192.168.0.0/24
Comment out spock00:/software in /etc/fstab
CUDA
Driver
It should have been installed when following [[System Installation: Computing Node | System Installation: Computing Node]]
Libraries and samples
mkdir /software/cuda /software/cuda/12.1
ln -s /software/cuda/12.1 /software/cuda/default
Install 12.1
cd /work1/shared/spock/package/cuda/
sh cuda_12.1.0_530.30.02_linux.run --silent --toolkit --installpath=/software/cuda/12.1
WIP
[optional] cuDNN
[Optional] Download the latest version
https://developer.nvidia.com/cudnn --> Download cuDNN
- Installation (the following example adopts CUDA 10.1)
CUDNN_TMP=/work1/fish/Code_Backup/GPU_Cluster_Setup/eureka/package/cudnn/cudnn-10.1-linux-x64-v7.6.5.32
CUDA_TMP=/software/cuda/10.1
cp ${CUDNN_TMP}/include/cudnn.h ${CUDA_TMP}/include
cp ${CUDNN_TMP}/lib64/libcudnn* ${CUDA_TMP}/lib64
chmod a+r ${CUDA_TMP}/include/cudnn.h ${CUDA_TMP}/lib64/libcudnn*
- Test: log in to a computing node first. Do NOT use a GNU compiler later than 8.
cd /tmp
export CUDA_PATH=/software/cuda/10.1
export PATH=$CUDA_PATH/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH
cp -r /work1/fish/Code_Backup/GPU_Cluster_Setup/eureka/package/cudnn/cudnn_samples_v7 .
cd cudnn_samples_v7/mnistCUDNN
make clean && make -j 16
./mnistCUDNN # Test passed!
cd ../../
rm -rf cudnn_samples_v7
Test
cd /software/cuda/12.0/NVIDIA_CUDA-10.2_Samples
./1_Utilities/deviceQuery/deviceQuery
Verify that it reports Result = PASS
./1_Utilities/bandwidthTest/bandwidthTest
Verify that it reports Result = PASS
[Optional] Reset the default mode to graphical.target
(a.k.a. runlevel 5)
systemctl set-default graphical.target
systemctl get-default # graphical.target
Intel compiler
Create directory and link for intel compiler.
mkdir /software/intel
ln -s /software/intel /opt
1. oneAPI Basic & HPC toolkits
Install Basic toolkit
cd /software/intel
cp /work1/shared/spock/package/intel/l_BaseKit_p_2023.0.0.25537.sh ./
sudo bash l_BaseKit_p_2023.0.0.25537.sh
Install HPC toolkit
cp /work1/shared/spock/package/intel/l_HPCKit_p_2023.0.0.25400.sh ./
sudo bash l_HPCKit_p_2023.0.0.25400.sh
After installation
mv /opt/intel/licenses /opt/intel/.pset /software/intel
rm -rf /opt/intel
cd /opt
ln -s /software/intel
cd /software/intel
ln -s /work1/shared/spock/package/intel src
Check /etc/profile.d/intel.sh and, if necessary, replace /opt by /software
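A minimal sketch of that replacement (assuming intel.sh references the install prefix as /opt/intel):
```
# back up, then point the profile script at the NFS-shared prefix
cp /etc/profile.d/intel.sh /etc/profile.d/intel.sh.bak
sed -i 's|/opt/intel|/software/intel|g' /etc/profile.d/intel.sh
```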
Valgrind
mkdir -p /software/valgrind
cd /work1/shared/spock/package/valgrind/valgrind-3.15.0
# [Optional] Edit Fish_Install.sh
sh Fish_Install.sh >& log.spock
After installation
cd /software/valgrind
ln -s /work1/shared/spock/package/valgrind src
ln -s 3.15.0 default
Check /etc/profile.d/valgrind.sh
OpenMPI
mkdir -p /software/openmpi
cd /software/openmpi/src/openmpi-4.1.1
# [Optional] Edit Fish_Install_with_UCX.sh (remember to un-comment the configuration flags)
sh Fish_Install_with_UCX.sh >& log.spock-intel
After installation
Check the UCX libraries
cd /software/openmpi/4.1.1-intel-oneapi/bin
objdump -p mpicxx | grep PATH # see whether /software/openucx/ucx-1.12.0/lib is in RPATH
ldd mpicxx | grep ucx # see whether dynamic linker can find UCX libraries
cd /software/openmpi
ln -s /work1/shared/spock/package/openmpi src
unlink default # optional, if default already existed
ln -s /software/openmpi/4.1.1-intel-oneapi default
Check /etc/profile.d/openmpi.sh, then source it and verify:
ompi_info | grep "MCA memchecker" # MCA memchecker: valgrind (MCA v2.1.0, API v2.0.0, Component v3.1.5)
MCA parameters [Optional]
Edit /software/openmpi/4.1.1-intel-oneapi/etc/openmpi-mca-params.conf as root (to query the path of the configuration file, use: ompi_info --params mca all --level 9 | grep mca_param_files). Add the lines below (2021/07/24):
pml=ucx # include only ucx for pml
osc=ucx # include only ucx for osc
btl=^openib # exclude openib from btl
This works for OpenMPI 4.1.1 and UCX 1.12.0 without triggering the warning message:
[eureka01:45901] common_ucx.c:364 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
Note: exclude openib from btl if the UCX library is installed.
Ref: MCA parameters by configure file (10. How do I set the value of MCA parameters?, 4. Files)
Maui
cd /work1/fish/Code_Backup/GPU_Cluster_Setup/spock/package/maui/maui-3.3.1
# [Optional] Edit Fish_Install.sh
sh Fish_Install.sh >& log.spock
cd etc/
cp maui.d /etc/init.d/
cp maui.sh maui.csh /etc/profile.d/
systemctl enable maui.d
cd /usr/local/maui
Edit maui.cfg as follows (an example is put at maui-3.3.1/maui.cfg.eureka)
RMPOLLINTERVAL 00:00:15
#BACKFILLPOLICY FIRSTFIT
#RESERVATIONPOLICY CURRENTHIGHEST
#NODEALLOCATIONPOLICY MINRESOURCE
# <==== Add by Nelson ====>
JOBAGGREGATIONTIME 00:00:04
# Backfill
BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY NEVER
# Node Allocation
NODEALLOCATIONPOLICY FIRSTAVAILABLE
# Set Job Flags
JOBACTIONONNODEFAILURE CANCEL
JOBNODEMATCHPOLICY EXACTNODE
systemctl start maui.d
source /etc/profile.d/maui.sh
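To confirm Maui is talking to the TORQUE server, its client tools can be used (a quick check, not in the original notes):
```
showq         # should show an empty queue on a fresh install
diagnose -n   # per-node state as seen by the Maui scheduler
```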
Other packages
screen, pdsh
yum -y install screen
yum -y install pdsh
FFTW
mkdir -p /software/fftw
cd /work1/shared/spock/package/fftw/fftw-2.1.5-revised
# [Optional] Edit Fish_Install.sh
sh Fish_Install.sh >& log.spock-intel
After installation
cd /software/fftw
ln -s /work1/shared/spock/package/fftw src
ln -s 2.1.5-intel default
HDF5
mkdir -p /software/hdf5
cd /work1/shared/spock/package/hdf5/hdf5-1.10.6
# [Optional] Edit Fish_Install.sh
sh Fish_Install.sh >& log.spock
After installation
cd /software/hdf5
ln -s /work1/shared/spock/package/hdf5 src
ln -s 1.10.6 default
GSL
mkdir -p /software/gsl
cd /work1/shared/spock/package/gsl/gsl-2.6
# [Optional] Edit Fish_Install.sh
sh Fish_Install.sh >& log.spock
After installation
cd /software/gsl
ln -s /work1/shared/spock/package/gsl src
ln -s 2.6 default
gnuplot
mkdir -p /software/gnuplot
cd /work1/shared/spock/package/gnuplot/gnuplot-5.2.8
# [Optional] Edit Fish_Install.sh
sh Fish_Install.sh >& log.spock
After installation
cd /software/gnuplot
ln -s /work1/shared/spock/package/gnuplot src
ln -s 5.2.8 default
Latest GNU compiler
wget https://ftp.gnu.org/gnu/gcc/gcc-${GCC_VERSION}/gcc-${GCC_VERSION}.tar.gz
mkdir -p /software/gcc
cd /work1/shared/spock/package/gcc/gcc-9.3.0
# [Optional] Edit Fish_Install.sh
sh Fish_Install.sh >& log.spock
Ref: https://gcc.gnu.org/install/configure.html
https://gcc.gnu.org/install/
https://gist.github.com/nchaigne/ad06bc867f911a3c0d32939f1e930a11
After installation
cd /software/gcc
ln -s /work1/shared/spock/package/gcc src
ln -s 9.3.0 default
UCX library
mkdir -p /software/openucx
cd /software/openucx/
git clone https://github.com/openucx/ucx.git ucx
cd /software/openucx/ucx
./autogen.sh
mkdir build
cd build
../contrib/configure-release --prefix=/install_path_for_ucx --enable-mt #enable MPI_THREAD_MULTIPLE
make && make install
Ref: https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX#%20UCX%20installation
Ref: https://github.com/openucx/ucx/issues/5284
Enabling MPI_THREAD_MULTIPLE allows using parallel yt with UCX, without the need to exclude ucx from pml in the script.
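To verify the build, UCX ships the ucx_info tool under the install prefix (the placeholder path below mirrors the --prefix used above):
```
/install_path_for_ucx/bin/ucx_info -v        # print the UCX version and build options
/install_path_for_ucx/bin/ucx_info -d | head # list the transports/devices UCX detected
```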
Miscellaneous
Language
Add the following lines to /etc/environment
LANG=en_US.utf-8
LC_ALL=en_US.utf-8
Check
Applications > Utilities > Disks
It will show two disks with the same partition settings and four RAID-1 arrays under them.
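The same can be confirmed from a terminal via the kernel's software-RAID status file:
```
cat /proc/mdstat   # each md device should show [UU], i.e., both mirror members active
```
VGA switch -> off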
IPMI switch -> left (default)
PSU (PHANTEKS) hybrid -> press down
Update BIOS
Main -> BIOS Information -> Version
If Version = 1003 x64, skip the steps in `Update BIOS`.
Press delete or F2 during booting to get into the BIOS.
Tool -> ASUS EZ Flash 3 Utility
Choose PRO_WS_WRX80E-SAGE_SE_WIFI-ASUS-1003(1) -> PRO-WS-WRX80E-SAGE-SE-WIFI-ASUS-1003.CAP -> Yes
Save changes and exit, or press F10.
Main -> BIOS Information -> Version 1003 x64
Set up RAM frequency
Ai Tweaker -> Choose D.O.C.P
D.O.C.P -> Choose D.O.C.P DDR4-3200 16-18-18-38-1.35V
Main -> Total Memory: 262144 MB -> Speed: 3200 MHz
Enable NUMA
Advanced -> AMD CBS -> DF Common Option -> Memory Addressing
NUMA nodes per socket -> Choose NPS2
Set up boot disk
Boot -> Choose USB to boot
Ref. https://github.com/calab-ntu/eureka/wiki/System-Installation%3A-Computing-Node#2-install-centos-7
Boot up > Test this media & install CentOS 7
Installation summary:
Select INTEL SSD
Partitioning
-> I will configure partitioning
-> Done
-> Click here to create them automatically
-> Verify that the total space is about 500 GB (**** GiB in my test)
-> /home: click - to remove it since we will mount a remote home
-> /: 448 GiB
-> swap: 16 GiB
-> 576.99 M left.
-> Done
-> Accept Changes
Ethernet
-> ON
Host name (on the bottom left corner): spockXX -> Apply
The installation will take ~10 min
Accept the EULA agreement
-> FINISH CONFIGURATION
Location Services
: OFF
Time Zone
: Taipei, Taiwan
Connect Your Online Accounts
: Skip
About You
: (this account will be removed later)
Full Name
: tmp_account
Username
: tmp_account
Password: ****
uname -r # 3.10.0-1062.el7.x86_64
Edit /etc/sysconfig/network-scripts/ifcfg-XXX, where XXX is usually "enpXXX":
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.0.2XX # Replace XX by the node number (node number = 00 for the
login node)
GATEWAY=192.168.0.1
NETMASK=255.255.255.0
DNS1=140.112.254.4
Disable virbr0
--> It prevents MPI from working properly
--> Ref: https://www.thegeekdiary.com/centos-rhel-67-how-to-disable-or-delete-virbr0-interface/
virsh net-autostart default --disable
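A minimal follow-up per the linked guide (net-destroy removes the running bridge immediately; autostart stays off after reboot):
```
virsh net-destroy default   # tear down the active virbr0 bridge
ifconfig                    # virbr0 should no longer appear
```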
Disable SELinux and firewall (necessary for torque)
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/g' /etc/selinux/config # or just edit "/etc/selinux/config" directly
systemctl stop firewalld
systemctl disable firewalld
sestatus # Check the SELinux status
# --> Before reboot, it should show
# SELinux status: enabled
# Current mode: permissive
# --> After reboot, it should show
# SELinux status: disabled
ping 192.168.0.200
ifconfig # no virbr0 interface should be listed
Remove the temporary user
userdel -r tmp_account
NFS client
scp 192.168.0.200:/work1/shared/pluto/etc/hosts /etc/hosts
ssh pluto00 cat /work1/shared/pluto/etc/fstab >> /etc/fstab
mkdir /software /work1 /project /projectX /projectY /projectZ
Check the accessibility of the target NFS servers:
showmount -e pluto00 # /software 192.168.0.0/24
showmount -e tumaz # /home 192.168.0.0/24
showmount -e ironman # /volume1/gpucluster1 192.168.0.0/24
# /volume2/gpucluster2 192.168.0.0/24
# /volume3/gpucluster3 192.168.0.0/24
showmount -e eater # /volume1/gpucluster3 192.168.0.0/24
# /volume2/gpucluster4 192.168.0.0/24
Mount all remote folders
mount /software; mount /home; mount /work1; mount /project; mount /projectX; mount /projectY; mount /projectZ
df # Check if all folders have been mounted
CUDA
Update system
yum -y update
Set the text mode as default (since the NVIDIA driver cannot be installed while X window is running)
systemctl set-default multi-user.target
Check it with
systemctl get-default # It should show "multi-user.target" instead of "graphical.target"
Reboot
Install EPEL and ELRepo
# [Optional] yum may fail due to incorrect local system time (which causes failure of certification)
# --> Solution 1: reset system time directly with, for example, 'date -s "2020-03-22 19:06:26"'
# --> Solution 2: enable NTP by following step 3-(5) below
yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum -y install https://www.elrepo.org/elrepo-release-7.0-4.el7.elrepo.noarch.rpm
Install dkms
yum -y install dkms
Disable the display driver nouveau
Add the following line to the end of GRUB_CMDLINE_LINUX
in /etc/default/grub
rd.driver.blacklist=nouveau nouveau.modeset=0
Alternatively, one can simply copy it from
cp /work1/shared/pluto/etc/grub /etc/default/grub
Execute the following
grub2-mkconfig -o /boot/grub2/grub.cfg
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist.conf
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
dracut /boot/initramfs-$(uname -r).img $(uname -r)
grub2-mkconfig -o /boot/grub2/grub.cfg
Reboot
Verify that the nouveau driver is not loaded
lsmod | grep nouveau # It should print nothing
Install the CUDA driver
[Optional] Disable persistence and MPS daemons if you are upgrading an existing driver
nvidia-smi -pm 0
echo quit | nvidia-cuda-mps-control
Install with
sh /work1/shared/pluto/package/cuda/cuda_12.1.0_530.30.02_linux.run --silent --driver
Validate with
cat /proc/driver/nvidia/version # NVIDIA driver: 530.30.02 gcc: 4.8.5
Copy initialization scripts
cp /work1/shared/pluto/init_script/*.sh /etc/profile.d/
cp /work1/shared/pluto/init_script/*.csh /etc/profile.d/
cp /work1/shared/eureka/etc/rc.local /etc/rc.d/
chmod +x /etc/rc.d/rc.local
[Optional] Comment out the line nvidia-cuda-mps-control -d
in /etc/rc.d/rc.local
to disable MPS
on nodes running tensorflow
on pre-Volta GPUs (e.g., eureka32
and eureka33
). It is because
stream callbacks are not supported on pre-Volta MPS clients;
see https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf.
[Optional] Edit /etc/rc.d/rc.local so that dual-GPU nodes expose both GPUs:
export CUDA_VISIBLE_DEVICES=0,1
nvidia-smi -i 0,1 -c EXCLUSIVE_PROCESS
Reboot
Validate the following
nvidia-smi -q | grep Persistence # Enabled
nvidia-smi -q | grep "Driver Version" # 525.85.12
nvidia-smi -q | grep "CUDA Version" # 12.0
nvidia-smi -q | grep "Compute Mode" # Exclusive_Process
ps -ef | grep mps # nvidia-cuda-mps-control -d
NIS client
Install
yum -y install ypbind
Configure
setup # [Authentication configuration] -> [Use NIS]
Domain: tumaz.gpucluster.calab
IP: tumaz
systemctl is-enabled ypbind # enabled
Do not create the perl5 directory by setting PERL_HOMEDIR=0 in /etc/profile.d/perl-homedir.sh
Reboot
Validate the following
yptest # "1 tests failed"
# "Test 9: yp_all" should list all accounts
# "Test 3" may fail with the message "WARNING: No such key in map (Map passwd.byname, key nobody)"
ypwhich # tumaz.gpucluster.calab
ls -l /home # Show correct user and group names instead of just UID and GID
# Log in using an existing user account
NTP client
yum -y install ntp
systemctl start ntpd
systemctl enable ntpd
systemctl is-enabled ntpd # enabled
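A quick way to confirm time synchronization once ntpd is up (standard ntp client query):
```
ntpq -p   # the selected upstream server is marked with '*' in the first column
```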
TORQUE
Install the required packages
yum -y install tcl-devel tk-devel
Install torque
cd /work1/shared/pluto/package/torque/src/torque-3.0.6
# WARNING: do NOT run "Fish_Install.sh" in parallel (i.e., install one node at a time)
sh Fish_Install.sh >& log.plutoXX
cd ../../etc
cp pbs /etc/init.d/
# Login node only: edit "pbs.conf" to set "start_server=1" and "start_mom=0"
cp pbs.conf /etc/
cp nodes_pluto /var/spool/TORQUE/server_priv/nodes
systemctl enable pbs
source /etc/profile.d/torque.sh
cd ../src/torque-3.0.6/
./torque.setup root
killall pbs_server
systemctl start pbs # This error message is fine: "LOG_ERROR::No such file or directory (2) in read_config, fstat: config"
systemctl status pbs
Check
cat /var/spool/TORQUE/pbs_environment # LANG=en_US.utf-8
InfiniBand
Check hardware
lspci -v | grep Mellanox
# 09:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
# Subsystem: Mellanox Technologies Device 0003
yum -y remove opa-libopamgt opa-address-resolution
cd /work1/shared/pluto/package/ib/adaptor/driver/MLNX_OFED_LINUX-5.7-1.0.2.0-rhel7.9-x86_64
./mlnxofedinstall -h # Print usage
./mlnxofedinstall # update firmware automatically [use this until new firmware release]
./mlnxofedinstall --fw-image-dir /tmp/my_fw_bin_files # specify the firmware (see above for the currently used version)
./mlnxofedinstall --without-fw-update # no firmware update
./mlnxofedinstall --add-kernel-support
dracut -f
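After the reboot that follows a driver (re)install, the standard OFED diagnostics can confirm the HCA is up (not part of the original notes):
```
ibstatus            # port state should be ACTIVE and phys state LinkUp
ibv_devinfo | head  # the ConnectX-5 HCA should be visible to the verbs layer
```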
Miscellaneous setup
IPMI driver and tool
yum -y install OpenIPMI ipmitool
Enable the ipmi service
systemctl enable ipmi.service
Check CPU temperature
ipmitool sensor get "CPU Temp."
gnuplot
, htop
yum -y install gnuplot htop
ssh without a password for root
cd /work1/shared/pluto/ssh_root/
cp authorized_keys id_rsa* /root/.ssh/
# Verification
ssh pluto00 # "yes" to "continue connecting"
ssh plutoXX # "yes" to "continue connecting"
exit
exit
Intel compiler
cd /opt
ln -s /software/intel
rm -rf ./rh # not related to Intel, actually
ffmpeg
rpm --import http://li.nux.ro/download/nux/RPM-GPG-KEY-nux.ro
rpm -Uvh http://li.nux.ro/download/nux/dextop/el7/x86_64/nux-dextop-release-0-5.el7.nux.noarch.rpm
yum -y install ffmpeg ffmpeg-devel
Time stamp of command history
su
Add export HISTTIMEFORMAT='%d/%m/%y %T ' to the end of /etc/profile
source /etc/profile
history
Python
Python2
[For current system]
sh /work1/shared/pluto/package/python2/install-python-packages.sh
[Optional] If the installation of mpi4py fails (which somehow freezes the screen and causes failure of yt after reboot),
# follow the steps below:
pip uninstall mpi4py
cd /usr/lib64/python2.7/site-packages
rm -rf yt yt-3.5.1.dist-info
pip install yt
pip install mpi4py
# check dependency
pip check ipython numpy cython jupyter h5py scipy astropy matplotlib yt mpi4py # No broken requirements found.
# check MPI
su YOUR_ACCOUNT
mpirun -n 16 python -m mpi4py.bench helloworld # Hello, World! I am process ? of 16 on plutoXX.
# check yt
cd /tmp
cp -r /work1/shared/eureka/test/yt ./test-yt
cd test-yt
sh test.sh # It should generate two PNG images Data_000000_Projection_z_density.png & Data_000001_Projection_z_density.png
cd ..
rm -rf test-yt
# check HDF5
python -c 'import h5py; print( "%s"%h5py.version.hdf5_version )' # 1.10.6
Python3
cd /work1/shared/pluto/python3/Python-3.9.10
./configure --enable-optimizations --enable-shared
make altinstall
cp libpython3.9.so.1.0 /usr/lib64/
Relink /usr/bin/python3 to the new python3.9:
rm /usr/bin/python3
ln -s /usr/local/bin/python3.9 /usr/bin/python3
pip3 install -U pip
pip3 install --ignore-installed pyparsing
pip3 install --upgrade dnspython
pip3 install --upgrade pyudev
pip3 install --ignore-installed --upgrade python-ldap
pip3 install --upgrade ipython
pip3 install --upgrade numpy
pip3 install --upgrade cython
pip3 install --upgrade six
pip3 install --upgrade urllib3
pip3 install idna==2.10
pip3 install --upgrade certifi
pip3 install chardet==4.0.0
pip3 install --upgrade jupyter
export HDF5_DIR=/software/hdf5/default
pip3 install --no-binary=h5py h5py
pip3 install --upgrade scipy
pip3 install --upgrade astropy
pip3 install --upgrade matplotlib
pip3 install --upgrade gitpython
pip3 install --upgrade pandas
cp -r /work1/shared/pluto/package/python3/yt /usr/local/lib/python3.9/
cd /usr/local/lib/python3.9
pip3 install -e .
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/software/gcc/default/lib64
python3 -c "import yt"
python3 -m pip install -U yt
yt version
pip3 install mpi4py
pip3 install dulwich
pip3 install girder-client
Check
NUMA
lscpu | grep NUMA # It should show **2** NUMA nodes
Memory frequency [Need to be confirmed]
dmidecode -t memory | grep Speed # It should show **Speed: 3200 MT/s; Configured Memory Speed: 2666 MT/s**
[Optional] Queuing system and GAMER performance
# log in to spock00
su your_account
qsub -I -lnodes=spockXX:ppn=16
cd /tmp
cp -r /work1/shared/eureka/test/gamer/bin/template/ tmp-gamer
cd tmp-gamer
mpirun -map-by ppr:4:socket:pe=4 --report-bindings ./gamer
awk 'NR>1 {print $7}' Record__Performance # Performance should be ~4.2e7
cd ..
rm -rf ./tmp-gamer/
exit
[Optional] CPU/GPU temperature
stress-ng
/work1/xuanshan/gpu-burn/run.sh
Malware scanner ClamAV
Install ClamAV: yum install -y clamav clamd clamav-update
freshclam
clamscan /etc/* /opt/* /boot/* /usr/* /tmp/* /sys/* /run/* /root/* /var/*
----------- SCAN SUMMARY -----------
Engine version: 0.103.5
Scanned directories: 5
Scanned files: 2774
Infected files: 0
Data scanned: 565.18 MB
Data read: 668.04 MB (ratio 0.85:1)
Time: 50.516 sec (0 m 50 s)
Start Date: 2022:05:12 10:21:31
End Date: 2022:05:12 10:22:22
If Infected files is 0, the scan passes.
Ref:
VGA switch -> off
IPMI switch -> left (default)
PSU (PHANTEKS) hybrid -> press down
Change the cooling fan header from CPU_OPT to CHA_FAN1
Unplug the micro-USB cable from the CPU pump.
Boot up the machine with the BIOS flash disk plugged in.
If the machine boots for the first time, it will ask whether to initialize the CPU configuration. Press Y to confirm.
Check BIOS version and update
Press delete or F2 during booting.
If Version = 1106 x64, skip the steps in `Update BIOS`.
Update BIOS
Press delete or F2 during booting to get into the BIOS.
Tool -> ASUS EZ Flash 3 Utility
Choose PRO_WS_WRX80E-SAGE_SE_WIFI-ASUS-1106 -> PRO-WS-WRX80E-SAGE-SE-WIFI-ASUS-1106.CAP -> Yes
Save changes and exit, or press F10.
DRAM overclock setting
Ai Tweaker
-> Ai overclock Tuner
-> Choose D.O.C.P
D.O.C.P
-> Choose D.O.C.P DDR4-3200 16-18-18-38-1.35V
Press F10 to reboot.
Main
-> Total Memory
: 262144 MB
-> Speed : 3200 MHz
Enable NUMA
Advanced
-> AMD CBS
-> DF Common Option
-> Memory Addressing
NUMA nodes per socket
-> Choose NPS2
F10
Reboot
Download Ubuntu 22.04 from https://www.ubuntu-tw.org/modules/tinyd0/ and make a bootable USB disk with Rufus.
Set up boot disk in BIOS
Press delete or F2 during booting.
Boot -> Choose USB to boot.
Press F10 to reboot.
Install Ubuntu 22.04
Try or install ubuntu server
English
-> Done
English (US)
English (US)
-> Done
Done
Continue without network
Done
Leave the field empty.
Done
Don't change the URL
Custom storage layout
-> Done
Set Use As Boot Device on both disks.
/boot:
- On each disk: free space -> Add GPT Partition -> leave unformatted -> Create
- Create software RAID: name md0, RAID 1
- On md0: Add GPT Partition -> ext4, mount at /boot -> Create
/: same steps as /boot, with the mount point changed to /
swap: choose the rest of the free space (leave the size empty to take the remaining volume) and format it as swap
-> Done -> Continue
** is the node number in the host name.
Skip Ubuntu Pro setup for now
-> Done
Done
Don't check the option.
Done
Reboot Now -> Unplug the install medium and press enter to reboot.
Check:
uname -r
-> 5.15.0-60-generic
lscpu | grep Model
-> AMD Ryzen Threadripper PRO 5975WX 32-Cores
sudo dmidecode -t memory | grep Speed
Configured Speed: 3200 MT/s
Speed: 2667 MT/s
lscpu | grep NUMA
NUMA node(s) : 2
NUMA node0 CPU(s) : 0-15, 32-47
NUMA node1 CPU(s) : 16-31, 48-63
Network settings.
sudo vim /etc/netplan/00-installer-config.yaml
# This is the network config written by 'subiquity'
network:
  ethernets:
    enp*****0:
      dhcp4: true
    enp*****1:
      dhcp4: false
      addresses: [192.168.0.2**/22] # ** is replaced by the node number
      nameservers:
        addresses: [140.112.254.4]
      routes:
        - to: default
          via: 192.168.0.1
  version: 2
sudo netplan apply
ip addr show dev enp*****1
inet 192.168.0.2**/22
resolvectl status
Link 3 (enp*****1)
Current Scopes: DNS
Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 140.112.254.4
DNS Servers: 140.112.254.4
ping 192.168.0.150
sudo -i
scp [your_account]@192.168.0.150:/work1/shared/spock/etc/hosts /etc/hosts
Update system
sudo -i
apt update
apt-get install -y linux-image-5.15.0-78-generic
Press enter twice when the kernel update UI appears, then reboot.
sudo -i
uname -r
5.15.0-78-generic # or above
groupmod --new-name calab tmp_account
passwd
Remove /home/tmp_account: rm -r /home/tmp_account
Change the sh link from dash to bash:
sudo dpkg-reconfigure dash
# The configuration UI will ask whether to set /usr/bin/sh to dash
# Press "No" to set /usr/bin/sh to bash
Time stamp of command history
su
Add export HISTTIMEFORMAT='%d/%m/%y %T ' to the end of /etc/profile
source /etc/profile
history
Set timezone
su
timedatectl set-timezone Asia/Taipei
timedatectl show
NFS settings
Client
sudo -i
apt -y install nfs-common
Append the fstab entries stored on work1:
ssh [your_account]@eureka00 cat /work1/shared/spock/etc/fstab >> /etc/fstab
[Login node only] Comment out the line starting with spock00:/software
mkdir /software /work1 /projectV /projectW /projectX /projectY /projectZ
showmount -e spock00 # /software 192.168.0.0/24 **[Skip on login node]**
showmount -e tumaz # /home 192.168.0.0/24
showmount -e ironman # /volume1/gpucluster1 192.168.0.0/24
# /volume3/gpucluster3 192.168.0.0/24
showmount -e eater # /volume1/gpucluster3 192.168.0.0/24
# /volume2/gpucluster4 192.168.0.0/24
# /volume3/gpucluster6 192.168.0.0/24
showmount -e pacific # /volume1/gpucluster1 192.168.0.0/24
mount /software; # skip on the login node
mount /home; mount /work1; mount /projectW; mount /projectX; mount /projectY; mount /projectZ; mount /projectV
Check : df -h
tumaz:/home 208G 22G 176G 12% /home
ironman:/volume1/gpucluster1 70T 47T 24T 67% /work1
ironman:/volume3/gpucluster3 70T 70T 643G 100% /projectX
eater:/volume1/gpucluster3 70T 67T 3.6T 95% /projectY
eater:/volume2/gpucluster4 88T 77T 12T 88% /projectZ
eater:/volume3/gpucluster6 88T 75T 13T 86% /projectW
pacific:/volume1/gpucluster1 140T 20T 120T 15% /projectV
sudo apt -y install nfs-kernel-server
ll /software # if /software does not exist, create it: mkdir /software
Copy /etc/exports: cp /work1/shared/spock/etc/exports /etc/exports
systemctl restart nfs-kernel-server.service
systemctl enable nfs-kernel-server.service
systemctl status nfs-kernel-server.service
# Active: active (exited)
showmount -e spock00
# /software 192.168.0.0/24
NIS settings
sudo apt -y install nis
vim /etc/yp.conf and add the following text at the end:
domain tumaz.gpucluster.calab server tumaz
vim /etc/nsswitch.conf
passwd: files systemd nis
group: files systemd nis
shadow: files nis
hosts: files dns nis
vim /etc/defaultdomain
tumaz.gpucluster.calab
systemctl restart ypbind
systemctl enable ypbind
ll /home
yptest # "1 tests failed" is expected
ypwhich # tumaz
su
Delete tmp_account: userdel --remove tmp_account
It is okay to receive these error messages:
userdel: tmp_account mail spool (/var/mail/tmp_account) not found
userdel: tmp_account home directory (/home/tmp_account) not found
Install GPU driver
systemctl set-default multi-user.target
su
apt -y install dkms
Disable nouveau: create the file /etc/modprobe.d/blacklist-nouveau.conf with the content:
blacklist nouveau
options nouveau modeset=0
update-initramfs -u
Reboot, then as root verify that nouveau is disabled:
lsmod | grep nouveau
This should print nothing.
Install nvidia driver
su
sh /work1/shared/spock/package/cuda/cuda_12.1.0_530.30.02_linux.run --silent --driver
cat /proc/driver/nvidia/version:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 530.30.02
GCC version: gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04)
Copy the default profile files.
cp /work1/shared/spock/init_script/*.sh /etc/profile.d/
cp /work1/shared/spock/init_script/*.csh /etc/profile.d/
cp /work1/shared/spock/etc/rc.local /etc/
chmod +x /etc/rc.local
Edit /etc/rc.local as follows:
1. Comment out the line /usr/bin/nvidia-persistenced --verbose
2. Comment out the line nvidia-cuda-mps-control -d
3. Replace nvidia-smi -i 0 -c EXCLUSIVE_PROCESS by nvidia-smi -i 0 -c PROHIBITED
nvidia-smi
NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1
NTP client
su
apt -y install ntp
systemctl status ntp
systemctl enable ntp
TORQUE
apt -y install libnuma-dev
apt -y install tcl-dev tk-dev
apt -y install libntirpc-dev
sh /work1/shared/spock/package/torque/src/torque-3.0.6/spock_library_set.sh
Compile and install from source code.
cd /work1/shared/spock/package/torque/src/torque-3.0.6
# WARNING: do NOT run "spock_Install.sh" in parallel (i.e., install one node at a time)
# [Login node ] uncomment "--enable-server"
# [Computing nodes] comment "--enable-server"
sh spock_Install.sh >& log.spockXX
cd ../../etc
cp pbs_spock /etc/init.d/pbs
ln -s /etc/init.d/pbs /etc/systemd/system/
cp pbs.conf /etc/
# [Login node only]: edit "pbs.conf" to set "start_server=1" and "start_mom=0"
cp nodes_spock /var/spool/TORQUE/server_priv/nodes
systemctl enable pbs
source /etc/profile.d/torque.sh
cd ../src/torque-3.0.6/
./torque.setup root
killall pbs_server
systemctl start pbs # This error message is fine: "LOG_ERROR::No such file or directory (2) in read_config, fstat: config"
systemctl status pbs
cat /var/spool/TORQUE/pbs_environment # LANG=en_US.utf-8
Set overcommit-ratio and disable overcommit-memory via crontab
cp /work1/shared/spock/helper_script/disable_memory_overcommit.sh /root/
crontab -e
and add a new line:
@reboot /usr/bin/sh /root/disable_memory_overcommit.sh 1> /tmp/disable_memory_overcommit.log 2>&1
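For reference, a hypothetical sketch of what such a helper script typically does (the actual disable_memory_overcommit.sh is site-specific and may differ):
```
#!/usr/bin/sh
# strict accounting: refuse allocations beyond swap + ratio% of RAM
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=100
```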
[Optional] [Login node only] Create the SSH key of root [Testing]
ssh-keygen -t rsa
cd ~/.ssh
cp id_rsa.pub authorized_keys
cp id_rsa* authorized_keys /work1/shared/spock/ssh_root/
InfiniBand
ref. https://docs.nvidia.com/networking/display/MLNXOSv3105002/Getting+Started#heading-RerunningtheWizard
lspci | grep Mellanox
01:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
apt -y install libsasl2-dev libldap2-dev libssl-dev
su
cd /work1/shared/spock/package/ib/adaptor/driver/MLNX_OFED_LINUX-5.9-0.5.6.0-ubuntu22.04-x86_64
./mlnxofedinstall
Device #1:
----------
Device Type: ConnectX6
Part Number: MCX653105A-HDA_Ax
Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
PSID: MT_0000000223
PCI Device Name: 01:00.0
Base GUID: 0c42a10300ef2a1a
Versions: Current Available
FW 20.34.1002 20.36.1010
PXE 3.6.0700 3.6.0901
UEFI 14.27.0014 14.29.0014
Status: Up to date
---------
/etc/init.d/openibd restart
reboot
su
ibstatus
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:0c42:a103:00ef:2a1a
base lid: 0xffff
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: InfiniBand
cat /etc/security/limits.conf
* soft memlock unlimited
* hard memlock unlimited
systemctl status openibd
Active: active (exited)
systemctl is-enabled openibd
enabled
systemctl status opensmd
Active: inactive (dead)
systemctl is-enabled opensmd
disabled
hca_self_test.ofed
---- Performing Adapter Device Self Test ----
Number of CAs Detected ................. 1
PCI Device Check ....................... PASS
Kernel Arch ............................ x86_64
Host Driver Version .................... MLNX_OFED_LINUX-5.9-0.5.6.0 (OFED-5.9-0.5.6): 5.15.0-69-generic
Host Driver RPM Check .................. PASS
Firmware on CA #0 HCA .................. v20.36.1010
Host Driver Initialization ............. PASS
Number of CA Ports Active .............. 1
Port State of Port #1 on CA #0 (HCA)..... UP 4X HDR (InfiniBand)
Error Counter Check on CA #0 (HCA)...... PASS
Kernel Syslog Check .................... PASS
Node GUID on CA #0 (HCA) ............... 0c:42:a1:03:00:ef:2a:1a
------------------ DONE ---------------------
ibdev2netdev -v | grep -i MCX
0000:01:00.0 mlx5_0 (MT4123 - MCX653105A-HDAT) ConnectX-6 VPI adapter card, HDR IB (200Gb/s) and 200GbE, single-port QSFP56
fw 20.36.1010 port 1 (ACTIVE) ==> ibp1s0 (Down)
On spock00:
ib_write_bw -aF
On spockXX:
ib_write_bw -aF spock00
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x02 QPN 0x0027 PSN 0xcb8c4 RKey 0x1fffbe VAddr 0x007f9c96aaa000
remote address: LID 0x01 QPN 0x0027 PSN 0x560b74 RKey 0x1fffbe VAddr 0x007f0894517000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
8388608 5000 23452.55 23452.55 0.002932
---------------------------------------------------------------------------------------
On spock00:
ib_read_bw -aF
On spockXX:
ib_read_bw -aF spock00
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x02 QPN 0x0028 PSN 0x593c01 OUT 0x10 RKey 0x1fffbf VAddr 0x007efc3f67f000
remote address: LID 0x01 QPN 0x0028 PSN 0xbaa0aa OUT 0x10 RKey 0x1fffbf VAddr 0x007f6fd2a85000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
8388608 1000 23517.75 23517.73 0.002940
---------------------------------------------------------------------------------------
systemctl enable mst
systemctl start mst
mst status
ssh without password for the root
cd /work1/shared/spock/ssh_root/
cp authorized_keys id_rsa* /root/.ssh/
# Verification
ssh spock00 # "yes" to "continue connecting"
ssh spockXX # "yes" to "continue connecting"
exit
exit
Intel compiler
su
mkdir /software/intel
ln -s /software/intel /opt
cd /work1/shared/spock/package/intel
sh l_BaseKit_p_2023.1.0.46401.sh -a --cli
Follow and accept the installation process.
sh l_HPCKit_p_2023.1.0.46346.sh -a --cli
Follow and accept the installation process.
su
cd /opt
ln -s /software/intel
gcc compiler [skip]
su
mkdir /software/gcc
cd /work1/shared/spock/package/gcc/gcc-12.2.0
sh ./spock_Install.sh >& log.spock
cd /software/gcc
ln -s /work1/shared/spock/package/gcc ./src
ln -s 12.2.0 default
[Login node only] CUDA
cd /work1/shared/spock/package/cuda
mkdir /software/cuda
sh cuda_12.1.0_530.30.02_linux.run --silent --toolkit --installpath=/software/cuda/12.1
ln -s /software/cuda/12.1 /software/cuda/default
[Login node only] Valgrind
mkdir /software/valgrind
cd /work1/shared/eureka/package/valgrind/valgrind-3.15.0
# [Optional] Edit Fish_Install.sh
sh Fish_Install.sh >& log.spock
After installation
cd /software/valgrind
ln -s /work1/shared/spock/package/valgrind src
ln -s 3.15.0 default
[Login node only] UCX Library
mkdir /software/openucx
mkdir /software/openucx/src
cd /software/openucx/src
git clone https://github.com/openucx/ucx.git ucx
cd /software/openucx/src/ucx
./autogen.sh
mkdir build
cd build
../contrib/configure-release --prefix=/software/openucx/ucx-1.15.0_with_mt --enable-mt #enable MPI_THREAD_MULTIPLE
make && make install
[Login node only] OpenMPI
source /etc/profile.d/intel.sh
mkdir /software/openmpi
ln -s /work1/shared/spock/package/openmpi /software/openmpi/src
cd /software/openmpi/src/openmpi-4.1.5
# [Optional] Edit spock_Install_with_UCX.sh (remember to un-comment the configuration flags)
sh spock_Install_with_ucx.sh >& log.spock-all
After installation, check the UCX libraries:
cd /software/openmpi/4.1.5-ucx_mt-intel-2023.1.0/bin
objdump -p mpicxx | grep PATH # see whether /software/openucx/ucx-1.15.0_with_mt/lib is in RPATH
ldd mpicxx | grep ucx # see whether dynamic linker can find UCX libraries
cd /software/openmpi
unlink default # optional, if default already existed
ln -s /software/openmpi/4.1.5-ucx_mt-intel-2023.1.0 default
/etc/profile.d/openmpi.sh
source /etc/profile.d/openmpi.sh
ompi_info | grep "MCA memchecker" # MCA memchecker: valgrind (MCA v2.1.0, API v2.0.0, Component v3.1.5)
MCA parameters
Edit /software/openmpi/4.1.1-intel-oneapi/etc/openmpi-mca-params.conf as root (to query the path of the configuration file, use: ompi_info --params mca all --level 9 | grep mca_param_files). Add the lines below (2021/07/24):
pml=ucx # include only ucx for pml
osc=ucx # include only ucx for osc
btl=^openib # exclude openib from btl
This works for OpenMPI 4.1.1 and UCX 1.12.0 without triggering the warning message:
[eureka01:45901] common_ucx.c:364 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
[Login node only] Maui [testing] [problematic with the sed and gcc versions]
Install sed 4.2.2
cd /work1/shared/spock/package/sed/sed-4.2.2
sh spock_Install.sh
Install maui
cd /work1/shared/spock/package/maui/maui-3.3.1/
# [Optional] Edit Fish_Install.sh
sh Fish_Install.sh >& log.spock
cd etc/
cp spock_maui.d /etc/init.d/maui.d
cp maui.sh maui.csh /etc/profile.d/
systemctl enable maui.d
cd /usr/local/maui
Edit maui.cfg as follows (an example is put at maui-3.3.1/maui.cfg.eureka)
RMPOLLINTERVAL 00:00:15
#BACKFILLPOLICY FIRSTFIT
#RESERVATIONPOLICY CURRENTHIGHEST
#NODEALLOCATIONPOLICY MINRESOURCE
# <==== Add by Nelson ====>
JOBAGGREGATIONTIME 00:00:04
# Backfill
BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY NEVER
# Node Allocation
NODEALLOCATIONPOLICY FIRSTAVAILABLE
# Set Job Flags
JOBACTIONONNODEFAILURE CANCEL
JOBNODEMATCHPOLICY EXACTNODE
systemctl start maui.d
source /etc/profile.d/maui.sh
[Login node only] FFTW
mkdir /software/fftw
cd /work1/shared/spock/package/fftw/fftw-2.1.5-revised
# [Optional] Edit Fish_Install.sh to install in intel or gcc
sh Fish_Install.sh >& log.spock-intel
After installation
cd /software/fftw
ln -s /work1/shared/spock/package/fftw src
ln -s 2.1.5-intel-2023.1.0-openmpi-4.1.1-ucx_mt default
cd /work1/shared/eureka/package/fftw/fftw-3.3.10
# [Optional] Edit spock_Install.sh
sh spock_Install.sh >& log.spock-intel
After installation
cd /software/fftw
ln -s 3.3.10-intel-2023.1.0-openmpi-4.1.1-ucx_mt default3
[Login node only] HDF5
mkdir -p /software/hdf5
cd /work1/shared/spock/package/hdf5/hdf5-1.10.6
# [Optional] Edit Fish_Install.sh
sh Fish_Install.sh >& log.spock
After installation
cd /software/hdf5
ln -s /work1/shared/spock/package/hdf5 src
ln -s 1.10.6 default
[Login node only] GSL
mkdir -p /software/gsl
cd /work1/shared/spock/package/gsl/gsl-2.6
# [Optional] Edit Fish_Install.sh
sh Fish_Install.sh >& log.spock
After installation
cd /software/gsl
ln -s /work1/shared/spock/package/gsl src
ln -s 2.6 default
python2
source /etc/profile.d/openmpi.sh; source /etc/profile.d/intel.sh; source /etc/profile.d/hdf5.sh
apt -y install python2 python2-dev
apt -y install python-tk
cd /work1/shared/spock/package/python2
python2 get-pip.py
sh install-python-packages.sh
python3
apt -y install python3 python3-dev
apt -y install python3-tk
apt -y install python3-pip
cd /work1/shared/spock/package/python3
sh install-python-packages.sh
Add /usr/local/bin
to PATH
by adding a line at the end of /etc/profile
export PATH=/usr/local/bin:$PATH
Module
cd /work1/shared/spock/package/module/modules-5.1.1
make clean
./configure
make
make install
After installation
cp init/profile.sh /etc/profile.d/10-modules.sh
cp init/profile.csh /etc/profile.d/modules.csh
source init/bash
Add /software/intel/oneapi/modulefiles
to default module directories by adding the line to the file /usr/local/Modules/etc/initrc
module use /software/intel/oneapi/modulefiles
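A quick usage check once the initrc change is in place (module names under the oneAPI directory depend on the installed toolkits; `compiler` is a typical one):
```
module avail            # the oneAPI modulefiles should now be listed
module load compiler    # example: load the oneAPI compiler module
```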
Set up preload module
ln -s /software/modulefiles/default_modules.sh /etc/profile.d/default_modules.sh
IPMI tool
apt -y install openipmi ipmitool
ipmitool sensor get "CPU Temp."
ffmpeg
apt -y install ffmpeg
gnuplot
apt -y install gnuplot-x11
screen
apt -y install screen
pdsh
apt -y install pdsh
locate
apt -y install plocate
ClamAV
apt -y install clamav clamav-daemon
systemctl stop clamav-freshclam
freshclam
systemctl start clamav-freshclam
systemctl enable clamav-freshclam
X11 server
apt -y install xorg openbox
CPU usage monitor
apt -y install sysstat
Image display feh
apt -y install feh
Disable auto update.
Edit the apt config file /etc/apt/apt.conf.d/20auto-upgrades as follows:
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
apt-config dump APT::Periodic::Update-Package-Lists
apt-config dump APT::Periodic::Unattended-Upgrade
CPU burn-in test
apt -y install stress-ng
stress-ng --cpu 0 --timeout 30m &
for i in {1..40}; do ipmitool sensor | grep "CPU Temp."; sleep 1m; done
AMD Threadripper allows temperatures up to 95 degrees Celsius, and the non-critical upper limit is 85 degrees.
For spock02, the highest temperature observed was 82 degrees.
GPU burn-in test
cd /work1/shared/spock/tests/gpu_burn-in/gpu-burn
./gpu_burn 1800 # run for 30 minutes
During the test, watch the GPU temperature shown on screen.
For the RTX 3080 Ti, the highest temperature is 93 degrees Celsius, and the non-critical upper limit is 90 degrees.
For spock02, the highest temperature observed was 81 degrees.
MPI suit test [Run as regular user]
git clone https://github.com/open-mpi/mpi-test-suite.git
cd mpi-test-suite
./autogen.sh
./configure CC=mpicc
make
cp /work1/shared/tests/mpi_test_suite/run_test.sh ./
qsub -I -lnodes=spockXX:ppn=32
cd {directory of mpi_test_suite}
sh run_test.sh >& spockXX.log
tail spockXX.log
# Number of failed tests: 0
gamer test
cd /work1/shared/spock/tests/gamer/
Plug both power cables and wait for all system status led bright solid green.
Connect a host PC (e.g., spock00) to the console (RJ-45) port of the switch using the supplied RJ-45-to-DB9 cable + DB9-to-USB cable
Log in from the Ubuntu PC
ls /dev/ttyUSB*
If there is only one USB device plugged into the PC, it will show ttyUSB0
With su privilege, run screen /dev/ttyUSB0 115200 and press enter twice.
Username: admin
Password: admin
Configuration (Below question will be ask at the first connection)
Do you want to use the wizard for initial configuration? yes
Step 1: Hostname? [switch-d79b5a]
Step 2: Use DHCP on mgmt0 interface? [yes] no
Step 3: Use zeroconf on mgmt0 interface [no]
Step 4: Primary IPv4 address and masklen? [0.0.0.0/0] 192.168.0.100/24
Step 5: Default gateway? 192.168.0.1
Step 6: Primary DNS server? 140.112.254.4
Step 7: Domain name?
Step 8: Enable IPv6? [yes]
Step 9: Enable IPv6 autoconfig (SLAAC) on mgmt0 interface? [no]
Step 10: Enable DHCPv6 on mgmt0 interface? [yes] no
Step 11: Admin password (Must be typed)? #set it the same as spock
Step 11: Confirm admin password?
Step 12: Monitor password (Must be typed)? #same as admin password
Step 12: Confirm monitor password?
If you need to re-run the configuration wizard:
enable
config terminal
configuration jump-start
Check
show version
Product name: MLNX-OS
Product release: 3.8.2102
Build ID: #1-dev
Build date: 2019-11-26 21:48:40
Target arch: x86_64
Target hw: x86_64
Built by: jenkins@c776fa44be2b
Version summary: X86_64 3.8.2102 2019-11-26 21:48:40 x86_64
Product model: x86onie
Host ID: 043F72D79B5A
System serial num: MT2039J30791
System UUID: f73a8370-1456-11eb-8000-043f72d00e66
Uptime: 18h 12m 33.108s
CPU load averages: 3.11 / 3.05 / 3.01
Number of CPUs: 4
System memory: 468 MB used / 7333 MB free / 7801 MB total
Swap: 0 MB used / 0 MB free / 0 MB total
enable
show interfaces mgmt0
Interface mgmt0 status:
Comment :
Admin up : yes
Link up : yes
DHCP running : no
IP address : 192.168.0.100
Netmask : 255.255.255.0
IPv6 enabled : yes
Autoconf enabled: no
Autoconf route : yes
Autoconf privacy: no
DHCPv6 running : no
IPv6 addresses : 1
Enable OpenSM
enable
configure terminal
ib smnode switch-d79b5a enable
show ib sm
enable
no configure
Logout and exit with [CTRL + A] and [CTRL + K]
Unable to negotiate with 192.168.0.100 port 22: no matching key exchange method found. Their offer: diffie-hellman-group14-sha1
Unable to negotiate with 192.168.0.100 port 22: no matching host key type found. Their offer: ssh-rsa
The above error messages appear when trying to ssh to the switch from Ubuntu 22.04.
- Add the following lines at the end of the file /etc/ssh/ssh_config
KexAlgorithms=+diffie-hellman-group14-sha1
HostKeyAlgorithms=+ssh-rsa
- Restart ssh service
service ssh restart
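Alternatively, the same options can be passed per connection without editing ssh_config (standard OpenSSH client flags):
```
ssh -o KexAlgorithms=+diffie-hellman-group14-sha1 \
    -o HostKeyAlgorithms=+ssh-rsa admin@192.168.0.100
```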
Installation
Installation of hardware
CPU: AMD Threadripper Pro 5975WX
MB: ASUS Pro WS WRX80E SAGE SE WIFI (temporary)
RAM: G.Skill F4-3200C16-32GVK * 8
GPU: MSI RTX 3080 Ti OC
SSD: 500G
PSU: 1200W
x. Install CUDA driver
sh /work1/xuanshan/cuda_11.7.1_515.65.01_linux.run --silent --driver
X. InfiniBand driver
./mlnxofedinstall --force
Temperature performance
CPU stress test
GPU stress test
/work1/xuanshan/gpu-burn/gpu_burn