Closed rkube closed 1 year ago
Can one of the admins verify this patch?
To preprocess the dataset on Traverse I need to limit the number of threads used for preprocessing: https://github.com/PPPLDeepLearning/plasma-python/issues/82
There are 44 cores on a node of Traverse, right? Any reason why we can only spawn 32 threads?
Also, I am in favor of not changing the default conf.yaml to make it specific to Princeton-based systems. So:
fs_path: '/Users/'
...
max_cpus: -1
(/Users/ isn't an ideal default, but it is generic enough. Maybe it should be set to $HOME; I would need to check the parsing logic.)
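For illustration, here is a minimal sketch of how these two defaults could be resolved at load time. It assumes, without having checked the actual parsing logic, that a max_cpus of -1 means "use every CPU the OS reports" and that environment variables such as $HOME in fs_path should be expanded. The helper names resolve_max_cpus and resolve_fs_path are hypothetical and are not part of the repository.

```python
import os


def resolve_max_cpus(max_cpus, available=None):
    """Hypothetical helper: a non-positive max_cpus is read as 'no limit'."""
    if available is None:
        # os.cpu_count() reports logical CPUs and may return None.
        available = os.cpu_count() or 1
    if max_cpus is None or max_cpus <= 0:
        return available
    # Never ask for more workers than the machine reports.
    return min(max_cpus, available)


def resolve_fs_path(fs_path):
    """Hypothetical helper: expand $HOME-style variables and a leading '~'."""
    return os.path.expanduser(os.path.expandvars(fs_path))


if __name__ == "__main__":
    print(resolve_max_cpus(-1))
    print(resolve_fs_path("$HOME"))
```

With helpers like these, a generic default (max_cpus: -1, fs_path: '$HOME') would work unchanged on any system, and a site-specific limit could live in a local copy of the config.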
Each Traverse node has 2 processors, 16 cores per processor, and 4 threads per core. When I run preprocessing with 126 threads it starts off well but throws errors after a while. Maybe it is running into memory limits?
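To make the arithmetic behind these numbers explicit, here is a small sketch. The function names are made up for illustration; the only inputs are the topology figures quoted above.

```python
def physical_cores(sockets, cores_per_socket):
    """Number of physical cores on a node."""
    return sockets * cores_per_socket


def logical_cpus(sockets, cores_per_socket, threads_per_core):
    """Number of hardware threads (logical CPUs) on a node."""
    return physical_cores(sockets, cores_per_socket) * threads_per_core


# Topology as described for a Traverse node.
SOCKETS, CORES_PER_SOCKET, THREADS_PER_CORE = 2, 16, 4

print(physical_cores(SOCKETS, CORES_PER_SOCKET))                  # 32
print(logical_cpus(SOCKETS, CORES_PER_SOCKET, THREADS_PER_CORE))  # 128
```

So 126 workers occupy nearly all 128 hardware threads, which is about four workers per physical core, while a limit of 32 corresponds to one worker per physical core. This does not by itself show whether the errors come from memory limits.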
Ah, I had assumed that the CPU model was the same as on Summit. What do you get when you run lscpu and cat /proc/cpuinfo on a Traverse compute node (just curious)?
But this problem is likely due to the 4-way SMT, which wasn't present on the Tiger cluster that the code was originally written for.
Summit and Traverse are very similar, but not 100% identical.
(frnn) [rkube@traverse examples]$ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 4
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 6
Model: 2.3 (pvr 004e 1203)
Model name: POWER9, altivec supported
CPU max MHz: 3800.0000
CPU min MHz: 2300.0000
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-63
NUMA node8 CPU(s): 64-127
NUMA node252 CPU(s):
NUMA node253 CPU(s):
NUMA node254 CPU(s):
NUMA node255 CPU(s):
(frnn) [rkube@traverse examples]$ cat /proc/cpuinfo
processor : 0
cpu : POWER9, altivec supported
clock : 3683.000000MHz
revision : 2.3 (pvr 004e 1203)
processor : 1
cpu : POWER9, altivec supported
clock : 3683.000000MHz
revision : 2.3 (pvr 004e 1203)
processor : 2
cpu : POWER9, altivec supported
clock : 3683.000000MHz
revision : 2.3 (pvr 004e 1203)
...
processor : 127
cpu : POWER9, altivec supported
clock : 3533.000000MHz
revision : 2.3 (pvr 004e 1203)
timebase : 512000000
platform : PowerNV
model : 8335-GTH
machine : PowerNV 8335-GTH
firmware : OPAL
MMU : Radix
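The topology fields in the lscpu output above can also be read programmatically, for example to check that the reported CPU count agrees with sockets × cores × threads. This is only a sketch: parse_lscpu is a hypothetical helper, and SAMPLE is an abbreviated copy of the output above rather than a live call to lscpu.

```python
def parse_lscpu(text):
    """Hypothetical helper: turn 'Key: value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


# Abbreviated from the lscpu output shown above.
SAMPLE = """\
Architecture: ppc64le
CPU(s): 128
Thread(s) per core: 4
Core(s) per socket: 16
Socket(s): 2
"""

info = parse_lscpu(SAMPLE)
cores = int(info["Socket(s)"]) * int(info["Core(s) per socket"])
hw_threads = cores * int(info["Thread(s) per core"])

# The product of the topology fields should match the reported CPU count.
print(cores, hw_threads, hw_threads == int(info["CPU(s)"]))
```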
Preprocessing results in too much compute load for the Traverse head node.