PPPLDeepLearning / plasma-python

PPPL deep learning disruption prediction package
http://tigress-web.princeton.edu/~alexeys/docs-web/html/

Clarifying where to preprocess #81

Closed rkube closed 1 year ago

rkube commented 2 years ago

Preprocessing puts too much compute load on the Traverse head node.

buildbot-princeton commented 2 years ago

Can one of the admins verify this patch?

rkube commented 2 years ago

To preprocess the dataset on Traverse, I need to limit the number of threads used for preprocessing. See https://github.com/PPPLDeepLearning/plasma-python/issues/82

felker commented 2 years ago

There are 44 cores on a node of Traverse, right? Any reason why we can only spawn 32 threads?

felker commented 2 years ago

Also I am in favor of not changing the default conf.yaml to make it specific to Princeton-based systems. So:

fs_path: '/Users/'
...
max_cpus: -1

(/Users/ isn't an ideal default, but it is generic enough. Maybe it should be set to $HOME; I would need to check the parsing logic.)
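A minimal sketch of how generic defaults like these could be resolved at load time. The function name `resolve_conf` is hypothetical and not part of plasma-python, and treating `max_cpus: -1` as "no explicit limit" is an assumption about what the default means; the actual parsing logic would need to be checked.

```python
import os


def resolve_conf(conf):
    """Return a copy of conf with machine-specific values filled in.

    Hypothetical helper, not the package's real parsing code.
    """
    resolved = dict(conf)
    # Expand $HOME-style variables and a leading "~" so the default
    # config does not need to hard-code a site-specific path.
    resolved["fs_path"] = os.path.expanduser(os.path.expandvars(conf["fs_path"]))
    max_cpus = conf.get("max_cpus", -1)
    if max_cpus is None or max_cpus <= 0:
        # Assumption: a non-positive value means "use every CPU the OS reports".
        max_cpus = os.cpu_count() or 1
    resolved["max_cpus"] = max_cpus
    return resolved
```

With this approach a site-specific override (for example `max_cpus: 32`) would live in a user's own copy of the config rather than in the shipped default.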

rkube commented 2 years ago

Each Traverse node has 2 processors, 16 cores per processor, and 4 threads per core. When I run preprocessing with 126 threads, it starts off well but throws errors after a while. Maybe it is running into memory limits?
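The arithmetic behind these numbers can be sketched as follows: 2 sockets x 16 cores gives 32 physical cores, and 4-way SMT gives 128 logical CPUs. The helper `worker_count` and its `use_smt` flag are hypothetical names for illustration. Whether capping workers at the physical core count avoids the errors is not established in this thread.

```python
def worker_count(sockets, cores_per_socket, threads_per_core,
                 requested=-1, use_smt=False):
    """Pick a number of preprocessing workers for one node.

    Hypothetical helper: caps the request at the physical core count
    unless use_smt is True, in which case the cap is the logical CPU count.
    """
    physical = sockets * cores_per_socket
    logical = physical * threads_per_core
    limit = logical if use_smt else physical
    if requested is None or requested <= 0:
        # Non-positive request: take everything up to the cap.
        return limit
    return min(requested, limit)


print(worker_count(2, 16, 4))                               # 32
print(worker_count(2, 16, 4, requested=126, use_smt=True))  # 126
```

The returned value could then be passed as the pool size to something like `multiprocessing.Pool(processes=...)`.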

felker commented 2 years ago

Ah, I had assumed that the CPU model was the same as on Summit. What do you get when you run lscpu and cat /proc/cpuinfo on a Traverse compute node (just curious)?

But this problem is likely caused by the 4-way SMT, which was not present on the Tiger cluster that the code was originally written for.

rkube commented 2 years ago

Summit and Traverse are very similar, but not 100% identical.

(frnn) [rkube@traverse examples]$ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        6
Model:               2.3 (pvr 004e 1203)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2300.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
NUMA node252 CPU(s): 
NUMA node253 CPU(s): 
NUMA node254 CPU(s): 
NUMA node255 CPU(s): 
(frnn) [rkube@traverse examples]$ cat /proc/cpuinfo 
processor       : 0
cpu             : POWER9, altivec supported
clock           : 3683.000000MHz
revision        : 2.3 (pvr 004e 1203)

processor       : 1
cpu             : POWER9, altivec supported
clock           : 3683.000000MHz
revision        : 2.3 (pvr 004e 1203)

processor       : 2
cpu             : POWER9, altivec supported
clock           : 3683.000000MHz
revision        : 2.3 (pvr 004e 1203)
...
processor       : 127
cpu             : POWER9, altivec supported
clock           : 3533.000000MHz
revision        : 2.3 (pvr 004e 1203)

timebase        : 512000000
platform        : PowerNV
model           : 8335-GTH
machine         : PowerNV 8335-GTH
firmware        : OPAL
MMU             : Radix
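For reference, a rough sketch of how the topology fields in the `lscpu` output above could be read programmatically instead of by eye. The function names `parse_lscpu` and `core_counts` are made up for this example, and the parser assumes the plain `key: value` layout shown in the paste.

```python
def parse_lscpu(text):
    """Turn 'key: value' lines of lscpu output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only; values such as the model string
        # may contain further punctuation.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def core_counts(fields):
    """Derive physical and logical CPU counts from parsed lscpu fields."""
    threads = int(fields["Thread(s) per core"])
    cores = int(fields["Core(s) per socket"])
    sockets = int(fields["Socket(s)"])
    physical = cores * sockets
    return {"physical": physical, "logical": physical * threads}
```

On a live node the input text could come from `subprocess.run(["lscpu"], capture_output=True, text=True).stdout`. For the values pasted above this yields 32 physical cores and 128 logical CPUs, matching the `CPU(s): 128` line.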