RIKEN-SysSoft / mckernel

McKernel
GNU General Public License v2.0

About VASP run problem #6

Open p00380563 opened 3 years ago

p00380563 commented 3 years ago

I get an error when I run VASP with -np 22; with -np 21 it works fine. I am sure I allocated enough CPU cores to McKernel. Can someone tell me the reason?

Hardware: ARM server, 128 cores. Software: CentOS 7.6 + Open MPI 4.0.5 + McKernel 1.7.

The error is as follows:

[root@localhost VASP_bench_pt]# mpirun -np 22 --allow-run-as-root -x OMP_NUM_THREADS=1 /root/sysroot/bin/mcexec -n 22 ../../bin/vasp_std

There are not enough slots available in the system to satisfy the 22
slots that were requested by the application:

/root/sysroot/bin/mcexec

Either request fewer slots for your application, or make more slots
available for use.

The McKernel information:

[root@localhost sysroot]# ./sbin/ihkosctl 0 query cpu
4-110
[root@localhost sysroot]# ./sbin/ihkosctl 0 query mem
52428800000@0,52428800000@1,52428800000@2,52428800000@3
[root@localhost sysroot]# ./sbin/ihkosctl 0 kmsg
[ 0]: boot_param_size: 65536
[ 0]: %: GICv3
[ 0]: setup_arm64 done.
IHK/McKernel started.
[ 0]: ns_per_tsc: 10000
[ 0]: KCommand Line: hidos dump_level=24 timesharing
[ 0]: Physical memory: 0x2080310000 - 0x2cb5000000, 52425588736 bytes, 799951 pages available @ NUMA: 0
[ 0]: Physical memory: 0x4000000000 - 0x4c35000000, 52428800000 bytes, 800000 pages available @ NUMA: 1
[ 0]: Physical memory: 0x202000000000 - 0x202c35000000, 52428800000 bytes, 800000 pages available @ NUMA: 2
[ 0]: Physical memory: 0x204000000000 - 0x204c35000000, 52428800000 bytes, 800000 pages available @ NUMA: 3
[ 0]: NUMA: 0, Linux NUMA: 0, type: 1, available bytes: 52425588736, pages: 799951
[ 0]: NUMA: 1, Linux NUMA: 1, type: 1, available bytes: 52428800000, pages: 800000
[ 0]: NUMA: 2, Linux NUMA: 2, type: 1, available bytes: 52428800000, pages: 800000
[ 0]: NUMA: 3, Linux NUMA: 3, type: 1, available bytes: 52428800000, pages: 800000
[ 0]: NUMA 0 distances: 0 (10), 1 (16), 2 (32), 3 (33),
[ 0]: NUMA 1 distances: 1 (10), 0 (16), 2 (25), 3 (32),
[ 0]: NUMA 2 distances: 2 (10), 3 (16), 1 (25), 0 (32),
[ 0]: NUMA 3 distances: 3 (10), 2 (16), 1 (32), 0 (33),
[ 0]: Trampoline area: 0x0
[ 0]: # of cpus : 107
[ 0]: locals = ffff802080380000
[ 0]: BSP: 0 (HW ID: 4 @ NUMA 0)
[ 0]: BSP: booted 106 AP CPUs
[ 0]: Master channel init acked.
[ 0]: Using Linux work IRQ for IKC IPI.
[ 0]: Enable Host mapping vDSO.
IHK/McKernel booted.
[ 32]: schedule: WARNING can't schedule() while no preemption, cnt: 1
[ 32]: schedule: WARNING can't schedule() while no preemption, cnt: 1
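
The numbers above make it possible to check how many cores remain on the Linux side. The following is a minimal sketch, not a diagnosis: it expands the CPU list printed by `ihkosctl 0 query cpu` and subtracts it from the 128 cores of the server. The helper name `parse_cpu_list` and the constant `TOTAL_CORES` are made up for illustration. The link to the error is an assumption: without a hostfile, Open MPI by default derives the slot count from the cores it detects on the host, and it is assumed here that cores handed over to McKernel are no longer visible to it.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as "4-110" or "0-3,8" into a set of CPU IDs."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Inclusive range, e.g. "4-110" covers 4, 5, ..., 110.
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


TOTAL_CORES = 128  # from the hardware description in the post

mckernel_cpus = parse_cpu_list("4-110")  # output of "ihkosctl 0 query cpu"
linux_cpus = set(range(TOTAL_CORES)) - mckernel_cpus

print(len(mckernel_cpus), len(linux_cpus))  # prints "107 21"
```

With these inputs, 107 cores belong to McKernel and 21 stay with Linux, which is consistent with -np 21 working and -np 22 failing. Whether this is the actual cause is not confirmed anywhere in this thread.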

bgerofi commented 3 years ago

Hi, why are you booting on 107 CPUs? If you insist on running 22 ranks, it would be better to boot McKernel on a multiple of 22 cores, e.g., 88. For example, you could try mcreboot -c 40-127.

In general we prefer to run on a round number of CPU cores (preferably a power of 2). Also, it's better to leave a few cores from each NUMA node for Linux and to make sure that the McKernel cores are balanced across NUMA domains.
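
The core-count part of this advice can be checked mechanically before rebooting. The sketch below is only an illustration under assumptions: `count_cpus` and `describe` are hypothetical helpers, not part of IHK/McKernel, and the check covers only the "multiple of the rank count" and "power of 2" criteria. NUMA balance is not checked, because that needs the machine's core-to-node mapping, which is not given in the thread.

```python
def count_cpus(spec):
    """Count the CPUs in a list such as "40-127" or "0-3,8" (ranges are inclusive)."""
    total = 0
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            total += int(last) - int(first) + 1
        else:
            total += 1
    return total


def describe(spec, ranks):
    """Report how a candidate CPU range relates to the intended number of MPI ranks."""
    cores = count_cpus(spec)
    return {
        "cores": cores,
        "multiple_of_ranks": cores % ranks == 0,
        # A positive integer is a power of two if it has exactly one bit set.
        "power_of_two": cores > 0 and cores & (cores - 1) == 0,
    }


print(describe("4-110", 22))   # the range from the original report
print(describe("40-127", 22))  # the range suggested above
```

For 22 ranks, the range 4-110 gives 107 cores, which is not a multiple of 22, while 40-127 gives 88 cores, which is.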

p00380563 commented 3 years ago

Hi bgerofi, following your advice, I tried booting McKernel on 4 cores of NUMA node 0. The McKernel information:

[root@localhost sysroot]# ./sbin/mcreboot.sh -c 12-15 -m 50000m@0
[root@localhost sysroot]# ./sbin/ihkosctl 0 kmsg
[ 0]: boot_param_size: 65536
[ 0]: %: GICv3
[ 0]: setup_arm64 done.
IHK/McKernel started.
[ 0]: ns_per_tsc: 10000
[ 0]: KCommand Line: hidos dump_level=24 timesharing
[ 0]: Physical memory: 0x2080300000 - 0x2cb5000000, 52425654272 bytes, 799952 pages available @ NUMA: 0
[ 0]: NUMA: 0, Linux NUMA: 0, type: 1, available bytes: 52425654272, pages: 799952
[ 0]: NUMA 0 distances: 0 (10),
[ 0]: Trampoline area: 0x0
[ 0]: # of cpus : 4
[ 0]: locals = ffff802080340000
[ 0]: BSP: 0 (HW ID: 12 @ NUMA 0)
[ 0]: BSP: booted 3 AP CPUs
[ 0]: Master channel init acked.
[ 0]: Using Linux work IRQ for IKC IPI.
[ 0]: Enable Host mapping vDSO.
IHK/McKernel booted.

Then I tested HPL, but there is no output at all; I think the CPUs are hung.

[root@localhost Linux_Arm]# mpirun -np 4 --allow-run-as-root /root/sysroot/bin/mcexec -n 4 ./xhpl


I ran the mcstat command three times, but the output did not change:

[root@localhost sysroot]# ./bin/mcstat
------- memory (GB) -------  ------- tsc ------  --- thread ---
  total  current      max     system     user    current    max
 48.825    0.147    0.147         39        3         12     12
cpuacct_usage_percpu[0] = 5935640
cpuacct_usage_percpu[1] = 5942580
cpuacct_usage_percpu[2] = 5823800
cpuacct_usage_percpu[3] = 5974470
cpuacct_usage_percpu[4] = 0
cpuacct_usage_percpu[5] = 0
cpuacct_usage_percpu[6] = 0
cpuacct_usage_percpu[7] = 0
cpuacct_usage_percpu[8] = 0
cpuacct_usage_percpu[9] = 0
cpuacct_usage_percpu[10] = 0
cpuacct_usage_percpu[11] = 0
[root@localhost sysroot]# ./bin/mcstat
------- memory (GB) -------  ------- tsc ------  --- thread ---
  total  current      max     system     user    current    max
 48.825    0.147    0.147         39        3         12     12
cpuacct_usage_percpu[0] = 5935640
cpuacct_usage_percpu[1] = 5942580
cpuacct_usage_percpu[2] = 5823800
cpuacct_usage_percpu[3] = 5974470
cpuacct_usage_percpu[4] = 0
cpuacct_usage_percpu[5] = 0
cpuacct_usage_percpu[6] = 0
cpuacct_usage_percpu[7] = 0
cpuacct_usage_percpu[8] = 0
cpuacct_usage_percpu[9] = 0
cpuacct_usage_percpu[10] = 0
cpuacct_usage_percpu[11] = 0
[root@localhost sysroot]# ./bin/mcstat
------- memory (GB) -------  ------- tsc ------  --- thread ---
  total  current      max     system     user    current    max
 48.825    0.147    0.147         39        3         12     12
cpuacct_usage_percpu[0] = 5935640
cpuacct_usage_percpu[1] = 5942580
cpuacct_usage_percpu[2] = 5823800
cpuacct_usage_percpu[3] = 5974470
cpuacct_usage_percpu[4] = 0
cpuacct_usage_percpu[5] = 0
cpuacct_usage_percpu[6] = 0
cpuacct_usage_percpu[7] = 0
cpuacct_usage_percpu[8] = 0
cpuacct_usage_percpu[9] = 0
cpuacct_usage_percpu[10] = 0
cpuacct_usage_percpu[11] = 0
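
Comparing such snapshots by eye is error-prone, so here is a minimal sketch of doing it programmatically. It assumes that the `cpuacct_usage_percpu[N] = value` lines have the format shown in the output above and that these counters grow while a CPU is doing work; the names `percpu_usage` and `cpus_with_progress` are hypothetical. Unchanged counters are consistent with a hang, but they do not prove one.

```python
import re

# Matches lines such as "cpuacct_usage_percpu[3] = 5974470".
PERCPU = re.compile(r"cpuacct_usage_percpu\[(\d+)\]\s*=\s*(\d+)")


def percpu_usage(text):
    """Extract {cpu index: counter value} from captured mcstat output."""
    return {int(cpu): int(value) for cpu, value in PERCPU.findall(text)}


def cpus_with_progress(before, after):
    """Return the CPUs whose counter differs between two snapshots.

    CPUs that are missing from the earlier snapshot are reported as well.
    """
    old = percpu_usage(before)
    new = percpu_usage(after)
    return sorted(cpu for cpu in new if new[cpu] != old.get(cpu))


# Counters for CPUs 0-3, copied from the output in the post.
snapshot = """
cpuacct_usage_percpu[0] = 5935640
cpuacct_usage_percpu[1] = 5942580
cpuacct_usage_percpu[2] = 5823800
cpuacct_usage_percpu[3] = 5974470
"""

# Two identical snapshots: no CPU made progress.
print(cpus_with_progress(snapshot, snapshot))  # prints "[]"
```

An empty list for snapshots taken some time apart means that none of the listed counters moved, which is what the three mcstat runs above show.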

I don't know what happened. Maybe something in my configuration is wrong?

Then I stopped McKernel:

[root@localhost sysroot]# ./sbin/mcstop+release.sh
error: destroying OS instance 0
error: destroying OS instance 0
error: destroying OS instance 0
error: destroying OS instance 0
error: destroying OS instance 0
error: destroying LWK instance 0 failed
[root@localhost sysroot]#