Open pshipton opened 2 years ago
Hi @pshipton - si.c is no longer present for me to look at, so could you give me the command that you're using to check the processors please?
It should be 5 virtual processors on ibm03 (and also ibm01 & 02), while the rest currently have 4.
The test source is easy to find; this is the line mentioned in the failure: https://github.com/eclipse-openj9/openj9/blob/master/runtime/tests/port/si.c#L1331
The error code seems unlikely to have originated from j9sysinfo_get_number_CPUs_by_type(J9PORT_CPU_ONLINE), as it's not making an API call (https://github.com/eclipse/omr/blob/master/port/unix/omrsysinfo.c#L2705). The error code isn't necessarily related to the failure. It seems there is a mismatch between the result of j9sysinfo_get_number_CPUs_by_type(J9PORT_CPU_ONLINE) and j9sysinfo_get_processor_info(&procInfo).
I think the best bet is to modify the code to get more information. I can easily reproduce the failure on the machine.
Ah, code..... are you able to put together a small sample to try and reproduce it with?
I did try a few different commands earlier and everything that worked as root also seemed to be perfectly fine when run as the jenkins account.
I think this will explain it. Common in a DLPAR environment.
root@p8-java1-ibm03:[/root]lsdev -C | grep proc
proc0 Available 00-00 Processor
proc8 Available 00-08 Processor
proc16 Available 00-16 Processor
proc24 Available 00-24 Processor
proc40 Available 00-40 Processor
Putting together a "small" standalone sample is a lot of work, but I can give you instructions to run the test we have on p8-java1-ibm03. I can also modify it as necessary to debug.
export LIBPATH=/home/jenkins/peter/openj9-openjdk-jdk11/build/aix-ppc64-normal-server-release/images/jdk/lib/default/
/home/jenkins/peter/openj9-openjdk-jdk11/build/aix-ppc64-normal-server-release/images/test/openj9/pltest -include:j9sysinfo
I've modified it already to print the two values which don't match, and it's showing Invalid online processor count found 20 24. The first is the _system_configuration.ncpus and the second is the number found via iterating the processor list. Given the higher values found, it's not counting the processors, but the cpus (i.e. 4 cpus per processor). I'm in the process of modifying it further to get information about the cpus themselves.
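For what it's worth, the arithmetic behind the two numbers can be sketched roughly like this. It is only an illustration, not the pltest code: the processor numbers are taken from the lsdev listing above, the 4 cpus per processor figure is the one noted in this comment, and the helper names, the stride of 8 between procN names, and the idea that the missing slot is still being counted are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* From the thread: each processor appears to expose 4 cpus. */
#define CPUS_PER_PROCESSOR 4

/* Processor numbers reported by lsdev on p8-java1-ibm03 (no proc32). */
const int lsdev_procs[] = { 0, 8, 16, 24, 40 };

/* Hypothetical helper: cpus expected from the processors lsdev lists. */
int cpus_from_listed_processors(size_t listed_count)
{
    return (int)listed_count * CPUS_PER_PROCESSOR;
}

/*
 * Hypothetical helper: cpus counted if every slot between the lowest and
 * highest listed processor is treated as online. The stride between
 * processor numbers (8 in the listing) is an assumption.
 */
int cpus_from_all_slots(const int *procs, size_t n, int stride)
{
    assert(n > 0 && stride > 0);
    int slots = (procs[n - 1] - procs[0]) / stride + 1;
    return slots * CPUS_PER_PROCESSOR;
}
```

With the listing above, cpus_from_listed_processors(5) gives 20 and cpus_from_all_slots(lsdev_procs, 5, 8) gives 24, which would line up with the two values printed by the test if one extra processor slot is being counted.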
I suppose it is possible there is a code problem.
I notice there is a gap between proc24 and proc40, i.e. proc32 is missing. I wonder if the code is detecting proc32 online. Edit: oh ya, I see you noticed that too.
I tried to add another print to show the online cpu numbers, but it didn't work. Anyway, you can see that the j9sysinfo_testProcessorInfo test lists all the cpu numbers, which are sequential from 0 - 23. This test checks the same criteria, printing only the cpus it detects as online.
Maybe the machine just needs a reboot?
The cpus are detected via perfstat_cpu(), which apparently is returning performance stats for the processor which is no longer online. Since there are performance stats, the code is assuming the cpus are online.
https://github.com/eclipse/omr/blob/master/port/unix/omrsysinfo.c#L4896 https://github.com/eclipse/omr/blob/master/port/unix/omrsysinfo.c#L4926 https://github.com/eclipse/omr/blob/master/port/unix/omrsysinfo.c#L4959
@zl-wang do you think we need to improve this code? It's been this way for a very long time. Perhaps there is another way to determine if each cpu is online separately from getting the performance stats.
Rebooting did resolve the test failure.
Please see if perfstat_reset(), or maybe better, perfstat_reset_partial() (to selectively clear configuration from the cache), helps.
In the AIX world, you should look for the number of logical CPUs (equivalent to virtual CPUs on Linux). A CPU or virtual CPU on AIX corresponds to a core or virtualized core (each of which could have multiple threads, i.e. logical CPUs). AIX stopped running on bare-metal machines many years ago. perfstat_partition_config is a safe API to get the number of logical CPUs online, or _system_configuration.ncpus would do equivalently. I am thinking about whether there is any better alternative ...
Alternatively, you can continue to use the existing code by and large, but add code to test perfstat_cpu_t->state: if it is 0, it means that CPU is offline.
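A rough sketch of what that check could look like is below. To keep it compilable off AIX, it uses a hypothetical stand-in struct (mock_cpu_stat_t) instead of the real perfstat_cpu_t from libperfstat.h, and count_online_cpus is a made-up helper name, not something in omrsysinfo.c. The only thing taken from the suggestion above is that a state of 0 means the CPU is offline.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the one perfstat_cpu_t field used here.
 * The real structure lives in <libperfstat.h> and has many more members.
 */
typedef struct {
    char state; /* per the suggestion above: 0 means the CPU is offline */
} mock_cpu_stat_t;

/*
 * Count only the entries whose state is non-zero, instead of assuming
 * that every entry with performance stats belongs to an online CPU.
 */
size_t count_online_cpus(const mock_cpu_stat_t *stats, size_t n)
{
    size_t online = 0;
    assert(stats != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (stats[i].state != 0) {
            online++;
        }
    }
    return online;
}
```

If 24 entries come back and 4 of them have state 0, this returns 20, which would agree with _system_configuration.ncpus in the failing case. Whether the stale entries really report state 0 on this machine would still need to be confirmed.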
https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.functional_ppc64_aix_Personal_testList_1/107 - p8-java1-ibm03 cmdLineTester_pltest_aix_0
This is on one of the new AIX machines; I expect something isn't set up correctly in order for the JVM to get the processor count.