eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.

AIX cmdLineTester_pltest testOnlineProcessorCount Permission denied #14370

Open pshipton opened 2 years ago

pshipton commented 2 years ago

https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.functional_ppc64_aix_Personal_testList_1/107 - p8-java1-ibm03 cmdLineTester_pltest_aix_0

17:40:23   [ERR] 1: j9sysinfo_testOnlineProcessorCount
17:40:23   [ERR]    /home/jenkins/workspace/Build_JDK11_ppc64_aix_Personal/openj9/runtime/tests/port/si.c line 1331: Invalid online processor count found.
17:40:23   [ERR] 
17:40:23   [ERR]        LastErrorNumber: -102
17:40:23   [ERR]        LastErrorMessage: Permission denied

This is on one of the new AIX machines. I expect something isn't set up correctly for the JVM to get the processor count.

sej-jackson commented 2 years ago

Hi @pshipton - si.c is no longer present for me to look at, so could you give me the command that you're using to check the processors please?

It should be 5 virtual processors on ibm03 (and also ibm01 & 02), while the rest currently have 4.

pshipton commented 2 years ago

The test source is easy, this is the line mentioned in the failure. https://github.com/eclipse-openj9/openj9/blob/master/runtime/tests/port/si.c#L1331

The error code seems unlikely to have originated from j9sysinfo_get_number_CPUs_by_type(J9PORT_CPU_ONLINE), as it's not making an API call (https://github.com/eclipse/omr/blob/master/port/unix/omrsysinfo.c#L2705). The error code isn't necessarily related to the failure. It seems there is a mismatch between the result of j9sysinfo_get_number_CPUs_by_type(J9PORT_CPU_ONLINE) and j9sysinfo_get_processor_info(&procInfo).

I think the best bet is to modify the code to get more information. I can easily reproduce the failure on the machine.

sej-jackson commented 2 years ago

Ah, code..... are you able to put together a small sample to try and reproduce it with?

I did try a few different commands earlier and everything that worked as root also seemed to be perfectly fine when run as the jenkins account.

aixtools commented 2 years ago

I think this will explain it. It is common in a DLPAR environment.

pshipton commented 2 years ago

Putting together a "small" standalone sample is a lot of work, but I can give you instructions to run the test we have on p8-java1-ibm03. I can also modify it as necessary to debug.

export LIBPATH=/home/jenkins/peter/openj9-openjdk-jdk11/build/aix-ppc64-normal-server-release/images/jdk/lib/default/
/home/jenkins/peter/openj9-openjdk-jdk11/build/aix-ppc64-normal-server-release/images/test/openj9/pltest  -include:j9sysinfo

I've already modified it to print the two values that don't match, and it shows "Invalid online processor count found 20 24". The first is _system_configuration.ncpus and the second is the number found by iterating the processor list. Given the higher values found, it's counting not the processors but the cpus (i.e. 4 cpus per processor). I'm in the process of modifying it further to get information about the cpus themselves.

I suppose it is possible there is a code problem.
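The two printed values can be cross-checked against the 5 virtual processors expected on ibm03. This is a minimal sketch of that arithmetic, assuming 4 logical cpus per processor as stated above. The helper name processors_from_cpus is hypothetical and is not part of the port library.

```c
#include <assert.h>

/* Hypothetical helper: convert a logical cpu count into a processor
 * count, assuming a fixed number of logical cpus per processor. */
int processors_from_cpus(int logical_cpus, int cpus_per_processor)
{
    assert(cpus_per_processor > 0);
    return logical_cpus / cpus_per_processor;
}
```

With 4 cpus per processor, 20 maps to the expected 5 processors, while 24 maps to 6, so iterating the processor list appears to pick up one processor more than _system_configuration.ncpus reports.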

pshipton commented 2 years ago

I notice there is a gap between proc24 and proc40, i.e. proc32 is missing. I wonder if the code is detecting proc32 online. Edit: oh ya, I see you noticed that too.

pshipton commented 2 years ago

I tried to add another print to show the online cpu numbers, but it didn't work. Anyway, you can see that the j9sysinfo_testProcessorInfo test lists all the cpu numbers, which are sequential from 0 to 23. This test checks the same criteria, printing only the cpus it detects as online.

pshipton commented 2 years ago

Maybe the machine just needs a reboot?

pshipton commented 2 years ago

The cpus are detected via perfstat_cpu(), which apparently is returning performance stats for a processor that is no longer online. Since there are performance stats, the code assumes the cpus are online.

https://github.com/eclipse/omr/blob/master/port/unix/omrsysinfo.c#L4896
https://github.com/eclipse/omr/blob/master/port/unix/omrsysinfo.c#L4926
https://github.com/eclipse/omr/blob/master/port/unix/omrsysinfo.c#L4959

@zl-wang do you think we need to improve this code? It's been this way for a very long time. Perhaps there is another way to determine if each cpu is online separately from getting the performance stats.

pshipton commented 2 years ago

Rebooting did resolve the test failure.

aixtools commented 2 years ago

Please see if perfstat_reset(), or maybe better, perfstat_reset_partial() (to selectively clear configuration from the cache), helps.

zl-wang commented 2 years ago

In the AIX world, you should look for the number of logical CPUs (equivalent to virtual CPUs on Linux). A CPU or virtual CPU on AIX corresponds to a core or virtualized core, each of which can have multiple threads (logical CPUs). AIX stopped running on bare-metal machines many years ago. perfstat_partition_config is a safe API to get the number of logical CPUs online, or _system_configuration.ncpus would do equivalently. I am still thinking about whether there is a better alternative ...

zl-wang commented 2 years ago

Alternatively, you can largely keep the existing code, but add a check of perfstat_cpu_t->state.

If it is 0, that CPU is offline.
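The suggested check could look roughly like the sketch below. So that it compiles off AIX, it uses a hypothetical stand-in struct, mock_cpu_stat, rather than the real perfstat_cpu_t, and the function name count_online_cpus is also made up for illustration. The only behaviour taken from this thread is that a state of 0 means the cpu is offline.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-cpu record returned by perfstat_cpu();
 * only the field discussed above is modelled. */
typedef struct mock_cpu_stat {
    char name[64];
    char state; /* assumed: 0 means offline, non-zero means online */
} mock_cpu_stat;

/* Count only the entries whose state marks them online, instead of
 * treating every entry that has performance stats as online. */
size_t count_online_cpus(const mock_cpu_stat *cpus, size_t count)
{
    size_t online = 0;
    assert(cpus != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (cpus[i].state != 0) {
            online++;
        }
    }
    return online;
}
```

The idea is that an entry left behind by a removed processor would still appear in the list but would be skipped by the state test, so the count could agree with _system_configuration.ncpus without a reboot.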