dsouzai opened this issue 4 years ago
fyi @fjeremic @andrewcraik @gita-omr @knn-k
My personal preference is 1. Targeting the processor the SCC is generated on is what currently happens, but that puts the burden on the user to find an old enough machine. I also do not prefer having a mapping between CPU version and some set of CPU features, because it doesn't reflect the reality of which CPU features are available on which CPU version, and is a non-standard mapping that has to be maintained and documented.
I'm fine starting with a system that specifies all the features to target so we have something working.
From a usability perspective, it won't take long for us (and end users!) to need a way to compactly specify groups of features. If this is a quasi-mapping to processor type as proposed in [3], great. Some other logical way to group features together is fine by me as well.
I think grouping based on the hardware platform only makes sense when the platform itself defines such a logical grouping. For example, the aarch64 architecture defines groups of features as a package, and optional packages can be included or not in an implementation - grouping our queries to match these architectural groupings makes sense. On a platform like x86, where feature flags determine feature support and feature support is not necessarily tied to a generation across manufacturers, trying to tie these features to a generation / hardware feature group makes less sense. As a result, option 3 is not the right general solution in my mind.
I'm not sure if 1 or 2 is right, or if we should just define the 'default' AOT configuration and provide documentation and help on how to enable/disable features. Logical groupings based on what the compiler is accelerating or similar might make sense, but artificial hardware groupings not reflected in the underlying hardware platform (e.g. mapping features to processor levels when the feature is not truly tied to the processor level) seem counterproductive.
I agree the usability story does need some consideration/attention. Given that a major use case is a docker image or similar, are there any container properties we could use to help figure out what the base config should be?
Another option might be something like the Xtrace and Xdump builders to help build an AOT hardware config?
How do logical groupings differ from the mtune/march settings of GCC?
We have to keep the goal in mind - making it easy for a user to create a portable SCC which includes AOT code that will be broadly applicable. Mapping features to logical groups, whether mtune/march-style or our own creation, gives users a reasonable way to control the baseline systems they want to target.
Users don't understand, and don't want to understand, the right set of flags to enable for the hardware they are deploying on. They will, at most, know they are targeting "Haswell" processors, or "Skylake", or .... when they buy their instances from cloud providers. They just want their AOT code to work in that world, even if it's not the fastest they could get, as they don't control the hardware.
Another option might be something like the Xtrace and Xdump builders to help build an AOT hardware config?
This sounds a lot like having pre-built configurations :)
After taking a closer look at gcc's march option, I don't think we should follow that approach. GCC does maintain a mapping between processor version and some set of processor features. However, GCC is widely used, and hence the mapping they maintain can be more or less considered standard. I would rather not define a new mapping that's only applicable to us. I also don't want to have to depend on GCC's mapping.
From a usability perspective, it won't take long for us (and end users!) to need a way to compactly specify groups of features.
I'm not convinced that's true. This is something we're only trying to define for AOT code. As you said above:
Users don't understand, and don't want to understand, the right set of flags to enable for the hardware they are deploying on.
They just want their AOT code to work in that world, even if it's not the fastest they could get, as they don't control the hardware.
I don't see why anyone would care whether the AOT-generated code is targeting an ivybridge machine even though the JVM is running on, say, a skylake, so long as they get the benefits.
Having a single definition makes it easier to document and makes it consistent no matter where the SCC is generated (portability being the main goal we're after here). JIT code is still going to target the machine the JVM is running on, so the idea here is the same as always: AOT code gets your app started fast, JIT recompilation gets your app's steady state performance fast.
The set of default features I'm thinking of shouldn't be targeting something as old as, say, core2duo or P4. We can pick some reasonable set of features that should exist on most machines today, and we can easily add downgrading logic to take care of what happens when some features don't.
We now have the infrastructure to specify processor types on the fly for each compilation. It's time to decide on the actual set of portable AOT processor defaults for each platform. @dsouzai @vijaysun-omr @mpirvu @andrewcraik @fjeremic @gita-omr . Could you guys please have some discussion to get this going? Thanks!
Also, Marius suggested that we should be able to specify the processor via command line options.
Here's the list of processors:
/* List of all processors that are currently supported by OMR's processor detection */
typedef enum OMRProcessorArchitecture {
OMR_PROCESSOR_UNDEFINED,
OMR_PROCESSOR_FIRST,
// 390 Processors
OMR_PROCESSOR_S390_FIRST = OMR_PROCESSOR_FIRST,
OMR_PROCESSOR_S390_UNKNOWN = OMR_PROCESSOR_S390_FIRST,
OMR_PROCESSOR_S390_GP6,
OMR_PROCESSOR_S390_Z10 = OMR_PROCESSOR_S390_GP6,
OMR_PROCESSOR_S390_GP7,
OMR_PROCESSOR_S390_GP8,
OMR_PROCESSOR_S390_GP9,
OMR_PROCESSOR_S390_Z196 = OMR_PROCESSOR_S390_GP9,
OMR_PROCESSOR_S390_GP10,
OMR_PROCESSOR_S390_ZEC12 = OMR_PROCESSOR_S390_GP10,
OMR_PROCESSOR_S390_GP11,
OMR_PROCESSOR_S390_Z13 = OMR_PROCESSOR_S390_GP11,
OMR_PROCESSOR_S390_GP12,
OMR_PROCESSOR_S390_Z14 = OMR_PROCESSOR_S390_GP12,
OMR_PROCESSOR_S390_GP13,
OMR_PROCESSOR_S390_Z15 = OMR_PROCESSOR_S390_GP13,
OMR_PROCESSOR_S390_GP14,
OMR_PROCESSOR_S390_ZNEXT = OMR_PROCESSOR_S390_GP14,
OMR_PROCESSOR_S390_LAST = OMR_PROCESSOR_S390_GP14,
// ARM Processors
OMR_PROCESSOR_ARM_FIRST,
OMR_PROCESSOR_ARM_UNKNOWN = OMR_PROCESSOR_ARM_FIRST,
OMR_PROCESSOR_ARM_V6,
OMR_PROCESSOR_ARM_V7,
OMR_PROCESSOR_ARM_LAST = OMR_PROCESSOR_ARM_V7,
// ARM64 / AARCH64 Processors
OMR_PROCESSOR_ARM64_FISRT,
OMR_PROCESSOR_ARM64_UNKNOWN = OMR_PROCESSOR_ARM64_FISRT,
OMR_PROCESSOR_ARM64_V8_A,
OMR_PROCESSOR_ARM64_LAST = OMR_PROCESSOR_ARM64_V8_A,
// PPC Processors
OMR_PROCESSOR_PPC_FIRST,
OMR_PROCESSOR_PPC_UNKNOWN = OMR_PROCESSOR_PPC_FIRST,
OMR_PROCESSOR_PPC_RIOS1,
OMR_PROCESSOR_PPC_PWR403,
OMR_PROCESSOR_PPC_PWR405,
OMR_PROCESSOR_PPC_PWR440,
OMR_PROCESSOR_PPC_PWR601,
OMR_PROCESSOR_PPC_PWR602,
OMR_PROCESSOR_PPC_PWR603,
OMR_PROCESSOR_PPC_82XX,
OMR_PROCESSOR_PPC_7XX,
OMR_PROCESSOR_PPC_PWR604,
// The following processors support SQRT in hardware
OMR_PROCESSOR_PPC_HW_SQRT_FIRST,
OMR_PROCESSOR_PPC_RIOS2 = OMR_PROCESSOR_PPC_HW_SQRT_FIRST,
OMR_PROCESSOR_PPC_PWR2S,
// The following processors are 64-bit implementations
OMR_PROCESSOR_PPC_64BIT_FIRST,
OMR_PROCESSOR_PPC_PWR620 = OMR_PROCESSOR_PPC_64BIT_FIRST,
OMR_PROCESSOR_PPC_PWR630,
OMR_PROCESSOR_PPC_NSTAR,
OMR_PROCESSOR_PPC_PULSAR,
// The following processors support the PowerPC AS architecture
// PPC AS includes the new branch hint 'a' and 't' bits
OMR_PROCESSOR_PPC_AS_FIRST,
OMR_PROCESSOR_PPC_GP = OMR_PROCESSOR_PPC_AS_FIRST,
OMR_PROCESSOR_PPC_GR,
// The following processors support VMX
OMR_PROCESSOR_PPC_VMX_FIRST,
OMR_PROCESSOR_PPC_GPUL = OMR_PROCESSOR_PPC_VMX_FIRST,
OMR_PROCESSOR_PPC_HW_ROUND_FIRST,
OMR_PROCESSOR_PPC_HW_COPY_SIGN_FIRST = OMR_PROCESSOR_PPC_HW_ROUND_FIRST,
OMR_PROCESSOR_PPC_P6 = OMR_PROCESSOR_PPC_HW_COPY_SIGN_FIRST,
OMR_PROCESOSR_PPC_ATLAS,
OMR_PROCESSOR_PPC_BALANCED,
OMR_PROCESSOR_PPC_CELLPX,
// The following processors support VSX
OMR_PROCESSOR_PPC_VSX_FIRST,
OMR_PROCESSOR_PPC_P7 = OMR_PROCESSOR_PPC_VSX_FIRST,
OMR_PROCESSOR_PPC_P8,
OMR_PROCESSOR_PPC_P9,
OMR_PROCESSOR_PPC_LAST = OMR_PROCESSOR_PPC_P9,
// X86 Processors
OMR_PROCESSOR_X86_FIRST,
OMR_PROCESSOR_X86_UNKNOWN = OMR_PROCESSOR_X86_FIRST,
OMR_PROCESSOR_X86_INTEL_FIRST,
OMR_PROCESSOR_X86_INTELPENTIUM = OMR_PROCESSOR_X86_INTEL_FIRST,
OMR_PROCESSOR_X86_INTELP6,
OMR_PROCESSOR_X86_INTELPENTIUM4,
OMR_PROCESSOR_X86_INTELCORE2,
OMR_PROCESSOR_X86_INTELTULSA,
OMR_PROCESSOR_X86_INTELNEHALEM,
OMR_PROCESSOR_X86_INTELWESTMERE,
OMR_PROCESSOR_X86_INTELSANDYBRIDGE,
OMR_PROCESSOR_X86_INTELIVYBRIDGE,
OMR_PROCESSOR_X86_INTELHASWELL,
OMR_PROCESSOR_X86_INTELBROADWELL,
OMR_PROCESSOR_X86_INTELSKYLAKE,
OMR_PROCESSOR_X86_INTEL_LAST = OMR_PROCESSOR_X86_INTELSKYLAKE,
OMR_PROCESSOR_X86_AMD_FIRST,
OMR_PROCESSOR_X86_AMDK5 = OMR_PROCESSOR_X86_AMD_FIRST,
OMR_PROCESSOR_X86_AMDK6,
OMR_PROCESSOR_X86_AMDATHLONDURON,
OMR_PROCESSOR_X86_AMDOPTERON,
OMR_PROCESSOR_X86_AMDFAMILY15H,
OMR_PROCESSOR_X86_AMD_LAST = OMR_PROCESSOR_X86_AMDFAMILY15H,
OMR_PROCESSOR_X86_LAST = OMR_PROCESSOR_X86_AMDFAMILY15H,
OMR_PROCESOR_RISCV32_UNKNOWN,
OMR_PROCESOR_RISCV64_UNKNOWN,
OMR_PROCESSOR_DUMMY = 0x40000000 /* force wide enums */
} OMRProcessorArchitecture;
Refer to omr/include_core/omrport.h for the feature flags.
I am thinking that being able to select the features for AOT through command line options is still important. In some instances, the IT people may know that the JVM is not going to run on machines older than X (pick your architecture) and may want to target that architecture as the baseline. Therefore I am in favor of supporting the logical grouping @DanHeidinga mentioned. To get this going off the ground we could have a single grouping to start with and gradually add new groupings targeting newer architectures.
@harryyu1994 just so I'm clear on what you are expecting when you said "It's time to decide on the actual set of portable AOT processor defaults for each platform."
... did you mean we should pick the default processor for each platform from the lists that you pasted in your last comment ?
One approach could be to pick some processor that is reasonably old, such that a large proportion of users can reasonably be expected to have something newer than that, and then force the codegen to assume that processor type and see how much of a regression you get from handicapping the codegen in this way before deciding if we should go ahead or not. Is this the approach you were also thinking of, and if so, were you essentially looking for someone familiar with the different codegens to make a processor suggestion for their platform ?
@harryyu1994 just so I'm clear on what you are expecting when you said
It's time to decide on the actual set of portable AOT processor defaults for each platform.
... did you mean we should pick the default processor for each platform from the lists that you pasted in your last comment ?
Yes, we should pick a default processor for each platform from the list I pasted, as well as default features (for x86).
One approach could be to pick some processor that is reasonably old, such that a large proportion of users can reasonably be expected to have something newer than that, and then force the codegen to assume that processor type and see how much of a regression you get from handicapping the codegen in this way before deciding if we should go ahead or not. Is this the approach you were also thinking of, and if so, were you essentially looking for someone familiar with the different codegens to make a processor suggestion for their platform ?
Yes, I'm looking for processor suggestions from people.
For x86 I am proposing OMR_PROCESSOR_X86_INTELSANDYBRIDGE to be the baseline for relocatable code. It's a 9-year-old architecture that has AVX and AES instructions. If at all possible I would like this baseline to work on both Intel and AMD processors, as we start to see more and more AMD EPYC instances in the cloud.
Sounds reasonable to me, though I guess the true test will be a performance run to see how much we lose by assuming this older level of architecture on a current machine, e.g. Skylake.
For Z and Power, just a single processor type would be sufficient as the feature flags are set based on the processor type. For x86, we need the set of feature flags as well. (The processor type may not matter that much)
We need to come up with a mapping of processor type to feature flags for x86
Note to self: I need to watch out for the few instances where processor type does matter on x86, and also need to look into whether it's possible for the baseline to work on both Intel and AMD.
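To make the processor-type-to-feature-flag mapping idea concrete, here is a minimal sketch of what it could look like for the proposed Sandy Bridge baseline. The helper name is hypothetical, the exact feature list is still to be decided, and a port library access macro is assumed to be in scope (as in the code snippets quoted later in this thread); the feature constants come from omr/include_core/omrport.h.

// Hypothetical sketch: map the portable x86 baseline processor type to a set
// of feature flags. The exact list is still to be decided.
static void
setPortableX86Baseline(OMRProcessorDesc *desc)
{
   desc->processor = OMR_PROCESSOR_X86_INTELSANDYBRIDGE;

   // Features the Sandy Bridge baseline is assumed to guarantee
   omrsysinfo_processor_set_feature(desc, OMR_FEATURE_X86_SSE2, TRUE);
   omrsysinfo_processor_set_feature(desc, OMR_FEATURE_X86_SSSE3, TRUE);
   omrsysinfo_processor_set_feature(desc, OMR_FEATURE_X86_SSE4_1, TRUE);
   omrsysinfo_processor_set_feature(desc, OMR_FEATURE_X86_POPCNT, TRUE);
   omrsysinfo_processor_set_feature(desc, OMR_FEATURE_X86_AESNI, TRUE);
   omrsysinfo_processor_set_feature(desc, OMR_FEATURE_X86_AVX, TRUE);
}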
These are the flags listed for my machine which uses ivybridge CPUs:
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
I am reading that ivybridge added rdrand and F16C instructions on top of sandybridge, so we should exclude those from the list above.
We should care only about the set of the flags that the optimizer is trying to exploit though.
We should care only about the set of the flags that the optimizer is trying to exploit though.
In my upcoming changes:
// Only enable the features that compiler currently uses
uint32_t enabledFeatures [] = {OMR_FEATURE_X86_FPU, OMR_FEATURE_X86_CX8, OMR_FEATURE_X86_CMOV,
OMR_FEATURE_X86_MMX, OMR_FEATURE_X86_SSE, OMR_FEATURE_X86_SSE2,
OMR_FEATURE_X86_SSSE3, OMR_FEATURE_X86_SSE4_1, OMR_FEATURE_X86_POPCNT,
OMR_FEATURE_X86_AESNI, OMR_FEATURE_X86_OSXSAVE, OMR_FEATURE_X86_AVX,
OMR_FEATURE_X86_FMA, OMR_FEATURE_X86_HLE, OMR_FEATURE_X86_RTM};
We maintain this array that contains all the features that the optimizer tries to exploit. We will mask out all the features that we don't care about.
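As a rough illustration of the masking step described above (not the actual implementation): the compiler-exploited features can be turned into a bit mask and ANDed into the processor description. The helper name is hypothetical, and the feature/32 bit layout is an assumption based on how the feature flag values index bits in the features[] array in omrport.h.

// Hypothetical sketch of masking out features the compiler does not exploit.
static void
maskToCompilerExploitedFeatures(OMRProcessorDesc *desc, const uint32_t *enabledFeatures, size_t numEnabledFeatures)
{
   uint32_t mask[OMRPORT_SYSINFO_FEATURES_SIZE] = {0};

   // Build a bit mask from the list of features the compiler exploits
   for (size_t i = 0; i < numEnabledFeatures; i++)
      {
      uint32_t feature = enabledFeatures[i];
      mask[feature / 32] |= ((uint32_t)1 << (feature % 32));
      }

   // Clear every detected feature bit that is not in the mask
   for (int i = 0; i < OMRPORT_SYSINFO_FEATURES_SIZE; i++)
      desc->features[i] &= mask[i];
}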
Had some offline discussion with Marius and here are some notes:
These are the features present in enabledFeatures[] that a sandybridge architecture does not have:
OMR_FEATURE_X86_OSXSAVE --> OS has enabled XSETBV/XGETBV instructions to access XCR0
OMR_FEATURE_X86_FMA --> FMA extensions using YMM state
OMR_FEATURE_X86_HLE --> Hardware lock elision
OMR_FEATURE_X86_RTM --> Restricted transactional memory
@harryyu1994 for your 3rd "note" did you mean that
a) the default processor level is newer than the host, i.e. it is some really old machine OR b) the user can specify an option to produce AOT code for a newer processor than the host
And does "disabling" mean silently not generating AOT code in that scenario OR something like reporting a usage error of some sort ?
2. If AOT is already available in the SCC, we want to use the processor in the SCC instead of what the user specifies. Output a warning message to inform the user that their processor wasn't used.
I'm reading this to mean there is 1 processor defined for the SCC. Does it make sense to allow different layers of a multi-layer SCC to define a different, more restrictive (ie: newer), processor level?
We should agree on whether this is a desirable principle rather than worry about the details now.
@harryyu1994 for your 3rd "note" did you mean that
a) the default processor level is newer than the host, i.e. it is some really old machine OR b) the user can specify an option to produce AOT code for a newer processor than the host
And does "disabling" mean silently not generating AOT code in that scenario OR something like reporting a usage error of some sort ?
@vijaysun-omr I meant b). I was thinking about reporting a usage error to the user. Would we ever have a use case where the user only wants to generate AOT code for a certain processor level? So basically preparing the SCC for others.
I'm reading this to mean there is 1 processor defined for the SCC. Does it make sense to allow different layers of a multi-layer SCC to define a different, more restrictive (ie: newer), processor level?
We should agree on whether this is a desirable principle rather than worry about the details now.
@DanHeidinga
My understanding of the multi-layer cache is that it's for storing SCCs in docker images. Basically each layer of the docker image will want to have its own SCC. So for multi-layers I was thinking something like this: The base layer will have the least restrictive processor and the outermost layer will have the most restrictive processor. When we add another layer, as long as it's equivalent or more restrictive than the layer below it we are going to allow it. Or maybe everything in docker should use the lowest possible processor settings to maximize portability.
What I meant in the original notes was for a different scenario (so basically only considering the current outermost layer): We already have an SCC that the JVM can run with. Then if the user wants to change it to use a different processor, maybe we want to reject that operation? Or maybe we want to treat it as an overwrite operation where we ditch the original AOT code in the SCC and generate new code. (Not sure if this is feasible; the user could just delete the SCC and generate a new one.)
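As a sketch of the layering rule above (a new layer must be equally or more restrictive than the layer below it), the check could reuse the feature-subset idea from the x86 compatibility check quoted later in this thread. The helper name is hypothetical; this is not existing OpenJ9 code.

// Hypothetical sketch: allow a new SCC layer only if it targets every feature
// the layer below it targets (i.e. it is equally or more restrictive).
static bool
isNewLayerAllowed(const OMRProcessorDesc &lowerLayer, const OMRProcessorDesc &newLayer)
{
   for (int i = 0; i < OMRPORT_SYSINFO_FEATURES_SIZE; i++)
      {
      // Every feature targeted by the lower layer must also be targeted by the new layer
      if ((lowerLayer.features[i] & newLayer.features[i]) != lowerLayer.features[i])
         return false;
      }
   return true;
}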
We already have an SCC that the JVM can run with. Then if the user wants to change it to use a different processor, maybe we want to reject that operation? Or maybe we want to treat it as an overwrite operation where we ditch the original AOT code in the SCC and generate new code. (Not sure if this is feasible; the user could just delete the SCC and generate a new one.)
Another approach would be to associate the AOT code with the Processor it requires. This would allow mixing AOT with different Processor requirements in the same cache. Not terribly useful when running on a host system but possibly more useful when a cache is being shipped around in Docker or other ways
In relation to comment : https://github.com/eclipse/openj9/issues/7966#issuecomment-634234270
I feel it is okay to report a usage error in the case when a user can specify an option to produce AOT code for a newer processor than the host. If this functionality is deemed important in the future, it can be added at that time but I don't see the need to do this work now.
I do see merit in @harryyu1994 comment "The base layer will have the least restrictive processor and the outermost layer will have the most restrictive processor. When we add another layer, as long as it's equivalent or more restrictive than the layer below it we are going to allow it. ", i.e. philosophically this may be something to allow.
In practical/implementation terms, I wonder if this is a use case that we support down the line rather than get bogged down with at present.
I do see merit in @harryyu1994 comment "The base layer will have the least restrictive processor and the outermost layer will have the most restrictive processor. When we add another layer, as long as it's equivalent or more restrictive than the layer below it we are going to allow it. ", i.e. philosophically this may be something to allow.
In practical/implementation terms, I wonder if this is a use case that we support down the line rather than get bogged down with at present.
How usable is this feature without this?
My mental model is that the docker image may be built up by different parts of the CI at different times on different machines (lots of variability in that process)!
A user may pull an Adopt created docker image with a default cache in it and then in their own CI build a common framework layer with a new cache layer. Finally, each app may reuse that image and add their own classes into the cache.
If all three of those operations happen on different machines, we need to either "pin" the processor level to the one created in the JDK base image (ie: the Adopt layer) or allow each layer to add further restrictions.
A user doesn't want to have a bigger docker image due to AOT code they can't execute.
Oh, and we'll need some API, maybe in the SCC print-stats option, to tell the current processor level of the cache so later layers can specify exactly the same one.
I thought the proposal was still to settle on some reasonably old arch level (e.g. ivybridge on X86) such that "by default" the risk of adding AOT code that would not run is fairly low at any layer of the SCC.
This default arch level only changes if/when a user explicitly starts upgrading the arch version by specifying an option because they know something about the servers being targeted, but this wasn't expected to be particularly common in practice (at least that is what I thought).
I think there are many ways we can improve the usability of the SCC in the Docker situation. However, it is going to be a non-trivial amount of work. Therefore, I think the approach we should take is:
1 and 2 will solve the immediate limitations we have right now. 3 will facilitate the ability to do 4. Once we have all 4 of these in place, then we can do what @DanHeidinga would like to see, namely different arch targets at different layers of the SCC.
~The risk of different arch targets at different layers of the SCC means that depending on where you're running, the JVM might not be able to use all the code in the SCC.~ Dan addressed this above:
If all three of those operations happen on different machines, we need to either "pin" the processor level to the one created in the JDK base image (ie: the Adopt layer) or allow each layer to add further restrictions.
Regarding:
The base layer will have the least restrictive processor and the outermost layer will have the most restrictive processor.
This would be relatively trivial to do once we have 1-4 above implemented.
That sounds like a reasonable approach.
To make sure we agree on what is going to be implemented as a first step I am detailing the workflow for (1) and (2):
At step 4.1 it's debatable whether to fail the JVM with an error message (like I proposed) or silently compile for the current host. I feel that the latter solution, while improving consumability, may create silent performance problems.
4.1 is the only point that concerns me, but I can see an argument for either direction, so I'm ok with whichever choice we take.
Writing this out explicitly as I had to think about it for a while before coming to the conclusion that Marius's workflow is the right one.
With AOT in the SCC, there are two distinct use cases:
The proposed workflow makes the deployment easier for the second case by making portability the default. It trades performance (tbd) in the local machine case, which can be restored by an option, for the easier portability.
Marius's flow, at a high level, is:
4.1 If SCC is empty, fail the JVM and ask user to either
This is the right default as it's easier to make an error case legal in the future than to turn a legal case into an error.
In the future, we'll need to work through how this changes when specifying an explicit processor level different than the portability processor or current host.
Isn't step 4.1, when running the SCC on a local machine (i.e. not for packaging into docker), turning something that used to be legal into an error condition ?
Isn't step 4.1, when running the SCC on a local machine (i.e. not for packaging into docker), turning something that used to be legal into an error condition ?
Yes, which is why this point is contentious. The scenario it tries to avoid is the following:
This problem would not appear if we had an automatic way of figuring out whether the SCC is for the local machine or for a container.
I'll mention this since we are still designing the solution here and see what you think.
Rather than break existing users who are building an SCC on a local machine that is very old, should we consider an option (say -Xshareclasses:portable) that when specified would enable the portable SCC logic ?
The advantage of this would be that we do not break existing users in the local machine scenario, and for those building an SCC into their containers (a relatively new use case that we are supporting) it becomes recommended (but not mandatory) to run with the "portable" option in the build step. If they did not specify the "portable" option, they get a non-portable SCC with the set of problems that we have today before Harry's work, i.e. it does not make any existing use case worse while still offering a way to improve the use case that is problematic currently, namely building an SCC into the container image.
If we went this route, we would obviously specify the "portable" option when we are building our own SCC in the OpenJ9 docker image and also in any other cases we are aware of where an SCC is being packaged into a docker image, e.g. Open Liberty.
Rather than break existing users who are building an SCC on a local machine that is very old, should we consider an option (say -Xshareclasses:portable) that when specified would enable the portable SCC logic ?
It depends on how old the default processor we pick will be. How many users (rough estimate) will be affected by this change?
I started to write out an argument about why needing a sub-option is a worse user experience and then had to catch myself, because the SCC already requires a lot of sub-options: usually cache name, cachedir, sizes, and in the case of the MultiLayer cache, createLayer.
Adding one more "portable
" isn't unreasonable to avoid breaking any existing users. The case I'd want to check is when building a Docker image, are the RUN
commands considered by the JVM to be inside a container? If they are, we could default to being portable
when running in a docker container as well.
With this proposal, the high level flow becomes:
portable isn't specified ("Current host wins")

are the RUN commands considered by the JVM to be inside a container?
@ashu-mehra Do you know the answer to this question? If not could you please tell @harryyu1994 the API to detect container presence so that Harry could do a small experiment?
The user could also be generating the SCC/AOT on the side and simply copying the SCC file into the container. I've done that in the past in my small experiments, but I don't know whether that is common practice.
are the RUN commands considered by the JVM to be inside a container?
I believe so. For every RUN command a new temporary container is spawned.
Port library API to detect container presence is omrsysinfo_is_running_in_container [1], although it just returns the flag which actually gets set during port library startup [2].
[1] https://github.com/eclipse/omr/blob/cbc988cc7e82cd80a608492bc13e4ced64c744e1/port/unix/omrsysinfo.c#L5637 [2] https://github.com/eclipse/omr/blob/cbc988cc7e82cd80a608492bc13e4ced64c744e1/port/unix/omrsysinfo.c#L3081
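A minimal sketch of how the container check could feed into the proposed decision flow. The decision helper itself is hypothetical; the container flag would come from the omrsysinfo_is_running_in_container query mentioned above (its exact signature should be confirmed in omrport.h).

// Hypothetical sketch of the proposed decision flow. runningInContainer would
// come from the port library's omrsysinfo_is_running_in_container query.
static bool
shouldUsePortableProcessorDefaults(bool portableOptionSpecified, bool runningInContainer)
{
   // Explicit opt-in always wins
   if (portableOptionSpecified)
      return true;

   // Docker RUN commands execute in a temporary container, so an SCC built
   // during an image build would also get the portable defaults.
   if (runningInContainer)
      return true;

   // Otherwise "current host wins": target the host processor as today.
   return false;
}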
I am wondering whether portable should be a sub-option for -Xaot instead of -Xshareclasses, because portability is a characteristic of the AOT code, not the SCC as such, and the SCC can be used without AOT as well.
I am wondering whether portable should be a sub-option for -Xaot instead of -Xshareclasses, because portability is a characteristic of the AOT code, not the SCC as such, and the SCC can be used without AOT as well.
Users are already specifying -Xshareclasses for the kinds of portable docker environments we're discussing here, so I would prefer to keep this as a sub-option of -Xshareclasses rather than introducing them to a new, also complicated, -Xaot option.
I'm open to supporting a -XX:[+-]PortableSCC option in addition to -Xshareclasses:portable, given the better experience with -XX options due to them being ignored by default, which allows some minimal command line "compatibility" with older releases.
I am wondering whether portable should be a sub-option for -Xaot instead of -Xshareclasses, because portability is a characteristic of the AOT code, not the SCC as such, and the SCC can be used without AOT as well.
Part of portability might also involve restricting the compressedrefs shift value (see https://github.com/eclipse/openj9/issues/7965), or other points in https://github.com/eclipse/openj9/issues/7710. Therefore, while it's mainly the JIT that's affected by it, other JVM components need to be aware of a portable SCC.
Agreeing with where this discussion is going now, especially the part about being able to detect the case when we are in the build step for a container (if that ends up being feasible).
@zl-wang @gita-omr @mpirvu @vijaysun-omr I'm looking at enabling Portable AOT on Power. I have a few questions on how this may work on Power as its processor detection and compatibility check logic is a little bit different from x86 and Z. First I'll provide some background information:
The OMR::CPU::_processorDescription struct contains 2 pieces of information: 1. the type of the processor, and 2. a set of processor feature flags.

void
J9::Z::CPU::applyUserOptions()
{
if (_processorDescription.processor < OMR_PROCESSOR_S390_Z14)
{
omrsysinfo_processor_set_feature(&_processorDescription, OMR_FEATURE_S390_MISCELLANEOUS_INSTRUCTION_EXTENSION_2, FALSE);
omrsysinfo_processor_set_feature(&_processorDescription, OMR_FEATURE_S390_VECTOR_PACKED_DECIMAL, FALSE);
omrsysinfo_processor_set_feature(&_processorDescription, OMR_FEATURE_S390_VECTOR_FACILITY_ENHANCEMENT_1, FALSE);
omrsysinfo_processor_set_feature(&_processorDescription, OMR_FEATURE_S390_GUARDED_STORAGE, FALSE);
}
...
The code above disables OMR_FEATURE_S390_MISCELLANEOUS_INSTRUCTION_EXTENSION_2, OMR_FEATURE_S390_VECTOR_PACKED_DECIMAL, OMR_FEATURE_S390_VECTOR_FACILITY_ENHANCEMENT_1 and OMR_FEATURE_S390_GUARDED_STORAGE, which may be set by the host cpu.

bool
J9::Power::CPU::isCompatible(const OMRProcessorDesc& processorDescription)
{
OMRProcessorArchitecture targetProcessor = self()->getProcessorDescription().processor;
OMRProcessorArchitecture processor = processorDescription.processor;
// Backwards compatibility only applies to p4,p5,p6,p7 and onwards
// Looks for equality otherwise
if ((processor == OMR_PROCESSOR_PPC_GP
|| processor == OMR_PROCESSOR_PPC_GR
|| processor == OMR_PROCESSOR_PPC_P6
|| (processor >= OMR_PROCESSOR_PPC_P7 && processor <= OMR_PROCESSOR_PPC_LAST))
&& (targetProcessor == OMR_PROCESSOR_PPC_GP
|| targetProcessor == OMR_PROCESSOR_PPC_GR
|| targetProcessor == OMR_PROCESSOR_PPC_P6
|| targetProcessor >= OMR_PROCESSOR_PPC_P7 && targetProcessor <= OMR_PROCESSOR_PPC_LAST))
{
return targetProcessor >= processor;
}
return targetProcessor == processor;
}
bool
J9::X86::CPU::isCompatible(const OMRProcessorDesc& processorDescription)
{
for (int i = 0; i < OMRPORT_SYSINFO_FEATURES_SIZE; i++)
{
// Check to see if the current processor contains all the features that code cache's processor has
if ((processorDescription.features[i] & self()->getProcessorDescription().features[i]) != processorDescription.features[i])
return false;
}
return true;
}
Do you think we should take the x86 approach, where we manually define a set of processor features for the portable processor feature set, or take the Z approach, where we need logic to disable features when we downgrade to an older cpu?
In my opinion we should follow the Z approach: unlike x86, we have debug options on Power, just like on Z, that allow us to downgrade processors. On Z we downgrade the processor and then disable features accordingly, but on Power we only downgrade the processor. What's the reason behind not disabling features after downgrading processors?
What makes things more complicated on Power is that it seems we are currently only looking at the processor type for the compatibility check. Should we be looking at the processor feature flags instead?
On POWER, it should be similar to the Z approach: lower processor-type plus enabled features. For example, Transactional Memory -- it doesn't depend on the hardware strictly; it also depends on whether the OS enables it or not. You cannot determine its availability solely by processor-type.
The general principle of hardware is that later generations of CPU are compatible with earlier generations, ISA-wise, with very few exceptions between far-away generations (for deprecated instructions, e.g.).
On POWER, it should be similar to the Z approach: lower processor-type plus enabled features. For example, Transactional Memory -- it doesn't depend on the hardware strictly; it also depends on whether the OS enables it or not. You cannot determine its availability solely by processor-type.
The general principle of hardware is that later generations of CPU are compatible with earlier generations, ISA-wise, with very few exceptions between far-away generations (for deprecated instructions, e.g.).
Okay, so processor features are determined by both hardware and OS; this makes sense. Another question: is it true for Power that if the host environment (hardware + OS) contains all the processor features that the build environment has, then we can run the AOT code (from the build environment) on the host environment?
The processor feature set contained in processorDescription should be what's actually available and not what could be available based on the processor type. We should take the OS into account when we initialize the processor feature set. After that, we can just safely compare the processor feature sets similar to how we are doing it for x86. I'm hoping this works for Power.
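If Power did adopt the x86-style check, a feature-based isCompatible could look roughly like the x86 version quoted above. This is only a sketch of the idea, not the current Power implementation.

// Sketch only: a feature-based compatibility check for Power, mirroring the
// x86 approach quoted earlier in this thread.
bool
J9::Power::CPU::isCompatible(const OMRProcessorDesc& processorDescription)
{
   for (int i = 0; i < OMRPORT_SYSINFO_FEATURES_SIZE; i++)
      {
      // The host must provide every feature the relocatable code was built to use
      if ((processorDescription.features[i] & self()->getProcessorDescription().features[i]) != processorDescription.features[i])
         return false;
      }
   return true;
}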
Okay, so processor features are determined by both hardware and OS; this makes sense. Another question: is it true for Power that if the host environment (hardware + OS) contains all the processor features that the build environment has, then we can run the AOT code (from the build environment) on the host environment?
Yes, that is expected.
This issue is to discuss whether or not it makes sense to define a set of processor features the compiler should target when generating AOT code. The three general approaches we can take are:
The question of what a compiler like GCC does came up in the discussion. Looking online, my understanding is that by default GCC compiles for the target it itself was compiled to target:
GCC will only target the CPU it is running on if -march=native is specified [2].

[1] https://wiki.gentoo.org/wiki/GCC_optimization
[2] https://wiki.gentoo.org/wiki/Distcc#-march.3Dnative