
Portable SCC: Define Default Processor Features #7966

Open dsouzai opened 4 years ago

dsouzai commented 4 years ago

This issue is to discuss whether or not it makes sense to define a set of processor features the compiler should target when generating AOT code. The three general approaches we can take are:

  1. The compiler has a set of processor features defined for an AOT compile.
  2. The compiler targets whatever features are available on the machine it is running on (while turning off certain features such as TM).
  3. The compiler maintains a mapping of CPU version to CPU features; even though certain platforms don't have CPU features that are tied to any particular CPU version, the compiler can enforce such a mapping to simplify the user experience (rather than the user having to specify all the features they want).

The question of what a compiler like GCC does came up in the discussion. Looking online, my understanding is that by default GCC compiles for the target it itself was compiled to target:

The only difference in behavior between two GCC versions built targeting different sub-architectures is the implicit default argument for the -march parameter, which is derived from the GCC's CHOST when not explicitly provided in the command line. [1]

GCC will only target the CPU it is running on if -march=native is specified [2]


[1] https://wiki.gentoo.org/wiki/GCC_optimization
[2] https://wiki.gentoo.org/wiki/Distcc#-march.3Dnative

dsouzai commented 4 years ago

fyi @fjeremic @andrewcraik @gita-omr @knn-k

dsouzai commented 4 years ago

My personal preference is 1. Targeting the processor the SCC is generated on is what currently happens, but that puts the burden on the user to find an old enough machine. I also do not prefer having a mapping between CPU version and some set of CPU features, because it doesn't reflect the reality of which CPU features are available on which CPU version, and is a non-standard mapping that has to be maintained and documented.

DanHeidinga commented 4 years ago

I'm fine starting with a system that specifies all the features to target so we have something working.

From a usability perspective, it won't take long for us (and end users!) to need a way to compactly specify groups of features. If this is a quasi-mapping to processor type as proposed in option 3, great. Some other logical way to group features together is fine by me as well.

andrewcraik commented 4 years ago

I think grouping based on the hardware platform only makes sense when the platform itself defines such a logical grouping. For example, the aarch64 architecture defines groups of features as a package, and optional packages can be included or not in an implementation; grouping our queries to match these architectural groupings makes sense. On a platform like x86, where feature flags are used to determine feature support and support is not necessarily tied to a generation across manufacturers, trying to tie these features to a generation / hardware feature group makes less sense. As a result, option 3 is not the right general solution in my mind.

I'm not sure if 1 or 2 is right, or if we should just define the 'default' AOT configuration and provide documentation and help on how to enable/disable features. Logical groupings based on what the compiler is accelerating or similar might make sense, but artificial hardware groupings not reflected in the underlying hardware platform (e.g. mapping features to processor levels when the feature is not truly tied to the processor level) seem counterproductive.

I agree the usability story does need some consideration/attention. Given that a major use case is a docker image or similar, are there any container properties we could use to help figure out what the base config should be?

Another option might be something like the Xtrace and Xdump builders to help build an AOT hardware config?

DanHeidinga commented 4 years ago

How do logical groupings differ from the mtune/march settings of GCC?

We have to keep the goal in mind - making it easy for a user to create a portable SCC which includes AOT code that will be broadly applicable. Mapping features to logical groups, whether mtune/march-style or our own creation, gives users a reasonable way to control the baseline systems they want to target.

Users don't understand, and don't want to understand, the right set of flags to enable for the hardware they are deploying on. They will, at the most, know they are targeting "Haswell" processors, or "Skylake", or .... when they buy their instances from cloud providers. They just want their AOT code to work in that world, even if it's not the fastest they could get, as they don't control the hardware.

Another option might be something like the Xtrace and Xdump builders to help build an AOT hardware config?

This sounds a lot like having pre-built configurations :)

dsouzai commented 4 years ago

After taking a closer look at GCC's -march option, I don't think we should follow that approach. GCC does maintain a mapping between processor version and some set of processor features. However, GCC is widely used, and hence the mapping it maintains can be more or less considered standard. I would rather not define a new mapping that's only applicable to us. I also don't want to have to depend on GCC's mapping.

From a usability perspective, it won't take long for us (and end users!) to need a way to compactly specify groups of features.

I'm not convinced that's true. This is something we're only trying to define for AOT code. As you said above:

Users don't understand, and don't want to understand, the right set of flags to enable for the hardware they are deploying on.

They just want their AOT code to work in that world, even if it's not the fastest they could get, as they don't control the hardware.

I don't see why anyone would care whether the AOT-generated code is targeting an ivybridge machine even though the JVM is running on, say, a skylake, so long as they get the benefits.

Having a single definition makes it easier to document and makes it consistent no matter where the SCC is generated (portability being the main goal we're after here). JIT code is still going to target the machine the JVM is running on, so the idea here is the same as always: AOT code gets your app started fast, JIT recompilation gets your app's steady state performance fast.

The set of default features I'm thinking of shouldn't target something as old as, say, a core2duo or P4. We can pick some reasonable set of features that should exist on most machines today, and we can easily add downgrading logic to take care of what happens when some features don't.
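
To illustrate the kind of downgrading logic I mean, here is a minimal sketch: intersect the portable default with what the host actually supports, so the AOT target never includes a feature the current machine lacks. The struct and helper names here are illustrative, not OMR's actual types.

#include <stdint.h>

#define FEATURE_WORDS 4 /* illustrative size of the packed feature bitset */

typedef struct ProcessorFeatures {
    uint32_t bits[FEATURE_WORDS];
} ProcessorFeatures;

/* portableDefault: the fixed feature set AOT code targets by default.
 * host: the features actually detected on the machine we are running on. */
static ProcessorFeatures
downgradeToHost(ProcessorFeatures portableDefault, ProcessorFeatures host)
{
    ProcessorFeatures target;
    for (int i = 0; i < FEATURE_WORDS; i++) {
        /* Drop any default feature the host does not have. */
        target.bits[i] = portableDefault.bits[i] & host.bits[i];
    }
    return target;
}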

harryyu1994 commented 4 years ago

We now have the infrastructure to specify processor types on the fly for each compilation. It's time to decide on the actual set of portable AOT processor defaults for each platform. @dsouzai @vijaysun-omr @mpirvu @andrewcraik @fjeremic @gita-omr. Could you guys please have some discussion to get this going? Thanks!

Also, Marius suggested that we should be able to specify the processor via command line options.

harryyu1994 commented 4 years ago

Here's the list of processors:

/* List of all processors that are currently supported by OMR's processor detection */

typedef enum OMRProcessorArchitecture {

    OMR_PROCESSOR_UNDEFINED,
    OMR_PROCESSOR_FIRST,

    // 390 Processors
    OMR_PROCESSOR_S390_FIRST = OMR_PROCESSOR_FIRST,
    OMR_PROCESSOR_S390_UNKNOWN = OMR_PROCESSOR_S390_FIRST,
    OMR_PROCESSOR_S390_GP6,
    OMR_PROCESSOR_S390_Z10 = OMR_PROCESSOR_S390_GP6,
    OMR_PROCESSOR_S390_GP7,
    OMR_PROCESSOR_S390_GP8,
    OMR_PROCESSOR_S390_GP9,
    OMR_PROCESSOR_S390_Z196 = OMR_PROCESSOR_S390_GP9,
    OMR_PROCESSOR_S390_GP10,
    OMR_PROCESSOR_S390_ZEC12 = OMR_PROCESSOR_S390_GP10,
    OMR_PROCESSOR_S390_GP11,
    OMR_PROCESSOR_S390_Z13 = OMR_PROCESSOR_S390_GP11,
    OMR_PROCESSOR_S390_GP12,
    OMR_PROCESSOR_S390_Z14 = OMR_PROCESSOR_S390_GP12,
    OMR_PROCESSOR_S390_GP13,
    OMR_PROCESSOR_S390_Z15 = OMR_PROCESSOR_S390_GP13,
    OMR_PROCESSOR_S390_GP14,
    OMR_PROCESSOR_S390_ZNEXT = OMR_PROCESSOR_S390_GP14,
    OMR_PROCESSOR_S390_LAST = OMR_PROCESSOR_S390_GP14,

    // ARM Processors
    OMR_PROCESSOR_ARM_FIRST,
    OMR_PROCESSOR_ARM_UNKNOWN = OMR_PROCESSOR_ARM_FIRST,
    OMR_PROCESSOR_ARM_V6,
    OMR_PROCESSOR_ARM_V7,
    OMR_PROCESSOR_ARM_LAST = OMR_PROCESSOR_ARM_V7,

    // ARM64 / AARCH64 Processors
    OMR_PROCESSOR_ARM64_FISRT,
    OMR_PROCESSOR_ARM64_UNKNOWN = OMR_PROCESSOR_ARM64_FISRT,
    OMR_PROCESSOR_ARM64_V8_A,
    OMR_PROCESSOR_ARM64_LAST = OMR_PROCESSOR_ARM64_V8_A,

    // PPC Processors
    OMR_PROCESSOR_PPC_FIRST,
    OMR_PROCESSOR_PPC_UNKNOWN = OMR_PROCESSOR_PPC_FIRST,
    OMR_PROCESSOR_PPC_RIOS1,
    OMR_PROCESSOR_PPC_PWR403,
    OMR_PROCESSOR_PPC_PWR405,
    OMR_PROCESSOR_PPC_PWR440,
    OMR_PROCESSOR_PPC_PWR601,
    OMR_PROCESSOR_PPC_PWR602,
    OMR_PROCESSOR_PPC_PWR603,
    OMR_PROCESSOR_PPC_82XX,
    OMR_PROCESSOR_PPC_7XX,
    OMR_PROCESSOR_PPC_PWR604,
    // The following processors support SQRT in hardware
    OMR_PROCESSOR_PPC_HW_SQRT_FIRST,
    OMR_PROCESSOR_PPC_RIOS2 = OMR_PROCESSOR_PPC_HW_SQRT_FIRST,
    OMR_PROCESSOR_PPC_PWR2S,
    // The following processors are 64-bit implementations
    OMR_PROCESSOR_PPC_64BIT_FIRST,
    OMR_PROCESSOR_PPC_PWR620 = OMR_PROCESSOR_PPC_64BIT_FIRST,
    OMR_PROCESSOR_PPC_PWR630,
    OMR_PROCESSOR_PPC_NSTAR,
    OMR_PROCESSOR_PPC_PULSAR,
    // The following processors support the PowerPC AS architecture
    // PPC AS includes the new branch hint 'a' and 't' bits
    OMR_PROCESSOR_PPC_AS_FIRST,
    OMR_PROCESSOR_PPC_GP = OMR_PROCESSOR_PPC_AS_FIRST,
    OMR_PROCESSOR_PPC_GR,
    // The following processors support VMX
    OMR_PROCESSOR_PPC_VMX_FIRST,
    OMR_PROCESSOR_PPC_GPUL = OMR_PROCESSOR_PPC_VMX_FIRST,
    OMR_PROCESSOR_PPC_HW_ROUND_FIRST,
    OMR_PROCESSOR_PPC_HW_COPY_SIGN_FIRST = OMR_PROCESSOR_PPC_HW_ROUND_FIRST,
    OMR_PROCESSOR_PPC_P6 = OMR_PROCESSOR_PPC_HW_COPY_SIGN_FIRST,
    OMR_PROCESOSR_PPC_ATLAS,
    OMR_PROCESSOR_PPC_BALANCED,
    OMR_PROCESSOR_PPC_CELLPX,
    // The following processors support VSX
    OMR_PROCESSOR_PPC_VSX_FIRST,
    OMR_PROCESSOR_PPC_P7 = OMR_PROCESSOR_PPC_VSX_FIRST,
    OMR_PROCESSOR_PPC_P8,
    OMR_PROCESSOR_PPC_P9,
    OMR_PROCESSOR_PPC_LAST = OMR_PROCESSOR_PPC_P9,

    // X86 Processors
    OMR_PROCESSOR_X86_FIRST,
    OMR_PROCESSOR_X86_UNKNOWN = OMR_PROCESSOR_X86_FIRST,
    OMR_PROCESSOR_X86_INTEL_FIRST,
    OMR_PROCESSOR_X86_INTELPENTIUM = OMR_PROCESSOR_X86_INTEL_FIRST,
    OMR_PROCESSOR_X86_INTELP6,
    OMR_PROCESSOR_X86_INTELPENTIUM4,
    OMR_PROCESSOR_X86_INTELCORE2,
    OMR_PROCESSOR_X86_INTELTULSA,
    OMR_PROCESSOR_X86_INTELNEHALEM,
    OMR_PROCESSOR_X86_INTELWESTMERE,
    OMR_PROCESSOR_X86_INTELSANDYBRIDGE,
    OMR_PROCESSOR_X86_INTELIVYBRIDGE,
    OMR_PROCESSOR_X86_INTELHASWELL,
    OMR_PROCESSOR_X86_INTELBROADWELL,
    OMR_PROCESSOR_X86_INTELSKYLAKE,
    OMR_PROCESSOR_X86_INTEL_LAST = OMR_PROCESSOR_X86_INTELSKYLAKE,
    OMR_PROCESSOR_X86_AMD_FIRST,
    OMR_PROCESSOR_X86_AMDK5 = OMR_PROCESSOR_X86_AMD_FIRST,
    OMR_PROCESSOR_X86_AMDK6,
    OMR_PROCESSOR_X86_AMDATHLONDURON,
    OMR_PROCESSOR_X86_AMDOPTERON,
    OMR_PROCESSOR_X86_AMDFAMILY15H,
    OMR_PROCESSOR_X86_AMD_LAST = OMR_PROCESSOR_X86_AMDFAMILY15H,
    OMR_PROCESSOR_X86_LAST = OMR_PROCESSOR_X86_AMDFAMILY15H,

    OMR_PROCESOR_RISCV32_UNKNOWN,
    OMR_PROCESOR_RISCV64_UNKNOWN,

    OMR_PROCESSOR_DUMMY = 0x40000000 /* force wide enums */

} OMRProcessorArchitecture;

Refer to omr/include_core/omrport.h for the feature flags.
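
For context on how the flags work: each OMR_FEATURE_* constant is a bit index into a small array of 32-bit words, so feature tests and updates reduce to bit arithmetic. A minimal sketch of that scheme (the helper names and array size are illustrative; see omrport.h for the real definitions):

#include <stdint.h>

#define FEATURES_SIZE 5 /* illustrative; see OMRPORT_SYSINFO_FEATURES_SIZE in omrport.h */

/* Test whether a feature (a bit index) is present in the packed feature words. */
static int
hasFeature(const uint32_t features[FEATURES_SIZE], uint32_t feature)
{
    return (features[feature / 32] & (1u << (feature % 32))) != 0;
}

/* Mark a feature as present. */
static void
setFeature(uint32_t features[FEATURES_SIZE], uint32_t feature)
{
    features[feature / 32] |= (1u << (feature % 32));
}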

mpirvu commented 4 years ago

I am thinking that being able to select the features for AOT through command line options is still important. In some instances, the IT people may know that the JVM is not going to run on machines older than X (pick your architecture) and may want to target that architecture as the baseline. Therefore I am in favor of supporting the logical grouping @DanHeidinga mentioned. To get this off the ground, we could start with a single grouping and gradually add new groupings targeting newer architectures.

vijaysun-omr commented 4 years ago

@harryyu1994 just so I'm clear on what you are expecting when you said "It's time to decide on the actual set of portable AOT processor defaults for each platform...": did you mean we should pick the default processor for each platform from the lists that you pasted in your last comment?

One approach could be to pick some processor that is reasonably old, such that a large proportion of users can reasonably be expected to have something newer than that, and then force the codegen to assume that processor type and see how much of a regression you get from handicapping the codegen in this way, before deciding if we should go ahead or not. Is this the approach you were also thinking of, and if so, were you essentially looking for someone familiar with the different codegens to make a processor suggestion for their platform?

harryyu1994 commented 4 years ago

@harryyu1994 just so I'm clear on what you are expecting when you said "It's time to decide on the actual set of portable AOT processor defaults for each platform...": did you mean we should pick the default processor for each platform from the lists that you pasted in your last comment?

Yes, we should pick a default processor for each platform from the list I pasted, as well as default features (for x86).

One approach could be to pick some processor that is reasonably old, such that a large proportion of users can reasonably be expected to have something newer than that, and then force the codegen to assume that processor type and see how much of a regression you get from handicapping the codegen in this way, before deciding if we should go ahead or not. Is this the approach you were also thinking of, and if so, were you essentially looking for someone familiar with the different codegens to make a processor suggestion for their platform?

Yes, I'm looking for processor suggestions from people.

mpirvu commented 4 years ago

For x86 I am proposing OMR_PROCESSOR_X86_INTELSANDYBRIDGE to be the baseline for relocatable code. It's a 9-year-old architecture that has AVX and AES instructions. If at all possible I would like this baseline to work on both Intel and AMD processors, as we start to see more and more AMD EPYC instances in the cloud.

vijaysun-omr commented 4 years ago

Sounds reasonable to me, though I guess the true test will be a performance run to see how much we lose by assuming this older level of architecture on a current machine, e.g. Skylake.

harryyu1994 commented 4 years ago

For Z and Power, just a single processor type would be sufficient, as the feature flags are set based on the processor type. For x86, we need the set of feature flags as well. (The processor type may not matter that much.)

We need to come up with a mapping of processor type to feature flags for x86.

Note to self: I need to watch out for the few instances where the processor type does matter on x86, and also look into whether it's possible for the baseline to work on both Intel and AMD.

mpirvu commented 4 years ago

These are the flags listed for my machine, which uses ivybridge CPUs:

Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm arat pln pts md_clear spec_ctrl intel_stibp flush_l1d

I am reading that ivybridge added the rdrand and F16C instructions on top of sandybridge, so we should exclude those from the list above. We should care only about the set of flags that the optimizer is trying to exploit, though.

harryyu1994 commented 4 years ago

We should care only about the set of the flags that the optimizer is trying to exploit though.

In my upcoming changes:

// Only enable the features that the compiler currently uses
uint32_t enabledFeatures[] = {OMR_FEATURE_X86_FPU, OMR_FEATURE_X86_CX8, OMR_FEATURE_X86_CMOV,
                              OMR_FEATURE_X86_MMX, OMR_FEATURE_X86_SSE, OMR_FEATURE_X86_SSE2,
                              OMR_FEATURE_X86_SSSE3, OMR_FEATURE_X86_SSE4_1, OMR_FEATURE_X86_POPCNT,
                              OMR_FEATURE_X86_AESNI, OMR_FEATURE_X86_OSXSAVE, OMR_FEATURE_X86_AVX,
                              OMR_FEATURE_X86_FMA, OMR_FEATURE_X86_HLE, OMR_FEATURE_X86_RTM};

We maintain this array containing all the features that the optimizer tries to exploit. We will mask out all the features that we don't care about.
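
Concretely, the masking could look something like the following sketch, reusing the hasFeature/setFeature helpers sketched in my earlier comment (all names illustrative). It would be called as maskToExploitedFeatures(hostFeatures, enabledFeatures, sizeof(enabledFeatures) / sizeof(enabledFeatures[0])).

/* Build a mask containing only the features the optimizer exploits, then
 * intersect the host's detected features with it; anything else is "don't care". */
static void
maskToExploitedFeatures(uint32_t hostFeatures[FEATURES_SIZE],
                        const uint32_t *exploited, int count)
{
    uint32_t mask[FEATURES_SIZE] = {0};
    for (int i = 0; i < count; i++) {
        setFeature(mask, exploited[i]); /* each entry is a feature bit index */
    }
    for (int i = 0; i < FEATURES_SIZE; i++) {
        hostFeatures[i] &= mask[i];
    }
}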

harryyu1994 commented 4 years ago

Had some offline discussion with Marius; here are some notes:

  1. Check user command line options for the default processor they want
  2. If AOT is already available in the SCC, we want to use the processor in the SCC instead of what the user specifies. Output a warning message to inform the user that their processor wasn't used.
  3. We can potentially produce AOT code with a newer processor than the host. (In this case, we can produce code but we can't run it on the host.) The question here is whether we should disable or allow this.
mpirvu commented 4 years ago

These are the features present in enabledFeatures[] that a sandybridge architecture does not have:

OMR_FEATURE_X86_OSXSAVE -> OS has enabled XSETBV/XGETBV instructions to access XCR0
OMR_FEATURE_X86_FMA -> FMA extensions using YMM state
OMR_FEATURE_X86_HLE -> Hardware lock elision
OMR_FEATURE_X86_RTM -> Restricted transactional memory
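
If we take that list at face value, the portable x86 default would be the exploited set above minus those four entries; as a sketch (the array name is hypothetical; the OMR_FEATURE_* constants come from omr/include_core/omrport.h, and AVX is kept per my earlier sandybridge comment):

/* Hypothetical portable default for x86: the features the compiler exploits,
 * minus the four that a sandybridge baseline cannot guarantee. */
uint32_t portableDefaultFeatures[] = {OMR_FEATURE_X86_FPU, OMR_FEATURE_X86_CX8,
                                      OMR_FEATURE_X86_CMOV, OMR_FEATURE_X86_MMX,
                                      OMR_FEATURE_X86_SSE, OMR_FEATURE_X86_SSE2,
                                      OMR_FEATURE_X86_SSSE3, OMR_FEATURE_X86_SSE4_1,
                                      OMR_FEATURE_X86_POPCNT, OMR_FEATURE_X86_AESNI,
                                      OMR_FEATURE_X86_AVX};
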
vijaysun-omr commented 4 years ago

@harryyu1994 for your 3rd "note", did you mean that

a) the default processor level is newer than the host, i.e. the host is some really old machine, OR
b) the user can specify an option to produce AOT code for a newer processor than the host

And does "disabling" mean silently not generating AOT code in that scenario, OR something like reporting a usage error of some sort?

DanHeidinga commented 4 years ago

2. If AOT is already available in the SCC, we want to use the processor in the SCC instead of what the user specifies. Output a warning message to inform the user that their processor wasn't used.

I'm reading this to mean there is one processor defined for the SCC. Does it make sense to allow different layers of a multi-layer SCC to define a different, more restrictive (i.e. newer) processor level?

We should agree on whether this is a desirable principle rather than worry about the details now.

harryyu1994 commented 4 years ago

@harryyu1994 for your 3rd "note", did you mean that

a) the default processor level is newer than the host, i.e. the host is some really old machine, OR
b) the user can specify an option to produce AOT code for a newer processor than the host

And does "disabling" mean silently not generating AOT code in that scenario, OR something like reporting a usage error of some sort?

@vijaysun-omr I meant b). I was thinking about reporting a usage error to the user. Would we ever have a use case where the user only wants to generate AOT code for a certain processor level? So basically preparing the SCC for others.

I'm reading this to mean there is one processor defined for the SCC. Does it make sense to allow different layers of a multi-layer SCC to define a different, more restrictive (i.e. newer) processor level?

We should agree on whether this is a desirable principle rather than worry about the details now.

@DanHeidinga

My understanding of the multi-layer cache is that it's for storing SCCs in docker images. Basically each layer of the docker image will want to have its own SCC. So for multi-layers I was thinking something like this: The base layer will have the least restrictive processor and the outermost layer will have the most restrictive processor. When we add another layer, as long as it's equivalent or more restrictive than the layer below it we are going to allow it. Or maybe everything in docker should use the lowest possible processor settings to maximize portability.
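
As a rough sketch of that rule (hypothetical helper; packed feature words as in my earlier sketches): a layer may only be added if its target feature set is a superset of the layer below it.

/* A new layer is acceptable only if it is equally or more restrictive than
 * the layer below, i.e. it targets every feature the lower layer targets. */
static int
canAddLayer(const uint32_t lowerLayer[FEATURES_SIZE],
            const uint32_t newLayer[FEATURES_SIZE])
{
    for (int i = 0; i < FEATURES_SIZE; i++) {
        if ((lowerLayer[i] & ~newLayer[i]) != 0) {
            return 0; /* the new layer drops a feature the lower layer targets */
        }
    }
    return 1;
}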

What I meant in the original notes was a different scenario (so basically only considering the current outermost layer): we already have an SCC that the JVM can run with. Then if the user wants to change it to use a different processor, maybe we want to reject that operation? Or maybe we want to treat it as an overwrite operation where we ditch the original AOT code in the SCC and generate new code. (Not sure if this is feasible; the user could just delete the SCC and generate a new one.)

DanHeidinga commented 4 years ago

We already have an SCC that the JVM can run with. Then if the user wants to change it to use a different processor, maybe we want to reject that operation? Or maybe we want to treat it as an overwrite operation where we ditch the original AOT code in the SCC and generate new code. (Not sure if this is feasible; the user could just delete the SCC and generate a new one.)

Another approach would be to associate the AOT code with the processor it requires. This would allow mixing AOT with different processor requirements in the same cache. Not terribly useful when running on a host system, but possibly more useful when a cache is being shipped around in Docker or other ways.

vijaysun-omr commented 4 years ago

In relation to comment: https://github.com/eclipse/openj9/issues/7966#issuecomment-634234270

I feel it is okay to report a usage error in the case where a user specifies an option to produce AOT code for a newer processor than the host. If this functionality is deemed important in the future, it can be added at that time, but I don't see the need to do this work now.

vijaysun-omr commented 4 years ago

I do see merit in @harryyu1994 comment "The base layer will have the least restrictive processor and the outermost layer will have the most restrictive processor. When we add another layer, as long as it's equivalent or more restrictive than the layer below it we are going to allow it. ", i.e. philosophically this may be something to allow.

In practical/implementation terms, I wonder if this is a use case that we support down the line rather than get bogged down with at present.

DanHeidinga commented 4 years ago

I do see merit in @harryyu1994 comment "The base layer will have the least restrictive processor and the outermost layer will have the most restrictive processor. When we add another layer, as long as it's equivalent or more restrictive than the layer below it we are going to allow it. ", i.e. philosophically this may be something to allow.

In practical/implementation terms, I wonder if this is a use case that we support down the line rather than get bogged down with at present.

How usable is this feature without this?

My mental model is that the docker image may be built up by different parts of the CI at different times on different machines (lots of variability in that process)!

A user may pull an Adopt-created docker image with a default cache in it and then, in their own CI, build a common framework layer with a new cache layer. Finally, each app may reuse that image and add their own classes into the cache.

If all three of those operations happen on different machines, we need to either "pin" the processor level to the one created in the JDK base image (i.e. the Adopt layer) or allow each layer to add further restrictions.

A user doesn't want to have a bigger docker image due to AOT code they can't execute.

Oh, and we'll need some API, maybe in the SCC print-stats option, to tell the current processor level of the cache so later layers can specify exactly the same one.

vijaysun-omr commented 4 years ago

I thought the proposal was still to settle on some reasonably old arch level (e.g. ivybridge on X86) such that "by default" the risk of adding AOT code that would not run is fairly low at any layer of the SCC.

This default arch level only changes if/when a user explicitly starts upgrading the arch version by specifying an option because they know something about the servers being targeted, but this wasn't expected to be particularly common in practice (at least that is what I thought).

dsouzai commented 4 years ago

I think there are many ways we can improve the usability of the SCC in the Docker situation. However, it is going to be a non-trivial amount of work. Therefore, I think the approach we should take is:

  1. Limit AOT to some "old"-ish arch level. This will make AOT very portable right now.
  2. Add an option to have the AOT target be the same as the JIT target
  3. Introduce the ability for the user to provide the arch they wish AOT to target.
  4. Introduce AOT Headers per SCC layer

1 and 2 will solve the immediate limitations we have right now. 3 will facilitate the ability to do 4. Once we have all 4 of these in place, then we can do what @DanHeidinga would like to see, namely different arch targets at different layers of the SCC.

~The risk of different arch targets at different layers of the SCC means that depending on where you're running, the JVM might not be able to use all the code in the SCC.~ Dan addressed this above:

If all three of those operations happen on different machines, we need to either "pin" the processor level to the one created in the JDK base image (i.e. the Adopt layer) or allow each layer to add further restrictions.

Regarding:

The base layer will have the least restrictive processor and the outermost layer will have the most restrictive processor.

This would be relatively trivial to do once we have 1-4 above implemented.

DanHeidinga commented 4 years ago

That sounds like a reasonable approach.

mpirvu commented 4 years ago

To make sure we agree on what is going to be implemented as a first step, I am detailing the workflow for (1) and (2):

  1. Define a default processor architecture for AOT, let's call it P2
  2. Implement option to override P2 and use the architecture for the host processor (to be able to revert to today's behavior)
  3. If user generates AOT on a P2 arch or newer (P3, P4, ...), the AOT code will target P2 (common case)
  4. If user generates AOT on a P1 arch (older)
     4.1 If SCC is empty, fail the JVM and ask user to either
         4.1.1 Use the option from step (2) to generate code for current host, or
         4.1.2 Use a more recent CPU (at least at P2 level)
     4.2 If SCC already contains some AOT
         4.2.1 If existing AOT is at P1 level, generate AOT at P1 level
         4.2.2 If existing AOT is at P2 or newer level, fail the JVM. Tell the user to use a host with a processor at P2 or higher level.

At step 4.1 it's debatable whether to fail the JVM with an error message (like I proposed) or silently compile for the current host. I feel that the latter solution, while improving consumability, may create silent performance problems.
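
For concreteness, a sketch of this decision logic (all names are hypothetical, and step 4.1's fail-vs-compile-for-host choice is the debatable part):

/* Sketch of the proposed first-step workflow; P2 is the portable default. */
typedef enum { HOST_OLDER_THAN_P2, HOST_AT_LEAST_P2 } HostLevel;

/* Returns 0 if the JVM may proceed, nonzero if it should fail with a message. */
static int
chooseAotTarget(HostLevel host, int useHostOverride, int sccHasAot, HostLevel existingAotLevel)
{
    if (useHostOverride)
        return 0; /* step 2: revert to today's behavior, target the host */
    if (host == HOST_AT_LEAST_P2)
        return 0; /* step 3: target P2 (the common case) */
    if (!sccHasAot)
        return 1; /* step 4.1: fail; ask for the override or a P2-level host */
    if (existingAotLevel == HOST_OLDER_THAN_P2)
        return 0; /* step 4.2.1: keep generating at the existing P1 level */
    return 1;     /* step 4.2.2: fail; existing AOT needs a P2-or-newer host */
}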

dsouzai commented 4 years ago

4.1 is the only point that concerns me, but I can see an argument for either direction, so I'm ok with whichever choice we take.

DanHeidinga commented 4 years ago

Writing this out explicitly as I had to think about it for a while before coming to the conclusion that Marius's workflow is the right one.

With AOT in the SCC, there are two distinct use cases: generating and consuming the cache on the same local machine, and generating the cache on one machine (e.g. during a docker image build) to be consumed on many other machines.

The proposed workflow makes the deployment easier for the second case by making portability the default. It trades performance (tbd) in the local machine case, which can be restored by an option, for the easier portability.

Marius's flow, at a high level, is:

4.1 If SCC is empty, fail the JVM and ask user to either

This is the right default as it's easier to make an error case legal in the future than to turn a legal case into an error.

In the future, we'll need to work through how this changes when specifying an explicit processor level different from the portability processor or the current host.

vijaysun-omr commented 4 years ago

Isn't step 4.1, when running an SCC on a local machine (i.e. not for packaging into docker), turning something that used to be legal into an error condition?

mpirvu commented 4 years ago

Isn't step 4.1, when running an SCC on a local machine (i.e. not for packaging into docker), turning something that used to be legal into an error condition?

Yes, which is why this point is contentious. The scenario it tries to avoid is the following:

This problem would not appear if we had an automatic way of figuring out whether the SCC is for the local machine or for a container.

vijaysun-omr commented 4 years ago

I'll mention this since we are still designing the solution here and see what you think.

Rather than break existing users who are building an SCC on a very old local machine, should we consider an option (say -Xshareclasses:portable) that, when specified, would enable the portable SCC logic?

The advantage of this would be that we do not break existing users in the local machine scenario, and for those building an SCC into their containers (a relatively new use case that we are supporting) it becomes recommended (but not mandatory) to run with the "portable" option in the build step. If they did not specify the "portable" option, they get a non-portable SCC with the set of problems that we have today before Harry's work; i.e. it does not make any existing use case worse, while still offering a way to improve the use case that is currently problematic, namely building an SCC into the container image.

If we went this route, we would obviously specify the "portable" option when we are building our own SCC in the OpenJ9 docker image and also in any other cases we are aware of where an SCC is being packaged into a docker image, e.g. Open Liberty.

DanHeidinga commented 4 years ago

Rather than break existing users who are building an SCC on a very old local machine, should we consider an option (say -Xshareclasses:portable) that, when specified, would enable the portable SCC logic?

It depends on how old the default processor we pick will be. How many users (rough estimate) will be affected by this change?

I started to write out an argument about why needing a sub-option is a worse user experience and then had to catch myself, because the SCC already requires a lot of sub-options: usually cache name, cachedir, sizes, and in the case of the MultiLayer cache, createLayer.

Adding one more "portable" isn't unreasonable to avoid breaking any existing users. The case I'd want to check is when building a Docker image, are the RUN commands considered by the JVM to be inside a container? If they are, we could default to being portable when running in a docker container as well.

With this proposal, the high level flow becomes:

mpirvu commented 4 years ago

are the RUN commands considered by the JVM to be inside a container?

@ashu-mehra Do you know the answer to this question? If not, could you please tell @harryyu1994 the API to detect container presence so that Harry could do a small experiment?

The user could also be generating the SCC/AOT on the side and simply copying the SCC file into the container. I've done that in the past in my small experiments, but I don't know whether that is common practice.

ashu-mehra commented 4 years ago

are the RUN commands considered by the JVM to be inside a container?

I believe so. For every RUN command a new temporary container is spawned.

The port library API to detect container presence is omrsysinfo_is_running_in_container [1], although it just returns a flag that actually gets set during port library startup [2].


[1] https://github.com/eclipse/omr/blob/cbc988cc7e82cd80a608492bc13e4ced64c744e1/port/unix/omrsysinfo.c#L5637
[2] https://github.com/eclipse/omr/blob/cbc988cc7e82cd80a608492bc13e4ced64c744e1/port/unix/omrsysinfo.c#L3081
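
A sketch of how the portability default could key off that query (the helper and both flags are hypothetical; the container flag would come from the omrsysinfo_is_running_in_container call above):

#include <stdbool.h>

/* Default to a portable SCC when the user asked for it explicitly, or when we
 * detect we are inside a container (e.g. a docker build RUN step). */
static bool
shouldCreatePortableScc(bool portableOptionSpecified, bool runningInContainer)
{
    return portableOptionSpecified || runningInContainer;
}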

ashu-mehra commented 4 years ago

I am wondering whether portable should be a sub-option for -Xaot instead of -Xshareclasses, because portability is a characteristic of the AOT code, not of the SCC as such, and the SCC can be used without AOT as well.

DanHeidinga commented 4 years ago

I am wondering whether portable should be a sub-option for -Xaot instead of -Xshareclasses, because portability is a characteristic of the AOT code, not of the SCC as such, and the SCC can be used without AOT as well.

Users are already specifying -Xshareclasses for the kinds of portable docker environments we're discussing here, so I would prefer to keep this as a sub-option of -Xshareclasses rather than introducing them to a new, also complicated, -Xaot option.

I'm open to supporting a -XX:[+-]PortableSCC option in addition to -Xshareclasses:portable, given the better experience with -XX options: unrecognized ones are ignored by default, which allows some minimal command line "compatibility" with older releases.

dsouzai commented 4 years ago

I am wondering whether portable should be a sub-option for -Xaot instead of -Xshareclasses, because portability is a characteristic of the AOT code, not of the SCC as such, and the SCC can be used without AOT as well.

Part of portability might also involve restricting the compressedrefs shift value (see https://github.com/eclipse/openj9/issues/7965), or other points in https://github.com/eclipse/openj9/issues/7710. Therefore, while it's mainly the JIT that's affected by it, other JVM components need to be aware of a portable SCC.

vijaysun-omr commented 4 years ago

Agreeing with where this discussion is going now, especially the part about being able to detect the case when we are in the build step for a container (if that ends up being feasible).

harryyu1994 commented 4 years ago

@zl-wang @gita-omr @mpirvu @vijaysun-omr I'm looking at enabling Portable AOT on Power. I have a few questions on how this may work on Power, as its processor detection and compatibility check logic is a little bit different from x86 and Z. First, I'll provide some background information:

Background

How processor detection works now:

How Portable AOT works on x86:

How Portable AOT works on Z:

Questions

zl-wang commented 4 years ago

Do you think we should take the x86 approach, where we manually define a set of processor features for the portable processor feature set, or take the Z approach, where we need logic to disable features when we downgrade to an older CPU?

In my opinion we should follow the Z approach; unlike x86, we have debug options on Power, just like Z, that allow us to downgrade processors. On Z we downgrade the processor and then disable features accordingly, but on Power we only downgrade the processor. What's the reason behind not disabling features after downgrading processors?

What makes things more complicated on Power is that it seems we currently only look at the processor type for the compatibility check. Should we be looking at the processor feature flags instead?

On POWER, it should be similar to the Z approach: lower processor-type plus enabled features. For example, Transactional Memory -- it doesn't strictly depend on the hardware; it also depends on whether the OS enables it or not. You cannot determine its availability solely by processor-type.

The general principle of hardware is that later generations of a CPU are compatible with earlier generations, ISA-wise, except for very few exceptions between far-apart generations (e.g. for deprecated instructions).

harryyu1994 commented 4 years ago

On POWER, it should be similar to the Z approach: lower processor-type plus enabled features. For example, Transactional Memory -- it doesn't strictly depend on the hardware; it also depends on whether the OS enables it or not. You cannot determine its availability solely by processor-type.

The general principle of hardware is that later generations of a CPU are compatible with earlier generations, ISA-wise, except for very few exceptions between far-apart generations (e.g. for deprecated instructions).

Okay, so processor features are determined by both the hardware and the OS; this makes sense. Another question: is it true for Power that if the host environment (hardware + OS) contains all the processor features that the build environment has, then we can run the AOT code (from the build environment) on the host environment?

harryyu1994 commented 4 years ago

The processor feature set contained in the processorDescription should be what's actually available, not what could be available based on the processor type. We should take the OS into account when we initialize the processor feature set. After that, we can safely compare processor feature sets, similar to how we do it for x86. I'm hoping this works for Power.
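
If that holds, the compatibility check reduces to a subset test over the packed feature words; roughly (same illustrative representation as my earlier sketches):

/* AOT code from the build environment can run on this host only if the host's
 * feature set is a superset of the build environment's. Sketch only. */
static int
aotIsCompatibleWithHost(const uint32_t buildFeatures[FEATURES_SIZE],
                        const uint32_t hostFeatures[FEATURES_SIZE])
{
    for (int i = 0; i < FEATURES_SIZE; i++) {
        if ((buildFeatures[i] & ~hostFeatures[i]) != 0) {
            return 0; /* the build used a feature this host lacks */
        }
    }
    return 1;
}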

zl-wang commented 4 years ago

Okay, so processor features are determined by both the hardware and the OS; this makes sense. Another question: is it true for Power that if the host environment (hardware + OS) contains all the processor features that the build environment has, then we can run the AOT code (from the build environment) on the host environment?

Yes, that is expected.