WASdev / ci.docker

Build scripts for Docker images (Dockerfiles) and Docker related utilities for WebSphere Liberty.
Apache License 2.0

Startup Times not as expected when building on top of websphere-liberty:kernel #387

Closed. jdmcclur closed this issue 3 years ago.

jdmcclur commented 3 years ago

@ymanton is looking into this.

If I build an image on top of websphere-liberty:kernel and run configure.sh, the resulting startup time for the container is not as expected (not what we see with openliberty:kernel-slim). Something seems wrong with the SCC layers with IBM JDK, although they seem to build ok. Doing RUN configure.sh twice helps as a workaround.

ymanton commented 3 years ago

I've been trying lots of experiments, most of them failing to shed light on the problem. The most interesting one thus far is this:

RUN configure.sh
RUN configure.sh

improves startup time as @jdmcclur noted, but

RUN configure.sh; configure.sh

doesn't. Those two are functionally equivalent except at the Docker/filesystem layer. There's no obvious reason why one would perform better than the other as far as the app server and JVM are concerned.

vijaysun-omr commented 3 years ago

@mpirvu fyi so you can follow along

ymanton commented 3 years ago

I don't think this has anything to do with SCCs or AOT. I conducted the following experiments to arrive at this conclusion:

  1. Run configure once as usual, then run the server again with -Xnoaot to create a new SCC layer that does not have any AOT code or hints. Result: fast startup. So AOT code and hints in the extra layer were not contributing to improving startup.

    RUN configure.sh
    RUN IBM_JAVA_OPTIONS="-XX:+OriginalJDK8HeapSizeCompatibilityMode -Xshareclasses:name=liberty,nonfatal,cacheDir=/output/.classCache/,createLayer -Xnoaot" server start && server stop
  2. Extract the 3 SCC layers from the above image and run them against a regular (slow) image. Result: slow startup. The contents of the "fast" SCC don't improve startup.

  3. Extract the 2 SCC layers from a regular (slow) image and run them against an image known to start fast. Result: fast startup. The contents of the "slow" SCC don't degrade startup.

Hence, the SCC doesn't appear to be a factor.

Profiling container startup shows me that everything that typically runs during startup (JIT, VM, JIT compiled code, GC, etc) collects more ticks in the slow scenario compared to the fast scenario, so there's no smoking gun yet.

vijaysun-omr commented 3 years ago

That is puzzling. My understanding is that Open Liberty does not suffer from this issue. Is there something to be learned from comparing how the image is built in that case, to see if that sheds some light on this WAS Liberty problem?

vijaysun-omr commented 3 years ago

Another option is to check the messages emanating from Liberty (enable extra tracing if needed) in the fast and slow cases and compare, e.g. to see if we are somehow loading more bundles/classes in the slow case. This could then show up as "more time in everything" (as you seem to be observing) simply because there is more work to do.

jdmcclur commented 3 years ago

I did some comparing to open-liberty, which seems to run fine. open-liberty does things a little differently: it uses a features.sh script, separate from configure.sh, to install features, while websphere-liberty calls installUtility inside configure.sh.

RUN features.sh
RUN configure.sh

I found that when I separate out the installUtility call into its own step, or even run it before configure.sh (so installUtility still runs during configure.sh, but doesn't install any new features), startup is as expected.

RUN installUtility install --acceptLicense defaultServer
RUN configure.sh

Also, I found that in open-liberty if I put the features.sh call into configure.sh, I see the same problem (slow startup), so this is not specific to websphere-liberty.

Installing the features in the same layer in which the SCC is generated seems to be the issue? I don't know why yet, nor why it only affects ibmjava and not openj9.

vijaysun-omr commented 3 years ago

Do you see any difference wrt the feature installation if you disabled shared classes (-Xshareclasses:none) or disabled AOT (-Xnoaot)? I.e., is there still an SCC connection? One thing to check would be whether the AOT code gets used or not, via a -Xjit:verbose,vlog option passed to the server in the slow case (compared to the fast case).

jdmcclur commented 3 years ago

Vijay, do you mean running installUtility/featureUtility to download the features? Those run with -Xshareclasses:none, so they are not affecting the SCC (this used to be a problem, but was fixed a while ago).

vijaysun-omr commented 3 years ago

No, I just meant: if you commented out the code in configure.sh that populates the SCC and repeated the steps (around how you installed features in the fast vs. slow case) that you mentioned in your prior comment, do you still see a regression in startup time? I.e., do we believe the regression still has some connection to the SCC step in configure.sh?

ymanton commented 3 years ago

Open Liberty is built the same way as far as the SCC is concerned. The differences between the two are in their respective implementations as @jdmcclur noted. There's not much to suggest that this startup delta has anything to do with the SCC, rather the program is behaving differently.

I think @vijaysun-omr was on the right track earlier, more work is being done. Part of the server state is persistent and stored on disk (particularly in the workarea directory). Every time the server runs it accesses and makes changes to the disk. We don't clean up the workarea during image build so those changes get baked into the Docker image, which means that we're not really comparing startup from the same state. Based on the contents of the workarea directory I see evidence of caches and jar files and such, so it's likely that the state of this directory has a direct impact on startup. If I delete this directory at the end of the image build process startup is worse but the difference between images disappears.

I ran strace during startup and compared the output for a slow (default) build and a fast (using one of the above workarounds) build and it clearly shows that in the slow case the process is accessing more application files in the /opt directory. The timestamps in the strace output let us compare each of the startups in detail and we can see that in the slow case we actually are ahead of the pace but start to slow down when the process starts opening more files.

The "files opened" column is a cumulative count of syscalls to open files in the /opt directory, which is where the server is installed. It doesn't necessarily mean that the files are actually accessed, just that the process is attempting to open them. The key point is that these attempts are driven by the program itself, which points to fundamentally different behaviour.

| server msg | Fast image: timestamp | Fast image: files opened | Slow image: timestamp | Slow image: files opened |
|---|---:|---:|---:|---:|
| [AUDIT ] CWWKE0001I: The server defaultServer has been launched. | 0.94051 | 235 | 0.917884 | 235 |
| [AUDIT ] CWWKE0100I: This product is licensed for development | 0.991712 | 238 | 0.935881 | 238 |
| [AUDIT ] CWWKG0093A: Processing configuration drop-ins... | 1.94067 | 301 | 2.91578 | 493 |
| [AUDIT ] CWWKG0102I: Found conflicting settings for... | 2.47598 | 352 | 7.88135 | 2009 |
| [AUDIT ] CWWKZ0058I: Monitoring dropins for applications. | 3.01314 | 391 | 8.50446 | 2078 |
| [AUDIT ] CWWKT0016I: Web application ... /openapi/ui/ | 6.32191 | 909 | 11.6998 | 2373 |
| [AUDIT ] CWWKT0016I: Web application ... /ibm/api/ | 6.32431 | 909 | 11.7024 | 2374 |
| [AUDIT ] CWWKT0016I: Web application ... /openapi/ | 6.32657 | 910 | 11.7229 | 2377 |
| [AUDIT ] CWWKT0016I: Web application ... /metrics/ | 6.3306 | 910 | 11.991 | 2411 |
| [AUDIT ] CWWKT0016I: Web application ... /jwt/ | 6.33377 | 910 | 12.2554 | 2447 |
| [AUDIT ] CWWKT0016I: Web application ... /health/ | 6.33562 | 910 | 12.7419 | 2516 |
| [AUDIT ] CWWKT0016I: Web application ... /auth/ | 11.9435 | 1202 | 18.7875 | 2766 |
| [AUDIT ] CWWKZ0001I: Application acmeair-authservice started... | 12.0022 | 1209 | 18.853 | 2770 |
| [AUDIT ] CWWKF0012I: The server installed the following... | 12.0545 | 1217 | 18.928 | 2780 |
| [AUDIT ] CWWKF0011I: The defaultServer server is ready... | 12.0567 | 1217 | 18.9293 | 2780 |

To confirm the above I compared startup between the two images with the SCC disabled and again with both the SCC and JIT disabled and in both cases a clear startup delta still exists. Without the SCC both images are the same from the JVM's POV so the only difference is in the persistent server state.

vijaysun-omr commented 3 years ago

Very good analysis, Younes; you touched on several of the points I had been curious about.

Joe, I assume you are investigating the workarea and maybe tracing the server and what it is doing based on the last comment.

jdmcclur commented 3 years ago

Yes, investigating, but nothing is standing out so far between the images. I will ask what trace to enable to see what the workarea is doing.

jdmcclur commented 3 years ago

I took some config trace. In the good/fast start case, I see:

[2/20/21 3:25:16:135 UTC] 00000036 id=00000000 com.ibm.ws.config.xml.internal.BundleProcessor               3 Processing cached bundle: org.eclipse.osgi_3.15.0.cl210120210113-1459 [0]
[2/20/21 3:25:16:135 UTC] 00000036 id=00000000 com.ibm.ws.config.xml.internal.BundleProcessor               3 Processing cached bundle: com.ibm.ws.artifact.loose_1.0.48.cl210120210113-1459 [24]
+ a lot more

In the slow start case, I don't see this, so it is not using the bundle cache. Now to figure out why.

jdmcclur commented 3 years ago

With Brent Daniel's help, I think I finally figured this out. It has to do with Docker layers and timestamps (and differences between OpenJ9 and the IBM JDK).

When I run the following

FROM ibmcom/websphere-liberty:kernel-java8-ibmjava-ubi
RUN touch /config/test && stat /config/test
RUN stat /config/test

The first stat returns the modification time with sub-second precision.

Modify: 2021-02-23 21:24:55.072488272 +0000

In the second, it is only whole seconds. Not sure why, but it seems to be something to do with finishing the layer / the overlayfs / maybe moving the file?

Modify: 2021-02-23 21:24:55.000000000 +0000

With the IBM JDK, File.lastModified() returns precision down to ms, which the cache uses to check whether it is still valid. Timestamp stored: 1614115495072

However, the next time it runs, the file's last-modified time is different because the precision has been lost (as shown above). Timestamp of the file on disk: 1614115495000

Since these don't match, it thinks the cache is invalid.
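
In rough Java terms (an illustration of the comparison only, not the actual workarea cache code):

```java
// TimestampMismatchSketch.java -- illustration only, not the Liberty cache code.
public class TimestampMismatchSketch {
    public static void main(String[] args) {
        // Timestamp the cache stored while the configure.sh layer was being built:
        // the IBM JDK's File.lastModified() reported millisecond precision.
        long storedAtBuildTime = 1614115495072L;

        // Timestamp seen on the next server start, after the layer was committed
        // and the sub-second part of the mtime was dropped.
        long seenOnNextStart = 1614115495000L;

        // The check appears to be strict equality, so the cache is treated as
        // stale and the bundles get reprocessed.
        boolean cacheValid = (storedAtBuildTime == seenOnNextStart);
        System.out.println(cacheValid ? "cache reused" : "cache invalidated");
    }
}
```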

Running installUtility in the step before configure.sh works because the bundle files' timestamps have already been truncated by finishing that earlier layer before configure.sh runs, so they are already whole seconds and the bundles are cached "correctly." However, there are other issues with the server configuration not being cached correctly, but this does not seem to impact startup time much.

Also, with OpenJ9, File.lastModified() returns precision down to seconds, so everything just works.

vijaysun-omr commented 3 years ago

Wow, that is one interesting story. Well done debugging that "loss in precision", Joe.

What are we thinking we will be delivering as a fix, then? :)

ymanton commented 3 years ago

Good find. That explains why RUN configure.sh; configure.sh behaved differently than two separate RUN configure.sh lines.

We can't rely on better than 1-second precision. Some OS and filesystem combinations give you ms or ns precision, but it's not guaranteed. Apparently Docker images are an example. [Javadoc](https://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/io/File.html#lastModified()) for File.lastModified() mentions the same thing.

Not sure what to make of the OpenJ9/IBM difference.
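
A quick probe along these lines (my own sketch; the class name is made up) could show what a given JDK/filesystem combination actually reports:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// MtimePrecisionProbe.java -- hypothetical probe, not part of Liberty or the JDK.
public class MtimePrecisionProbe {
    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("mtime-probe", ".tmp");
        try {
            // Milliseconds since the epoch; may be truncated by the JDK or the filesystem.
            System.out.println("File.lastModified():         " + p.toFile().lastModified());
            // NIO view of the same timestamp, usually with all available precision.
            System.out.println("Files.getLastModifiedTime(): " + Files.getLastModifiedTime(p));
        } finally {
            Files.delete(p);
        }
    }
}
```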

As for the fix, is the code saving a timestamp somewhere and later comparing it against a timestamp from the filesystem? If we're comparing timestamps for strict equality then we need to round both back to the most recent second before comparing. If we're checking for order I guess it depends on the semantics of the cache. Why are the timestamps being checked exactly?

jdmcclur commented 3 years ago

As I understand it, in the workarea we compare the timestamps to see if 1) any config has changed (server.xml, etc.) and 2) any bundles have changed and need reprocessing. Currently this is done based on whatever File.lastModified() / the filesystem gives you, and yes, it looks like a strict comparison. Guessing it was designed without anticipating that the underlying filesystem behaviour would change?

Last we talked, Brent was going to look into ignoring the ms (round back to s) as a way to fix this.
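
Something along these lines, just to sketch the idea (the method name and shape are hypothetical, not the actual workarea code):

```java
import java.util.concurrent.TimeUnit;

// SecondPrecisionCompare.java -- sketch only; not the actual workarea cache API.
public class SecondPrecisionCompare {
    // Treat two modification times as equal if they fall in the same whole second.
    static boolean sameModificationTime(long storedMillis, long onDiskMillis) {
        return TimeUnit.MILLISECONDS.toSeconds(storedMillis)
                == TimeUnit.MILLISECONDS.toSeconds(onDiskMillis);
    }

    public static void main(String[] args) {
        // Using the two timestamps from the stat output above:
        System.out.println(sameModificationTime(1614115495072L, 1614115495000L)); // true
    }
}
```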

jdmcclur commented 3 years ago

@ymanton - how does the SCC handle invalidation? Does it use timestamps?

ymanton commented 3 years ago

Good thinking. IIRC timestamps are checked first. If they don't match we do a deeper comparison, which I imagine requires reading classes from disk and extracting them from jars, parsing them, and so on.

I'm not sure exactly how the timestamps are compared and whether or not they would suffer from precision mismatches, but we could be doing more work than necessary at the SCC level as well. I'll look into it.

ymanton commented 3 years ago

It looks like the SCC code has the same issue: we store last-modified times and later compare them for equality, so loss of precision will cause such checks to fail and force us to do more expensive checks. I don't see any evidence of this actually happening in this case though. We may be updating the last-modified time in the SCC if the timestamp check fails but the subsequent checks pass. If so, that would explain why I'm not seeing the timestamp checks failing, i.e. the timestamps in the SCC have mostly been updated by the end of the build because we've started/stopped the server multiple times across Docker layers.

hangshao0 commented 3 years ago

The timestamp checking code is here https://github.com/eclipse/openj9/blob/master/runtime/shared_common/TimestampManagerImpl.cpp#L61

If this is caused by the timestamp issue, there is an option -Xshareclasses:noTimestampChecks that you can try, which turns off the timestamp check. See https://www.eclipse.org/openj9/docs/xshareclasses/#notimestampchecks

ymanton commented 3 years ago

Update regarding the SCC: I wasn't able to find any evidence of timestamp mismatches, so I took a closer look at the code in question and found that we don't actually have to worry about the precision problem after all. We call stat(), which returns a struct containing timestamps in both seconds and, optionally, nanoseconds, but we ignore the ns-precision fields and stick to the s-precision fields.

https://github.com/eclipse/omr/blob/67ddc287341ab2bce7dfee85f1fde1f1e9b3889d/port/unix/omrfile.c#L731-L747