Modern Java support - Githubissues

robdockins commented 4 years ago

I discovered, while working on standing up a new computer, that the way the SAW infrastructure interacts with the Java ecosystem is by now quite outdated. Starting with Java 9, the JRE no longer distributes its base libraries in a monolithic rt.jar, but instead distributes these classes in a new 'JMOD' file format. This is part of an overall infrastructure philosophy change that introduces a new link phase and utility jlink, which gathers together all the transitive dependencies of an executable into a "linked" executable.

The following is the best explanation of this system I've found https://stackoverflow.com/questions/44732915/why-did-java-9-introduce-the-jmod-file-format

For our purposes, I think it would actually makes our life simpler. As we do with with LLVM (via clang and llvm-link), we can rely on the JDK utilities like jlink and jimage to collect together all the pieces we need into one place, rather than having to do complicated classpath handling, discovering where the system has installed various bits we need, etc.

RyanGlScott commented 3 years ago

I've been looking into this recently, and I thought I'd record some of my findings that others may find interesting:

Rather than including the Java runtime's classes in $JAVA_HOME/jre/lib/rt.jar, JDK 9+ now includes them in $JAVA_HOME/lib/modules, which is a JIMAGE file. Unlike JAR files, which are essentially just .zip files in disguise, the JIMAGE file format is intended to be internal to the JDK, and as a result, it is subject to change between different Java releases. As a result, reading the contents of JIMAGE files is discouraged in favor of using the jimage utility to read it.
How fully featured is the jimage utility? Judging from the --help text, it is capable of two things which are of interest to SAW:
- extract: Extract entries in a JIMAGE file to a particular directory.
- list: Print the names of all entries in a JIMAGE file.
At first glance, it would seem like this is all we need. We can use jimage list to compute the names of all classes that we might use in our program, and whenever SAW needs to load a particular class, it can do so with a command similar to the following:
```
$ jimage extract --include /<module-name>/<class-name>.class --dir <temporary-directory> <jimage-file>
```
The problem is that compared to lazily loading classes from JAR files, jimage extract is much slower. As one microbenchmark, I compared the time it takes to extract a single class (java.lang.Class) from OpenJDK 8's rt.jar versus OpenJDK 11's modules file. There isn't a straightforward way to distill the lazy loading code from SAW for this microbenchmark, so I approximated it with the unzip command, which supports extracting individual files from archives. Here are the results:
```
$ time unzip -j rt.jar java/lang/Class.class -d rt-contents
Archive:  rt.jar
extracting: rt-contents/Class.class  

real    0m0.018s
user    0m0.011s
sys     0m0.007s

$ time jimage extract --include /java.base/java/lang/Class.class --dir modules-contents modules           

real    0m0.256s
user    0m0.694s
sys     0m0.028s
```
The jimage extract method is over 14× slower than extracting the same class from a JAR file. Oof! I suspect the reason this is the case is because the --include flag takes the name of a pattern to match, so even though it will ultimately produce a single .class file at the end, it still has to scan through every file in the JIMAGE file in order to check each file's name against the pattern. For a single class, this takes less than a second overall, but I imagine that this time will quickly add up if you load many classes consecutively.
If jimage extract is too slow, is there a faster approach? One possible avenue of exploration is to incorporate the jlink tool into SAW. jlink supports the creation of customized runtime environments that only include the dependencies that are used in a particular application.

At first, I was hopeful that I could use jlink to produce a runtime environment that only contains the class dependencies that are used directly by the things in SAW's classpath. Unfortunately, that appears to be not quite the case, as jlink's smallest unit of granularity is not classes, but rather modules. Modules are another new feature of Java 9 that allow greater control over how code dependencies are managed. As a first approximation, a module can be thought of as a collection of classes, plus metadata which states what other modules they depend on and what modules are allowed to use it. In addition to a classpath, which specifies the locations of .class files and JARs, Java 9+ now features a modulepath that specifies the locations of modules.

The end result is that if you want to use jlink to create a custom runtime environment that includes a certain class, you'll have to include the entirety of the module that defines it. Does this help improve performance? A little bit, but not as much as I would have hoped. As one microbenchmark, I used jlink to create a custom runtime environment that includes java.lang.Class. In order to do this, I needed to use the jdeps utility to determine what module it lives in first so that I could then pass it to jlink:
```
$ jdeps --print-module-deps Class.class
java.base

$ jlink --no-header-files --no-man-pages --add-modules java.base --output test

$ cd test/lib/

$ time jimage extract --include /java.base/java/lang/Class.class --dir modules-contents modules 

real    0m0.162s
user    0m0.316s
sys     0m0.020s
```
That's certainly an improvement over extracting java.lang.Class from the whole modules file, but it's about 9× slower than lazy loading java.lang.Class from rt.jar. Bummer.
This all makes me wonder if lazy loading is the right approach here. Perhaps we should just load all of the .class files from a JIMAGE file once upfront and then use them for the duration of a SAW script? For example, here is the time it takes to extract all of the files in a custom runtime environment that only includes java.base:
```
$ time jimage extract --include regex:.*.class --dir modules-contents modules 

real    0m0.484s
user    0m0.919s
sys     0m0.245s
```
This is about 3× slower than loading a single class from the same JIMAGE file, and since java.base contains way more than 3 classes, this approach would save time overall, assuming that the SAW script actually uses more than 3 of those classes. It's still unclear to me how the performance of this approach compares to the lazy loading approach currently employed by SAW, however.

atomb commented 3 years ago

One middle ground would be to use the jimage extract command above, but then to still run the Haskell class file parser lazily. The Haskell code is the slowest part of the system, and the most memory intensive, so it's the part we want to be careful about running when it's not necessary.

RyanGlScott commented 3 years ago

You raise a good point. In my mind, I was lumping together the processes of (1) uncompressing the archive and (2) parsing the classes themselves, but they're really separate things. If the cost of (2) outweighs the cost of (1), then it might not be as vital to optimize (1) (which is what https://github.com/GaloisInc/saw-script/issues/861#issuecomment-756889807 benchmarks) as I originally thought.

robdockins commented 3 years ago

I suspect the key thing is to control the amount of memory usage, which is more important than the time required.

My inclination would be to use jlink to do all the dependency resolution/gathering from within the Java ecosystem, and then extract the result.

RyanGlScott commented 3 years ago

Upon further thought, the jdeps-followed-by-jlink approach is trickier than I imagined it would be. This is because in order to fully specify the modules that jdeps needs for dependency resolution, it needs to know all the .class files that are used in a program. SAW scripts, however, don't specify .class files ahead of time, as they can be specified at runtime using java_load_class. This means that if we wanted to use jdeps followed by jlink to create a minimal JRE from which to extract classes to load, then in the worst case, we'd have to recreate this JRE each time a script invokes java_load_class. After all, each class could introduce new dependencies that previously encountered classes did not use!

robdockins commented 3 years ago

I don't think we should take the current SAW interface as given. If we need to force changes to the way JVM code is loaded into SAW to make things work better, I think that should be on the table.

RyanGlScott commented 3 years ago

OK, here's one idea. As SAW is currently designed, we already have to specify the JARs and classpaths that a script needs ahead of time using the --jars and --classpath flags, respectively. When a script starts, no meaningful work is done with the classpaths (besides storing it for later use). Later, when SAW attempts to load a class it hasn't seen before, only then will it search through the classpaths to see if there is a match.

Instead of doing this, when SAW starts a script using modern Java, SAW can take all of the specified JARs and classpaths, pass them jdeps, and then use that information to jlink together a suitable JRE. This JRE will be extracted to a temporary directory and then can be consulted whenever a new class needs to be loaded.

The upside is that we only need to traverse the classpaths once with this approach. The downside is that this process may take longer than is strictly required if the user specifies more classpaths than what are actually needed to run the classes they care about, but I don't think much can be done about that, save for advising users not to over-approximate their dependencies.

atomb commented 3 years ago

I like the approach you're suggesting, @RyanGlScott. I think at the very least it's worth trying and benchmarking to see if it has any problematic overhead in common cases.

RyanGlScott commented 3 years ago

Some further investigation reveals that jlink is not the silver bullet I was hoping it would be, for two reasons:

In the comments above, I gave the impression that jlink would produce a JIMAGE file that contained all of the classes used in an application. Unfortunately, that is not the case. jlink will bundle all of the classes that live in named modules, but any classes that do not live in named modules will be excluded. As a result, one will still need to rely on traditional classpath-based resolution in order look up any classes that aren't included in the JIMAGE file. Since most Java classes in the wild don't yet use modules (including most of the examples in the saw-script repo itself), that means the old ways of resolving classpaths aren't going away time soon.
OK, so we'll still have to look up things in classpaths. Bummer. But looking up classes in the jlink-produced JIMAGE file should be quick, right? Well, perhaps not as quick as I would have expected. Recall from https://github.com/GaloisInc/saw-script/issues/861#issuecomment-756889807 that when you jimage extract a JIMAGE file, the resulting directory structure will look like this template:
```
<extracted-jimage-path>/<module-name>/<class-name>.class
```
I was originally hopeful that when we look up a class that lives in a module, we could do so quickly by exploiting the uniquely determined directory structure above. But here's the catch: a ClassName, to the best of my knowledge, only contains the <class-name> part and not the <module-name> part of the template above. Moreover, it's not clear to me how to determine the name of a class's module by simply parsing the corresponding .class file, which is what SAW does. (I've tried reading the JVM specification for the .class file format, but I wasn't able to see an obvious way to do this.)

As a result, extracting a JIMAGE file is effectively tantamount to adding some extra directories onto the classpath, as we must search through each <module-name> directory in order to look up any classes that are a part of the JIMAGE file. (There's some unresolved questions about the best way to faithfully emulate what Java does here. For example, should classes from JIMAGE files on the module path be looked up before classes from the classpath? After? I wasn't able to find any documentation on this point.)

Although jlink isn't a silver bullet, the silver lining is that we might not need to change SAW's Codebase type (responsible for loading .class files) at all. In light of the observations above, loading class files from a JIMAGE file is basically a slightly more convoluted version of what Codebase already does: add some directories to the classpath and consult those whenever classes are loaded for the first time. The only new thing that SAW would have to do is use jdeps/jlink/jimage to produce the thing that gets added to the classpath.

RyanGlScott commented 3 years ago

I have good news and bad news. tl;dr I have a working prototype, but there are some serious performance issues that we should consider carefully.

The good news is that I've pushed a proof-of-concept implementation of the "use jdeps/jlink/jimage to produce the thing that gets added to the classpath" idea from https://github.com/GaloisInc/saw-script/issues/861#issuecomment-762332870 to the wip/T861 branch. The key change is the introduction of a new --java-bin-dirs flag, which allows specifying the location(s) where jdeps, jlink, and jimage live. With this, I am able to run SAW on some examples in the SAW repo with Java 11.

The bad news is that the startup overhead imposed by this approach is quite severe. I was worried about jimage extract taking a long time, but the time that jimage takes is minuscule compared to jlink. Running jlink on even a tiny example that only uses java.base takes at least 3 seconds! To be fair, SAW is only using a small portion of the resulting JIMAGE file (which is about 500 megabytes total), but even passing flags such as --no-header-files --no-man-pages to reduce the size a bit doesn't make much of a difference.

Another annoying problem is that we extract the contents of the JIMAGE file to a temporary directory which persists after running SAW. This leaves behind about 90 megabytes of cruft each time SAW is invoked. For a single invocation, this isn't really noticeable, but if you were to invoke SAW many times consecutively, I could foresee this adding up quickly. In fact, I'm too afraid to run the test suite at the moment, since I'm worried that I'll chew through too much disk space...

The temporary directory size problem is likely solvable if we make sure to clean up the temporary directory that SAW extracts JIMAGE files to after SAW stops execution. (It's not yet clear to me what the best way to do this is, since SAW has various directions the control can flow in, but I'm sure there's a way we could accomplish this with something bracket-like.) The jlink overhead problem, however, is something I don't have an answer for. Ultimately, we don't have much control over jlink's performance, and since the JIMAGE format is internal to the JDK, we don't have many other alternatives for creating them.

RyanGlScott commented 3 years ago

If the bad news from https://github.com/GaloisInc/saw-script/issues/861#issuecomment-762464241 has you frightened, perhaps we should consider some alternatives. Some ideas that come to mind:

Don't bother with jlink at all, and just use jimage extract on the monolithic JIMAGE file shipped with Java that includes all of the standard library. This is still slow—about 1.5 seconds on my machine—but that's still quicker than using jlink beforehand. On the other hand, it would exacerbate the space requirements, as the extracted contents would now take up about 460 megabytyes of space.
Although the JIMAGE file format is internal to the JDK (and subject to change in the future), it turns out that the JDK does include a standalone library called libjimage. (This is surprisingly hard to find, and I only learned about this library's existence through some painfully specific Google searches.) libjimage is to JIMAGE as libzip is to ZIP, and to my surprise, the API that libjimage provides actually includes documentation in the form of comments (although I had to trawl through the OpenJDK source code to find it).

If my reading of the source code for libjimage is accurate, then you can use a combination of the JIMAGE_FindResource and JIMAGE_GetResource functions to retrieve individual classes from a JIMAGE file. I have no idea how fast that would be, but given that the JDK itself uses this API, it's likely to be reasonably quick.

If we wanted to, SAW could link against this library and use it to retrieve classes instead of relying on the jimage utility. This, of course, would involve more work (making GHC link against C code is uniquely challenging), but a more fundamental question is: how stable is the libjimage API? Honestly, I have no idea. It's possible that it could change in the future, but one could hope that the API—which is fairly high-level as far as C goes—changes less frequently than the underlying file format. If it's any reassurance, there exists a Rust library that provides bindings to libjimage, so at least one other person thought this was a good idea.

atomb commented 3 years ago

Based on the above, I'm somewhat inclined to skip jlink entirely and depend mostly on jimage. It's somewhat slow, but perhaps not prohibitively so. And there is a flag to limit which classes it exports. For example, this works for me:

jimage extract --include "regex:.*/java/lang/Object.class,regex:.*/java/lang/String.class" $JAVA_HOME/lib/modules

The exact syntax of the string passed to --include doesn't seem to be documented, so I figured out the above by trial and error. At the very least, the above could be used on a class-by-class basis. Extracting a single class on my machine takes about 0.3s, which isn't ideal but perhaps not the end of the world.

One detail of the above is that it requires knowing where the modules file is, just like we need to with rt.jar right now. However, I think that there are some more reliable ways to do that than what we do right now. I wonder if the following will work:

If JAVA_HOME is set, look for either $JAVA_HOME/lib/modules or $JAVA_HOME/jre/lib/rt.jar. Only one of those files will exist, and which exists tells us which class loading approach we need to use.
If JAVA_HOME is not set, try running java -XshowSettings and look for java.home = ... in the result.
If we're working with newer JDKs, create an empty temporary directory on startup to store class files in that we delete on exit. Every call to jimage extract will use this directory as the argument to jimage extract --dir=....
Our treatment of this new directory will be different than a normal class path entry. Every subdirectory of this directory becomes a class path entry. (And new subdirectories may show up at any point, so we'll need to check every time we look there.)
If using jimage once per class is too slow, jdeps -v <classfile or jarfile> will identify all classes that class or JAR depends on once and for all, so we can extract them all at once.

Does that sound reasonable to you? Am I missing anything?

RyanGlScott commented 3 years ago

Ah, I really should have listed your suggestion—use jimage to extract one class at a time—as alternative (3). I suspect that that approach will still be noticeably slower, since even simple Java programs depend on many classes transitively from the standard library, so even 0.3s of overhead per class is likely to add up fast. But I don't yet have exact numbers to back up this claim, so I should try implementing alternative (3) first.

A few refinements to your suggestions:

One detail of the above is that it requires knowing where the modules file is, just like we need to with rt.jar right now. However, I think that there are some more reliable ways to do that than what we do right now.

I think that if we want to do this The Right Way™, we should emulate Java's module path. As the name suggests, the module path is where Java looks for modules (in the form of JIMAGE files), in the exact same way that the classpath is where Java looks for classes (in the form of JAR and .class files). Most Java tools allow specifying the module path with the -p/--module-path flags, and since SAW already tracks its own paths for .class and JAR files via the -c and -j flags, respectively, it seems like a natural addition for SAW to have its own -p flag as well.

If SAW were to gain a -p flag, then we could indicate where the standard modules file is as a command-line argument, much in the same way that we indicate where rt.jar lives as an argument to -j. (Granted, we will also need to indicate to SAW where jimage is located, so we could consider taking advantage of this to automatically detect where the standard modules file lives, but it's not strictly required.) More importantly, this would allow SAW to support other Java libraries that are distributed as JIMAGE files.

If we're working with newer JDKs, create an empty temporary directory on startup to store class files in that we delete on exit. Every call to jimage extract will use this directory as the argument to jimage extract --dir=....

If we only ever extract a single class at a time, then I don't think we need to share a single temporary directory across all jimage extract invocations. After all, we only need to extract a class when we load it for the first time, and once we load it, SAW caches the resulting Class file in its CodebaseState. As a result, we can extract a class to a temporary directory, load it into the CodebaseState cache, and delete the temporary directory all in one fell swoop. This avoids the need to ever track temporary directory locations throughout the duration of a single SAW invocation.

Our treatment of this new directory will be different than a normal class path entry. Every subdirectory of this directory becomes a class path entry. (And new subdirectories may show up at any point, so we'll need to check every time we look there.)

As mentioned above, we can avoid ever adding the results of jimage extract to the classpath if we simply load one class at a time. There is an interesting question, however, of what order we should look up classes in. Currently, we first looks in the JAR paths (specified by -j) followed by the classpaths (specified by -c). Where would the module paths (specified by -p) fit into this picture? Honestly, I'm not sure, and I couldn't find any documentation on the order in which Java itself does this. (There's this, but I couldn't find a modern version of this that covers Java 9+.) If loading classes from JIMAGE files is the slow path, we might have to consult JIMAGE files last out of sheer necessity.

If using jimage once per class is too slow, jdeps -v <classfile or jarfile> will identify all classes that class or JAR depends on once and for all, so we can extract them all at once.

One correction here: if you don't use jlink to combine all of the modules from your module path into a single JIMAGE file, then in general, it's not possible to extract all classes you need in one go. That is, you'd need to invoke jimage extract on every module in your module path in order to be sure that you've extracted all classes. (For most programs, the module path will only consist of the standard modules file, but this is a distinction worth making nonetheless.)

This is basically the same approach that I tried in https://github.com/GaloisInc/saw-script/issues/861#issuecomment-762464241, but without the intermediate jlink step. As a result, there is still the issue of leaving behind a lot of jimage extract files lingering on a temporary directory after SAW has finished execution. This is likely solvable with enough clever engineering, but I wanted to mention this to contrast with alternative (3), which can solve this issue in a more direct way (see my comments above after "[...] create an empty temporary directory on startup [...]").

RyanGlScott commented 3 years ago

Also, thanks for teaching me about jdeps -v—I was under the impression that jdeps only listed module dependencies, but it can list class dependencies too! Does this actually list all of the classes that an application transitively depends on, however? For example, if I run jdeps -v on ecdsa.jar from SAW's examples directory, I get:

``` $ jdeps -v ecdsa.jar ecdsa.jar -> java.base com.galois.ecc.AffinePoint -> com.galois.ecc.ECCProvider ecdsa.jar com.galois.ecc.AffinePoint -> java.lang.Object java.base com.galois.ecc.ECC -> com.galois.ecc.ECCProvider ecdsa.jar com.galois.ecc.ECC -> com.galois.ecc.NISTCurveFactory ecdsa.jar com.galois.ecc.ECC -> com.galois.ecc.PublicKey ecdsa.jar com.galois.ecc.ECC -> com.galois.ecc.Signature ecdsa.jar com.galois.ecc.ECC -> java.io.PrintStream java.base com.galois.ecc.ECC -> java.lang.Integer java.base com.galois.ecc.ECC -> java.lang.Long java.base com.galois.ecc.ECC -> java.lang.Object java.base com.galois.ecc.ECC -> java.lang.String java.base com.galois.ecc.ECC -> java.lang.System java.base com.galois.ecc.ECC -> java.util.Random java.base com.galois.ecc.ECCProvider -> com.galois.ecc.AffinePoint ecdsa.jar com.galois.ecc.ECCProvider -> com.galois.ecc.JacobianPoint ecdsa.jar com.galois.ecc.ECCProvider -> com.galois.ecc.PublicKey ecdsa.jar com.galois.ecc.ECCProvider -> com.galois.ecc.Signature ecdsa.jar com.galois.ecc.ECCProvider -> com.galois.ecc.TwinMulAux2Rslt ecdsa.jar com.galois.ecc.ECCProvider -> java.lang.IllegalArgumentException java.base com.galois.ecc.ECCProvider -> java.lang.NullPointerException java.base com.galois.ecc.ECCProvider -> java.lang.Object java.base com.galois.ecc.ECCProvider -> java.lang.String java.base com.galois.ecc.JacobianPoint -> com.galois.ecc.ECCProvider ecdsa.jar com.galois.ecc.JacobianPoint -> java.lang.Object java.base com.galois.ecc.NIST32 -> com.galois.ecc.AffinePoint ecdsa.jar com.galois.ecc.NIST32 -> com.galois.ecc.ECCProvider ecdsa.jar com.galois.ecc.NIST32 -> java.io.PrintStream java.base com.galois.ecc.NIST32 -> java.lang.String java.base com.galois.ecc.NIST32 -> java.lang.System java.base com.galois.ecc.NIST64 -> com.galois.ecc.AffinePoint ecdsa.jar com.galois.ecc.NIST64 -> com.galois.ecc.ECCProvider ecdsa.jar com.galois.ecc.NISTCurveFactory -> com.galois.ecc.ECCProvider ecdsa.jar com.galois.ecc.NISTCurveFactory -> com.galois.ecc.P384ECC32 ecdsa.jar com.galois.ecc.NISTCurveFactory -> com.galois.ecc.P384ECC64 ecdsa.jar com.galois.ecc.NISTCurveFactory -> java.lang.Object java.base com.galois.ecc.P384Constants -> com.galois.ecc.AffinePoint ecdsa.jar com.galois.ecc.P384Constants -> java.lang.Object java.base com.galois.ecc.P384ECC32 -> com.galois.ecc.AffinePoint ecdsa.jar com.galois.ecc.P384ECC32 -> com.galois.ecc.NIST32 ecdsa.jar com.galois.ecc.P384ECC32 -> com.galois.ecc.P384Constants ecdsa.jar com.galois.ecc.P384ECC64 -> com.galois.ecc.AffinePoint ecdsa.jar com.galois.ecc.P384ECC64 -> com.galois.ecc.NIST64 ecdsa.jar com.galois.ecc.P384ECC64 -> com.galois.ecc.P384Constants ecdsa.jar com.galois.ecc.PublicKey -> java.lang.Object java.base com.galois.ecc.Signature -> java.lang.Object java.base com.galois.ecc.TwinMulAux2Rslt -> java.lang.Object java.base ```

I strongly suspect that this omits many classes that this would transitively depend on (e.g., java.lang.Class).

atomb commented 3 years ago

Yeah, I think I agree with you about the Right Way of adding a notion of module path in addition to the existing class path. One thing I'd really like, however, would be for it to be possible for it to "just work" in many cases without user input. The need to specify the location of rt.jar in the current version tends to be quite painful. Respecting JAVA_HOME and attempting to use java to find the home if it's not set seems like a significant usability improvement.

You make a really good point about not needing to keep .class files on disk if we're loading them on-demand. Making "extract + load + delete" a self-contained step will probably make it less likely that we leave stray class files sitting around.

And you may be right that the output of jdeps -v only includes direct dependencies. It might be possible to then feed those dependencies into jdeps again until we reach a fixpoint, though that could be slow. Maybe worth trying, though? If we can get a list of all dependencies, my initial test indicated that extracting two class files from a JIMAGE file took about the same time as extracting one, so extracting 20, for example, might not take much longer.

One final point is that, although I think it's worth having support for multiple JAR or JIMAGE dependency files, the vast, vast majority of the cases are likely to involve only the single modules file included with the JDK, so optimizing for that case seems to make sense. Although older JDKs included JAR files beyond rt.jar, we've never done a verification that would depend on them, and I expect we never will. Given that the modules JIMAGE file contains even more stuff, I'd say that probably will be even more true for JDK versions >=9.

My inclination for search order would be to first search the module path, then JAR files, then classpath directories, though I admit I haven't given it that much thought.

RyanGlScott commented 3 years ago

One thing I'd really like, however, would be for it to be possible for it to "just work" in many cases without user input. The need to specify the location of rt.jar in the current version tends to be quite painful. Respecting JAVA_HOME and attempting to use java to find the home if it's not set seems like a significant usability improvement.

I agree. It may be worth adding JAVA_HOME support independently of the effort to support modern Java, since I think that's mostly a matter of setting up some environment variable–related circuitry in the right places. Once the circuitry is installed for SAW's classpath, wiring it up for the module path should be straightforward.

And you may be right that the output of jdeps -v only includes direct dependencies. It might be possible to then feed those dependencies into jdeps again until we reach a fixpoint, though that could be slow. Maybe worth trying, though? If we can get a list of all dependencies, my initial test indicated that extracting two class files from a JIMAGE file took about the same time as extracting one, so extracting 20, for example, might not take much longer.

Iterating jdeps -v to a fixpoint might be trickier than it sounds. As an experiment, I added an extra putStrLn in SAW to indicate whenever a class is loaded for the first time. With this, I was able to see how many classes end up getting loaded when running the relatively small ffs_java.saw example:

[17:45:53.659] Loading file "/home/rscott/Documents/Hacking/Haskell/saw-script/doc/tutorial/code/ffs_java.saw"
[17:45:53.659] Extracting reference term
RGS tryLookupClass: Loading FFS
extracting ffs_ref
RGS tryLookupClass: Loading java/lang/Class
RGS tryLookupClass: Loading java/lang/String
RGS tryLookupClass: Loading java/io/BufferedOutputStream
RGS tryLookupClass: Loading java/io/FilterOutputStream
RGS tryLookupClass: Loading java/io/OutputStream
RGS tryLookupClass: Loading java/io/PrintStream
RGS tryLookupClass: Loading java/lang/Object
RGS tryLookupClass: Loading java/lang/System
RGS tryLookupClass: Loading java/lang/StringIndexOutOfBoundsException
RGS tryLookupClass: Loading java/lang/Exception
[17:45:53.705] Extracting implementation term
extracting ffs_imp
[17:45:53.711] Proving equivalence
[17:45:53.760] Done.

On the other hand, jdeps -v is very stubborn and only reports one of these classes:

$ jdeps -v FFS.class 
FFS.class -> java.base
   FFS                                                -> java.lang.Object                                   java.base

I hunted around through jdeps' command-line options, but I couldn't find anything that made it print more information than that. The question now becomes: how do we feed java.lang.Object back into jdeps? The only way I can think of would be to jimage extract it out, point jdeps at the location where it was extracted, and repeat. But at that point, we're basically extracting classes more-or-less one at a time—or at the very least, at a much finer granularity than "all at once".

One final point is that, although I think it's worth having support for multiple JAR or JIMAGE dependency files, the vast, vast majority of the cases are likely to involve only the single modules file included with the JDK, so optimizing for that case seems to make sense. Although older JDKs included JAR files beyond rt.jar, we've never done a verification that would depend on them, and I expect we never will. Given that the modules JIMAGE file contains even more stuff, I'd say that probably will be even more true for JDK versions >=9.

Even if we never do end up using a JAR or JIMAGE file beyond the ones shipped with Java itself, I'll sleep a little easier at night knowing that we could do so in the future if the need arose :)

RyanGlScott commented 3 years ago

I spoke with @atomb about this recently. Here are some takeaways:

The libjimage approach, while tempting, is a non-starter due to the fact that it would require linking against C++ code. I'm still curious about the idea of investigating precisely what goes wrong (and if there are perhaps GHC bugs in our way), so I've opened GaloisInc/crucible#621 for this task.
If libjimage is out of the question, then we don't have many alternatives besides using the utilities that the JDK ships with. Unfortunately, just about every combination of jdeps/jlink/jimage that I've tried imposes some non-trivial performance overhead. However, that doesn't necessarily mean that using these tools is infeasible—we would just need to figure out a way to amortize the cost of using these tools across multiple SAW invocations.

How, indeed, should we amortize the cost? The most expensive part of using jdeps/jlink/jimage is the initial creation/extraction of the .class files that SAW needs. Once that is done, however, subsequent invocations of SAW could, in theory, re-use the resulting .class files without needing to go through jdeps/jlink/jimage again. This means that the first time you invoke SAW with a newly encountered JIMAGE file will be slow, but invoking SAW afterwards with the same JIMAGE file should reasonably fast.

Some questions:

Where should we cache the results of jdeps/jlink/jimage? One possibility is to use XDG_CACHE_HOME (per the XDG Base Directory Specification), which would default to ~/.cache on non-Windows systems or %LOCALAPPDATA% on Windows. I'm not sure if SAW is already using XDG directories, but I don't think this is the case.
At what granularity should we extract things from JIMAGE files? We've discussed several possibilities above:
- Extract all classes from a JIMAGE file at once
- Extract individiual classes on an as-needed basis
- Somewhere in between
I suppose the answer to this question depends on how much space you're willing to devote to a cache. The most likely scenario is that the only JIMAGE file you'll be using is the standard modules file, which consists of about 447 megabytes of class files when extracted. On the other hand, the java.base module in particular (which is the only thing that most simple Java programs use) has about 88 megabytes of class files when extracted. Perhaps the granularity we choose should depend on how much space people want to cache.
I'm using the word "cache", but how do we actually check if a JIMAGE file (or some subset of it) has already been extracted to the cache? This will be important since we'll need to avoid unnecessary lookups from JIMAGE files whenever possible for efficiency reasons. If we extract all classes from a JIMAGE file at once, then one simple option would be to store a checksum of the JIMAGE file in the cache itself. If we operate at a finer granularity, then we'll need to be a bit more clever.

RyanGlScott commented 3 years ago

One correction to my musings above: the module path actually isn't where one specifies the locations of JIMAGE files. Rather, it's the location where Java searches for JARs and .class files for modularized code. In fact, I'm not exactly sure how you specify the location of JIMAGE files to Java! The answer may very well be "you don't", as the Java developers likely intended for jlink to completely hide away the details of JIMAGE files from end users.

This is arguably a good thing from SAW's point of view, as it means that we can avoid having a separate command-line flag for JIMAGE files at all. Instead, you point SAW at the directory where java lives and it will take care of the rest. This does mean that #1030 will now be a requirement in order to make this work, rather than just a quality-of-life improvement that makes it easier to invoke SAW.

RyanGlScott commented 3 years ago

I've implemented the caching idea from https://github.com/GaloisInc/saw-script/issues/861#issuecomment-763696924 in #1046, operating at the granularity of extracting one class at a time from modules. This works well enough for the examples I've tried it on, and once the classes are cached, running SAW with JDK 11 is just as fast as with JDK 8.

The PR isn't ready to be merged yet, as there are some tasks that it is blocked on (#993 and #1030), but I'd welcome any feedback you could provide.

RyanGlScott commented 3 years ago

See GaloisInc/crucible#638 for the crucible-jvm–related changes.

RyanGlScott commented 3 years ago

Once again, I have good news and bad news.

The good news is that I've factored out basically all of the necessary functionality for reading from JIMAGE files into crucible-jvm as part of GaloisInc/crucible#638. This is nice because the crucible-jvm executable in crucible (which perhaps ought to be called crux-jvm instead) can also benefit from caching JIMAGE files—in fact, crucible-jvm and saw can quite literally share the same cache on the same machine. Moreover, I've demonstrated in #1046 that the SAW test suite passes with JDK 15, the latest release at the time of writing.

The bad news is that despite all this, GaloisInc/crucible#638 is actually quite broken still. I only realized this after trying to run the crucible-jvm test suite, which consists of a single .java file that imports String. As it turns out, crucible-jvm chokes really badly on String with modern JDKs. What follows is a tale of woe:

As observed here, running the crucible-jvm test suite with a modern JDK will fail like this:

``` javac -g Str.java Compiled from "Str.java" class Str { Str(); Code: 0: aload_0 1: invokespecial #1 // Method java/lang/Object."":()V 4: return public static void main(java.lang.String[]); Code: 0: ldc #2 // String 1234567890 2: astore_1 3: new #3 // class java/lang/StringBuffer 6: dup 7: aload_1 8: invokespecial #4 // Method java/lang/StringBuffer."":(Ljava/lang/String;)V 11: astore_2 12: aload_2 13: iconst_2 14: ldc #5 // String abcd 16: invokevirtual #6 // Method java/lang/StringBuffer.insert:(ILjava/lang/String;)Ljava/lang/StringBuffer; 19: pop 20: getstatic #7 // Field java/lang/System.out:Ljava/io/PrintStream; 23: aload_2 24: invokevirtual #8 // Method java/io/PrintStream.println:(Ljava/lang/Object;)V 27: return } [Crux] Checking "Str.java" starting executeCrucibleJVM [Crux] Index 10 is not a method reference. CallStack (from HasCallStack): error, called at src/Language/JVM/Parser.hs:387:12 in jvm-parser-0.3.0.0-inplace:Language.JVM.Parser ```
@atomb pointed out that this was because crucible's jvm-parser submodule, at the time of writing, doesn't include a fix for GaloisInc/saw-script#920. After bumping the submodule to commit https://github.com/GaloisInc/jvm-parser/commit/d440e6bc91c9ba7d48f04e3a7cbf3086d9720b1e, which contains a fix for that issue, I instead get a different error when running the crucible-jvm test suite:

``` javac -g Str.java Compiled from "Str.java" class Str { Str(); Code: 0: aload_0 1: invokespecial #1 // Method java/lang/Object."":()V 4: return public static void main(java.lang.String[]); Code: 0: ldc #2 // String 1234567890 2: astore_1 3: new #3 // class java/lang/StringBuffer 6: dup 7: aload_1 8: invokespecial #4 // Method java/lang/StringBuffer."":(Ljava/lang/String;)V 11: astore_2 12: aload_2 13: iconst_2 14: ldc #5 // String abcd 16: invokevirtual #6 // Method java/lang/StringBuffer.insert:(ILjava/lang/String;)Ljava/lang/StringBuffer; 19: pop 20: getstatic #7 // Field java/lang/System.out:Ljava/io/PrintStream; 23: aload_2 24: invokevirtual #8 // Method java/io/PrintStream.println:(Ljava/lang/Object;)V 27: return } [Crux] Checking "Str.java" starting executeCrucibleJVM initializeClass java/lang/String (start) initializeClass java/lang/Object (start) special java/lang/Object/ initializeClass java/lang/Object (finish) special java/lang/String/ initializeClass java/lang/String (finish) [Crux] resolveField: Field FieldId {fieldIdClass = ClassName "java/lang/Object", fieldIdName = "value", fieldIdType = char[]} not found CallStack (from HasCallStack): error, called at src/Lang/Crucible/JVM/Translation/Monad.hs:61:15 in crucible-jvm-0.2-inplace:Lang.Crucible.JVM.Translation.Monad jvmFail, called at src/Lang/Crucible/JVM/Translation/Class.hs:836:23 in crucible-jvm-0.2-inplace:Lang.Crucible.JVM.Translation.Class ```

This time, the problem lies in the fact that the API for the String class has changed. In JDK 8, the private value field in String has type char[], but in recent JDKs, such as JDK11, its type has changed to byte[]. crucible-jvm, on the other hand, hard-codes the assumption that value is of type char[] here:

https://github.com/GaloisInc/crucible/blob/12a1fba8b31073d8f7badd3718f8de03f7244d14/crucible-jvm/src/Lang/Crucible/JVM/Translation/Class.hs#L907-L915
As a quick experiment, I tried to see what would happen if I updated this assumption to use the modern JDK's type for value:
```
diff --git a/crucible-jvm/src/Lang/Crucible/JVM/Translation/Class.hs b/crucible-jvm/src/Lang/Crucible/JVM/Translation/Class.hs
index 3c09eb72..821ab6cd 100644
--- a/crucible-jvm/src/Lang/Crucible/JVM/Translation/Class.hs
+++ b/crucible-jvm/src/Lang/Crucible/JVM/Translation/Class.hs
@@ -911,7 +911,7 @@ refFromString s =  do
 -- We'll want want to REVISIT this in the future.
 obj1 <- setInstanceFieldValue
   obj
-    (J.FieldId "java/lang/String" "value" J.charArrayTy)
+    (J.FieldId "java/lang/String" "value" J.byteArrayTy)
   (RValue (App (JustValue knownRepr arrRef)))

 obj2 <- setInstanceFieldValue
```
This appeared to fix the previous issue, but it uncovered another issue:

``` javac -g Str.java Compiled from "Str.java" class Str { Str(); Code: 0: aload_0 1: invokespecial #1 // Method java/lang/Object."":()V 4: return public static void main(java.lang.String[]); Code: 0: ldc #2 // String 1234567890 2: astore_1 3: new #3 // class java/lang/StringBuffer 6: dup 7: aload_1 8: invokespecial #4 // Method java/lang/StringBuffer."":(Ljava/lang/String;)V 11: astore_2 12: aload_2 13: iconst_2 14: ldc #5 // String abcd 16: invokevirtual #6 // Method java/lang/StringBuffer.insert:(ILjava/lang/String;)Ljava/lang/StringBuffer; 19: pop 20: getstatic #7 // Field java/lang/System.out:Ljava/io/PrintStream; 23: aload_2 24: invokevirtual #8 // Method java/io/PrintStream.println:(Ljava/lang/Object;)V 27: return } Warning: Parsing the index cache failed (Data.Binary.Get.runGet at position 16: Non-matching structured hashes: f46da61e7afa58a5e8fd1d2b6fb79899; expected: a257ca064dfb5e0cb74f74e64a975b9e). Trying to regenerate the index cache... Resolving dependencies... [Crux] Checking "Str.java" starting executeCrucibleJVM initializeClass java/lang/String (start) initializeClass java/lang/Object (start) special java/lang/Object/ initializeClass java/lang/Object (finish) special java/lang/String/ initializeClass java/lang/String (finish) new ClassName "java/lang/StringBuffer" (start) fields are [FieldId {fieldIdClass = ClassName "java/lang/AbstractStringBuilder", fieldIdName = "value", fieldIdType = byte[]},FieldId {fieldIdClass = ClassName "java/lang/AbstractStringBuilder", fieldIdName = "coder", fieldIdType = byte},FieldId {fieldIdClass = ClassName "java/lang/AbstractStringBuilder", fieldIdName = "count", fieldIdType = int},FieldId {fieldIdClass = ClassName "java/lang/AbstractStringBuilder", fieldIdName = "EMPTYVALUE", fieldIdType = byte[]},FieldId {fieldIdClass = ClassName "java/lang/AbstractStringBuilder", fieldIdName = "MAX_ARRAY_SIZE", fieldIdType = int},FieldId {fieldIdClass = ClassName "java/lang/StringBuffer", fieldIdName = "toStringCache", fieldIdType = java.lang.String},FieldId {fieldIdClass = ClassName "java/lang/StringBuffer", fieldIdName = "serialVersionUID", fieldIdType = long},FieldId {fieldIdClass = ClassName "java/lang/StringBuffer", fieldIdName = "serialPersistentFields", fieldIdType = java.io.ObjectStreamField[]}] new ClassName "java/lang/StringBuffer" (finish) initializeClass java/lang/StringBuffer (start) initializeClass java/lang/AbstractStringBuilder (start) initializeClass java/lang/Object (start) initializeClass java/lang/Object (finish) invoking static method with return type UnitRepr initializeClass java/lang/AbstractStringBuilder (finish) special java/lang/StringBuffer/ initializeClass java/lang/StringBuffer (finish) invoking static method with return type UnitRepr initializeClass java/lang/String (start) initializeClass java/lang/Object (start) initializeClass java/lang/Object (finish) initializeClass java/lang/String (finish) invoke: java/lang/StringBuffer/insert finish invoke:java/lang/StringBuffer/insert initializeClass java/lang/System (start) initializeClass java/lang/Object (start) initializeClass java/lang/Object (finish) Initializing System.out static field Finished initializing System.out Initializing System.err static field Finished initializing System.err initializeClass java/lang/System (finish) invoke: java/io/PrintStream/println finish invoke:java/io/PrintStream/println initializeClass java/lang/AbstractStringBuilder (start) initializeClass java/lang/Object (start) initializeClass java/lang/Object (finish) invoking static method with return type UnitRepr initializeClass java/lang/AbstractStringBuilder (finish) invoke: java/lang/String/length finish invoke:java/lang/String/length initializeClass java/lang/AbstractStringBuilder (start) initializeClass java/lang/Object (start) initializeClass java/lang/Object (finish) invoking static method with return type UnitRepr initializeClass java/lang/AbstractStringBuilder (finish) invoking static method with return type UnitRepr invoke: java/lang/StringBuffer/append finish invoke:java/lang/StringBuffer/append getfield "java/lang/String.value" invoke: java/lang/String/coder finish invoke:java/lang/String/coder initializeClass java/lang/String (start) initializeClass java/lang/Object (start) initializeClass java/lang/Object (finish) initializeClass java/lang/String (finish) getfield "java/lang/String.coder" initializeClass java/lang/Object (start) initializeClass java/lang/Object (finish) invoking static method with return type UnitRepr initializeClass java/lang/String (start) initializeClass java/lang/Object (start) initializeClass java/lang/Object (finish) initializeClass java/lang/String (finish) initializeClass java/lang/StringUTF16 (start) initializeClass java/lang/Object (start) initializeClass java/lang/Object (finish) invoking static method with return type UnitRepr initializeClass java/lang/StringUTF16 (finish) invoking static method with return type MaybeRepr (ReferenceRepr (RecursiveRepr JVM_object [])) putfield "java/lang/AbstractStringBuilder.value" putfield "java/lang/AbstractStringBuilder.coder" putfield "java/lang/AbstractStringBuilder.value" putfield "java/lang/AbstractStringBuilder.coder" invoke: java/lang/Class/desiredAssertionStatus finish invoke:java/lang/Class/desiredAssertionStatus initializeClass java/lang/StringUTF16 (start) initializeClass java/lang/Object (start) initializeClass java/lang/Object (finish) invoking static method with return type UnitRepr initializeClass java/lang/StringUTF16 (finish) initializeClass java/lang/StringUTF16 (start) initializeClass java/lang/Object (start) initializeClass java/lang/Object (finish) invoking static method with return type UnitRepr initializeClass java/lang/StringUTF16 (finish) invoking static method with return type BVRepr 32 initializeClass java/lang/StringUTF16 (start) initializeClass java/lang/Object (start) initializeClass java/lang/Object (finish) invoking static method with return type UnitRepr initializeClass java/lang/StringUTF16 (finish) initializeClass java/lang/StringUTF16 (start) initializeClass java/lang/Object (start) initializeClass java/lang/Object (finish) invoking static method with return type UnitRepr initializeClass java/lang/StringUTF16 (finish) initializeClass java/lang/StringUTF16 (start) initializeClass java/lang/Object (start) initializeClass java/lang/Object (finish) invoking static method with return type UnitRepr initializeClass java/lang/StringUTF16 (finish) initializeClass java/lang/StringUTF16 (start) initializeClass java/lang/Object (start) initializeClass java/lang/Object (finish) invoking static method with return type UnitRepr initializeClass java/lang/StringUTF16 (finish) Abort due to false assumption: internal: error: in java/lang/StringUTF16. null dereference In java/lang/StringUTF16. at internal When calling In java/lang/AbstractStringBuilder. at internal When calling In java/lang/StringBuffer. at internal When calling In Str.main at internal When calling main [Crux] Attempting to prove verification conditions. [Crux] Simulation complete. [Crux] Found counterexample for verification goal [Crux] Goal status: [Crux] Total: 1 [Crux] Proved: 0 [Crux] Disproved: 1 [Crux] Incomplete: 0 [Crux] Unknown: 0 [Crux] Overall status: Invalid. ```

A null dereference! I couldn't tell at a glance what was causing it, but it's quite possible that this arises due to other hard-coded assumptions in crucible-jvm that no longer hold in modern JDKs.

In light of this, I'm forced to admit that there's still quite a bit of investigation that needs to happen to diagnose why crucible-jvm topples over here. Unfortunately, I don't know if I'll have the time to do so myself, which raises a question of what to do with the current states of GaloisInc/crucible#638 and #1046.

Here is one proposal: given that this works well enough with the sorts of Java programs that SAW includes in its test suite, we could merge GaloisInc/crucible#638 and #1046 now, but explicitly label support for JDK 9+ as experimental in the SAW/Crucible documentation. (We may even consider printing a warning when a user launches these tools with JDK 9+.) We could then open a separate issue in the crucible repo to track the remaining issues, including the ones described above.

Does this sound reasonable? Or do you think we should hold off until more effort has been spent debugging these issues?

atomb commented 3 years ago

Thanks for the detailed analysis!

My inclination would be to go ahead and merge this (with appropriate warnings about JDK 9+). This is at least a strict improvement over what's in there right now. Most of the sort of verification we normally do with SAW will then work with newer JDKs, but stuff using more of the standard library still won't. That's better than not being able to do anything with modern JDKs.

RyanGlScott commented 3 years ago

In that case, I've:

Opened GaloisInc/crucible#641 on the crucible side to track the remaining loose ends,
Unmarked GaloisInc/crucible#638 (which adds the necessary scaffolding to crucible-jvm) as a draft, which is now ready for review. I've also added a warning that prints out if you use JDK 9+ or later, which could serve as a template for a similar warning for SAW in #1046. This is now done.
Opened #1072 for the task of making it easier to test multiple versions of the JDK in saw-script's CI.

RyanGlScott commented 3 years ago

For the sake of posterity, it's worth recording that my earlier claim that the test suite passes with JDK 15 is not quite true. The test_crucible_jvm test case fails thusly:

[16:18:15.313] Loading file "/home/rscott/Documents/Hacking/Haskell/saw-script/intTests/test_crucible_jvm/teststr_crucible.saw"
[16:18:15.314] **loading TestStr
[16:18:15.326] **Extracting methods
extracting main
[16:18:15.365] resolveField: Field FieldId {fieldIdClass = ClassName "java/lang/Object", fieldIdName = "value", fieldIdType = char[]} not found
CallStack (from HasCallStack):
  error, called at src/Lang/Crucible/JVM/Translation/Monad.hs:61:15 in crucible-jvm-0.2-inplace:Lang.Crucible.JVM.Translation.Monad
  jvmFail, called at src/Lang/Crucible/JVM/Translation/Class.hs:836:23 in crucible-jvm-0.2-inplace:Lang.Crucible.JVM.Translation.Class

GaloisInc / saw-script

Modern Java support #861