Closed robdockins closed 3 years ago
I've been looking into this recently, and I thought I'd record some of my findings that others may find interesting:
$JAVA_HOME/jre/lib/rt.jar
, JDK 9+ now includes them in $JAVA_HOME/lib/modules
, which is a JIMAGE file. Unlike JAR files, which are essentially just .zip
files in disguise, the JIMAGE file format is intended to be internal to the JDK, and as a result, it is subject to change between different Java releases. As a result, reading the contents of JIMAGE files is discouraged in favor of using the jimage
utility to read it.How fully featured is the jimage
utility? Judging from the --help
text, it is capable of two things which are of interest to SAW:
extract
: Extract entries in a JIMAGE file to a particular directory.list
: Print the names of all entries in a JIMAGE file.At first glance, it would seem like this is all we need. We can use jimage list
to compute the names of all classes that we might use in our program, and whenever SAW needs to load a particular class, it can do so with a command similar to the following:
$ jimage extract --include /<module-name>/<class-name>.class --dir <temporary-directory> <jimage-file>
The problem is that compared to lazily loading classes from JAR files, jimage extract
is much slower. As one microbenchmark, I compared the time it takes to extract a single class (java.lang.Class
) from OpenJDK 8's rt.jar
versus OpenJDK 11's modules
file. There isn't a straightforward way to distill the lazy loading code from SAW for this microbenchmark, so I approximated it with the unzip
command, which supports extracting individual files from archives. Here are the results:
$ time unzip -j rt.jar java/lang/Class.class -d rt-contents
Archive: rt.jar
extracting: rt-contents/Class.class
real 0m0.018s
user 0m0.011s
sys 0m0.007s
$ time jimage extract --include /java.base/java/lang/Class.class --dir modules-contents modules
real 0m0.256s
user 0m0.694s
sys 0m0.028s
The jimage extract
method is over 14× slower than extracting the same class from a JAR file. Oof! I suspect the reason this is the case is because the --include
flag takes the name of a pattern to match, so even though it will ultimately produce a single .class
file at the end, it still has to scan through every file in the JIMAGE file in order to check each file's name against the pattern. For a single class, this takes less than a second overall, but I imagine that this time will quickly add up if you load many classes consecutively.
If jimage extract
is too slow, is there a faster approach? One possible avenue of exploration is to incorporate the jlink
tool into SAW. jlink
supports the creation of customized runtime environments that only include the dependencies that are used in a particular application.
At first, I was hopeful that I could use jlink
to produce a runtime environment that only contains the class dependencies that are used directly by the things in SAW's classpath. Unfortunately, that appears to be not quite the case, as jlink
's smallest unit of granularity is not classes, but rather modules. Modules are another new feature of Java 9 that allow greater control over how code dependencies are managed. As a first approximation, a module can be thought of as a collection of classes, plus metadata which states what other modules they depend on and what modules are allowed to use it. In addition to a classpath, which specifies the locations of .class
files and JARs, Java 9+ now features a modulepath that specifies the locations of modules.
The end result is that if you want to use jlink
to create a custom runtime environment that includes a certain class, you'll have to include the entirety of the module that defines it. Does this help improve performance? A little bit, but not as much as I would have hoped. As one microbenchmark, I used jlink
to create a custom runtime environment that includes java.lang.Class
. In order to do this, I needed to use the jdeps
utility to determine what module it lives in first so that I could then pass it to jlink
:
$ jdeps --print-module-deps Class.class
java.base
$ jlink --no-header-files --no-man-pages --add-modules java.base --output test
$ cd test/lib/
$ time jimage extract --include /java.base/java/lang/Class.class --dir modules-contents modules
real 0m0.162s
user 0m0.316s
sys 0m0.020s
That's certainly an improvement over extracting java.lang.Class
from the whole modules
file, but it's about 9× slower than lazy loading java.lang.Class
from rt.jar
. Bummer.
This all makes me wonder if lazy loading is the right approach here. Perhaps we should just load all of the .class
files from a JIMAGE file once upfront and then use them for the duration of a SAW script? For example, here is the time it takes to extract all of the files in a custom runtime environment that only includes java.base
:
$ time jimage extract --include regex:.*.class --dir modules-contents modules
real 0m0.484s
user 0m0.919s
sys 0m0.245s
This is about 3× slower than loading a single class from the same JIMAGE file, and since java.base
contains way more than 3 classes, this approach would save time overall, assuming that the SAW script actually uses more than 3 of those classes. It's still unclear to me how the performance of this approach compares to the lazy loading approach currently employed by SAW, however.
One middle ground would be to use the jimage extract
command above, but then to still run the Haskell class file parser lazily. The Haskell code is the slowest part of the system, and the most memory intensive, so it's the part we want to be careful about running when it's not necessary.
You raise a good point. In my mind, I was lumping together the processes of (1) uncompressing the archive and (2) parsing the classes themselves, but they're really separate things. If the cost of (2) outweighs the cost of (1), then it might not be as vital to optimize (1) (which is what https://github.com/GaloisInc/saw-script/issues/861#issuecomment-756889807 benchmarks) as I originally thought.
I suspect the key thing is to control the amount of memory usage, which is more important than the time required.
My inclination would be to use jlink
to do all the dependency resolution/gathering from within the Java ecosystem, and then extract the result.
Upon further thought, the jdeps
-followed-by-jlink
approach is trickier than I imagined it would be. This is because in order to fully specify the modules that jdeps
needs for dependency resolution, it needs to know all the .class
files that are used in a program. SAW scripts, however, don't specify .class
files ahead of time, as they can be specified at runtime using java_load_class
. This means that if we wanted to use jdeps
followed by jlink
to create a minimal JRE from which to extract classes to load, then in the worst case, we'd have to recreate this JRE each time a script invokes java_load_class
. After all, each class could introduce new dependencies that previously encountered classes did not use!
I don't think we should take the current SAW interface as given. If we need to force changes to the way JVM code is loaded into SAW to make things work better, I think that should be on the table.
OK, here's one idea. As SAW is currently designed, we already have to specify the JARs and classpaths that a script needs ahead of time using the --jars
and --classpath
flags, respectively. When a script starts, no meaningful work is done with the classpaths (besides storing it for later use). Later, when SAW attempts to load a class it hasn't seen before, only then will it search through the classpaths to see if there is a match.
Instead of doing this, when SAW starts a script using modern Java, SAW can take all of the specified JARs and classpaths, pass them jdeps
, and then use that information to jlink
together a suitable JRE. This JRE will be extracted to a temporary directory and then can be consulted whenever a new class needs to be loaded.
The upside is that we only need to traverse the classpaths once with this approach. The downside is that this process may take longer than is strictly required if the user specifies more classpaths than what are actually needed to run the classes they care about, but I don't think much can be done about that, save for advising users not to over-approximate their dependencies.
I like the approach you're suggesting, @RyanGlScott. I think at the very least it's worth trying and benchmarking to see if it has any problematic overhead in common cases.
Some further investigation reveals that jlink
is not the silver bullet I was hoping it would be, for two reasons:
jlink
would produce a JIMAGE file that contained all of the classes used in an application. Unfortunately, that is not the case. jlink
will bundle all of the classes that live in named modules, but any classes that do not live in named modules will be excluded. As a result, one will still need to rely on traditional classpath-based resolution in order look up any classes that aren't included in the JIMAGE file. Since most Java classes in the wild don't yet use modules (including most of the examples in the saw-script
repo itself), that means the old ways of resolving classpaths aren't going away time soon.OK, so we'll still have to look up things in classpaths. Bummer. But looking up classes in the jlink
-produced JIMAGE file should be quick, right? Well, perhaps not as quick as I would have expected. Recall from https://github.com/GaloisInc/saw-script/issues/861#issuecomment-756889807 that when you jimage extract
a JIMAGE file, the resulting directory structure will look like this template:
<extracted-jimage-path>/<module-name>/<class-name>.class
I was originally hopeful that when we look up a class that lives in a module, we could do so quickly by exploiting the uniquely determined directory structure above. But here's the catch: a ClassName
, to the best of my knowledge, only contains the <class-name>
part and not the <module-name>
part of the template above. Moreover, it's not clear to me how to determine the name of a class's module by simply parsing the corresponding .class
file, which is what SAW does. (I've tried reading the JVM specification for the .class
file format, but I wasn't able to see an obvious way to do this.)
As a result, extracting a JIMAGE file is effectively tantamount to adding some extra directories onto the classpath, as we must search through each <module-name>
directory in order to look up any classes that are a part of the JIMAGE file. (There's some unresolved questions about the best way to faithfully emulate what Java does here. For example, should classes from JIMAGE files on the module path be looked up before classes from the classpath? After? I wasn't able to find any documentation on this point.)
Although jlink
isn't a silver bullet, the silver lining is that we might not need to change SAW's Codebase
type (responsible for loading .class
files) at all. In light of the observations above, loading class files from a JIMAGE file is basically a slightly more convoluted version of what Codebase
already does: add some directories to the classpath and consult those whenever classes are loaded for the first time. The only new thing that SAW would have to do is use jdeps
/jlink
/jimage
to produce the thing that gets added to the classpath.
I have good news and bad news. tl;dr I have a working prototype, but there are some serious performance issues that we should consider carefully.
The good news is that I've pushed a proof-of-concept implementation of the "use jdeps
/jlink
/jimage
to produce the thing that gets added to the classpath" idea from https://github.com/GaloisInc/saw-script/issues/861#issuecomment-762332870 to the wip/T861
branch. The key change is the introduction of a new --java-bin-dirs
flag, which allows specifying the location(s) where jdeps
, jlink
, and jimage
live. With this, I am able to run SAW on some examples in the SAW repo with Java 11.
The bad news is that the startup overhead imposed by this approach is quite severe. I was worried about jimage extract
taking a long time, but the time that jimage
takes is minuscule compared to jlink
. Running jlink
on even a tiny example that only uses java.base
takes at least 3 seconds! To be fair, SAW is only using a small portion of the resulting JIMAGE file (which is about 500 megabytes total), but even passing flags such as --no-header-files --no-man-pages
to reduce the size a bit doesn't make much of a difference.
Another annoying problem is that we extract the contents of the JIMAGE file to a temporary directory which persists after running SAW. This leaves behind about 90 megabytes of cruft each time SAW is invoked. For a single invocation, this isn't really noticeable, but if you were to invoke SAW many times consecutively, I could foresee this adding up quickly. In fact, I'm too afraid to run the test suite at the moment, since I'm worried that I'll chew through too much disk space...
The temporary directory size problem is likely solvable if we make sure to clean up the temporary directory that SAW extracts JIMAGE files to after SAW stops execution. (It's not yet clear to me what the best way to do this is, since SAW has various directions the control can flow in, but I'm sure there's a way we could accomplish this with something bracket
-like.) The jlink
overhead problem, however, is something I don't have an answer for. Ultimately, we don't have much control over jlink
's performance, and since the JIMAGE format is internal to the JDK, we don't have many other alternatives for creating them.
If the bad news from https://github.com/GaloisInc/saw-script/issues/861#issuecomment-762464241 has you frightened, perhaps we should consider some alternatives. Some ideas that come to mind:
jlink
at all, and just use jimage extract
on the monolithic JIMAGE file shipped with Java that includes all of the standard library. This is still slow—about 1.5 seconds on my machine—but that's still quicker than using jlink
beforehand. On the other hand, it would exacerbate the space requirements, as the extracted contents would now take up about 460 megabytyes of space.Although the JIMAGE file format is internal to the JDK (and subject to change in the future), it turns out that the JDK does include a standalone library called libjimage
. (This is surprisingly hard to find, and I only learned about this library's existence through some painfully specific Google searches.) libjimage
is to JIMAGE as libzip
is to ZIP, and to my surprise, the API that libjimage
provides actually includes documentation in the form of comments (although I had to trawl through the OpenJDK source code to find it).
If my reading of the source code for libjimage
is accurate, then you can use a combination of the JIMAGE_FindResource
and JIMAGE_GetResource
functions to retrieve individual classes from a JIMAGE file. I have no idea how fast that would be, but given that the JDK itself uses this API, it's likely to be reasonably quick.
If we wanted to, SAW could link against this library and use it to retrieve classes instead of relying on the jimage
utility. This, of course, would involve more work (making GHC link against C code is uniquely challenging), but a more fundamental question is: how stable is the libjimage
API? Honestly, I have no idea. It's possible that it could change in the future, but one could hope that the API—which is fairly high-level as far as C goes—changes less frequently than the underlying file format. If it's any reassurance, there exists a Rust library that provides bindings to libjimage
, so at least one other person thought this was a good idea.
Based on the above, I'm somewhat inclined to skip jlink
entirely and depend mostly on jimage
. It's somewhat slow, but perhaps not prohibitively so. And there is a flag to limit which classes it exports. For example, this works for me:
jimage extract --include "regex:.*/java/lang/Object.class,regex:.*/java/lang/String.class" $JAVA_HOME/lib/modules
The exact syntax of the string passed to --include
doesn't seem to be documented, so I figured out the above by trial and error. At the very least, the above could be used on a class-by-class basis. Extracting a single class on my machine takes about 0.3s, which isn't ideal but perhaps not the end of the world.
One detail of the above is that it requires knowing where the modules
file is, just like we need to with rt.jar
right now. However, I think that there are some more reliable ways to do that than what we do right now. I wonder if the following will work:
JAVA_HOME
is set, look for either $JAVA_HOME/lib/modules
or $JAVA_HOME/jre/lib/rt.jar
. Only one of those files will exist, and which exists tells us which class loading approach we need to use.JAVA_HOME
is not set, try running java -XshowSettings
and look for java.home = ...
in the result.jimage extract
will use this directory as the argument to jimage extract --dir=...
.jimage
once per class is too slow, jdeps -v <classfile or jarfile>
will identify all classes that class or JAR depends on once and for all, so we can extract them all at once.Does that sound reasonable to you? Am I missing anything?
Ah, I really should have listed your suggestion—use jimage
to extract one class at a time—as alternative (3). I suspect that that approach will still be noticeably slower, since even simple Java programs depend on many classes transitively from the standard library, so even 0.3s of overhead per class is likely to add up fast. But I don't yet have exact numbers to back up this claim, so I should try implementing alternative (3) first.
A few refinements to your suggestions:
One detail of the above is that it requires knowing where the
modules
file is, just like we need to withrt.jar
right now. However, I think that there are some more reliable ways to do that than what we do right now.
I think that if we want to do this The Right Way™, we should emulate Java's module path. As the name suggests, the module path is where Java looks for modules (in the form of JIMAGE files), in the exact same way that the classpath is where Java looks for classes (in the form of JAR and .class
files). Most Java tools allow specifying the module path with the -p
/--module-path
flags, and since SAW already tracks its own paths for .class
and JAR files via the -c
and -j
flags, respectively, it seems like a natural addition for SAW to have its own -p
flag as well.
If SAW were to gain a -p
flag, then we could indicate where the standard modules
file is as a command-line argument, much in the same way that we indicate where rt.jar
lives as an argument to -j
. (Granted, we will also need to indicate to SAW where jimage
is located, so we could consider taking advantage of this to automatically detect where the standard modules
file lives, but it's not strictly required.) More importantly, this would allow SAW to support other Java libraries that are distributed as JIMAGE files.
- If we're working with newer JDKs, create an empty temporary directory on startup to store class files in that we delete on exit. Every call to
jimage extract
will use this directory as the argument tojimage extract --dir=...
.
If we only ever extract a single class at a time, then I don't think we need to share a single temporary directory across all jimage extract
invocations. After all, we only need to extract a class when we load it for the first time, and once we load it, SAW caches the resulting Class
file in its CodebaseState
. As a result, we can extract a class to a temporary directory, load it into the CodebaseState
cache, and delete the temporary directory all in one fell swoop. This avoids the need to ever track temporary directory locations throughout the duration of a single SAW invocation.
- Our treatment of this new directory will be different than a normal class path entry. Every subdirectory of this directory becomes a class path entry. (And new subdirectories may show up at any point, so we'll need to check every time we look there.)
As mentioned above, we can avoid ever adding the results of jimage extract
to the classpath if we simply load one class at a time. There is an interesting question, however, of what order we should look up classes in. Currently, we first looks in the JAR paths (specified by -j
) followed by the classpaths (specified by -c
). Where would the module paths (specified by -p
) fit into this picture? Honestly, I'm not sure, and I couldn't find any documentation on the order in which Java itself does this. (There's this, but I couldn't find a modern version of this that covers Java 9+.) If loading classes from JIMAGE files is the slow path, we might have to consult JIMAGE files last out of sheer necessity.
- If using
jimage
once per class is too slow,jdeps -v <classfile or jarfile>
will identify all classes that class or JAR depends on once and for all, so we can extract them all at once.
One correction here: if you don't use jlink
to combine all of the modules from your module path into a single JIMAGE file, then in general, it's not possible to extract all classes you need in one go. That is, you'd need to invoke jimage extract
on every module in your module path in order to be sure that you've extracted all classes. (For most programs, the module path will only consist of the standard modules
file, but this is a distinction worth making nonetheless.)
This is basically the same approach that I tried in https://github.com/GaloisInc/saw-script/issues/861#issuecomment-762464241, but without the intermediate jlink
step. As a result, there is still the issue of leaving behind a lot of jimage extract
files lingering on a temporary directory after SAW has finished execution. This is likely solvable with enough clever engineering, but I wanted to mention this to contrast with alternative (3), which can solve this issue in a more direct way (see my comments above after "[...] create an empty temporary directory on startup [...]").
Also, thanks for teaching me about jdeps -v
—I was under the impression that jdeps
only listed module dependencies, but it can list class dependencies too! Does this actually list all of the classes that an application transitively depends on, however? For example, if I run jdeps -v
on ecdsa.jar
from SAW's examples
directory, I get:
I strongly suspect that this omits many classes that this would transitively depend on (e.g., java.lang.Class
).
Yeah, I think I agree with you about the Right Way of adding a notion of module path in addition to the existing class path. One thing I'd really like, however, would be for it to be possible for it to "just work" in many cases without user input. The need to specify the location of rt.jar
in the current version tends to be quite painful. Respecting JAVA_HOME
and attempting to use java
to find the home if it's not set seems like a significant usability improvement.
You make a really good point about not needing to keep .class
files on disk if we're loading them on-demand. Making "extract + load + delete" a self-contained step will probably make it less likely that we leave stray class files sitting around.
And you may be right that the output of jdeps -v
only includes direct dependencies. It might be possible to then feed those dependencies into jdeps
again until we reach a fixpoint, though that could be slow. Maybe worth trying, though? If we can get a list of all dependencies, my initial test indicated that extracting two class files from a JIMAGE file took about the same time as extracting one, so extracting 20, for example, might not take much longer.
One final point is that, although I think it's worth having support for multiple JAR or JIMAGE dependency files, the vast, vast majority of the cases are likely to involve only the single modules
file included with the JDK, so optimizing for that case seems to make sense. Although older JDKs included JAR files beyond rt.jar
, we've never done a verification that would depend on them, and I expect we never will. Given that the modules
JIMAGE file contains even more stuff, I'd say that probably will be even more true for JDK versions >=9.
My inclination for search order would be to first search the module path, then JAR files, then classpath directories, though I admit I haven't given it that much thought.
One thing I'd really like, however, would be for it to be possible for it to "just work" in many cases without user input. The need to specify the location of
rt.jar
in the current version tends to be quite painful. RespectingJAVA_HOME
and attempting to usejava
to find the home if it's not set seems like a significant usability improvement.
I agree. It may be worth adding JAVA_HOME
support independently of the effort to support modern Java, since I think that's mostly a matter of setting up some environment variable–related circuitry in the right places. Once the circuitry is installed for SAW's classpath, wiring it up for the module path should be straightforward.
And you may be right that the output of
jdeps -v
only includes direct dependencies. It might be possible to then feed those dependencies intojdeps
again until we reach a fixpoint, though that could be slow. Maybe worth trying, though? If we can get a list of all dependencies, my initial test indicated that extracting two class files from a JIMAGE file took about the same time as extracting one, so extracting 20, for example, might not take much longer.
Iterating jdeps -v
to a fixpoint might be trickier than it sounds. As an experiment, I added an extra putStrLn
in SAW to indicate whenever a class is loaded for the first time. With this, I was able to see how many classes end up getting loaded when running the relatively small ffs_java.saw
example:
[17:45:53.659] Loading file "/home/rscott/Documents/Hacking/Haskell/saw-script/doc/tutorial/code/ffs_java.saw"
[17:45:53.659] Extracting reference term
RGS tryLookupClass: Loading FFS
extracting ffs_ref
RGS tryLookupClass: Loading java/lang/Class
RGS tryLookupClass: Loading java/lang/String
RGS tryLookupClass: Loading java/io/BufferedOutputStream
RGS tryLookupClass: Loading java/io/FilterOutputStream
RGS tryLookupClass: Loading java/io/OutputStream
RGS tryLookupClass: Loading java/io/PrintStream
RGS tryLookupClass: Loading java/lang/Object
RGS tryLookupClass: Loading java/lang/System
RGS tryLookupClass: Loading java/lang/StringIndexOutOfBoundsException
RGS tryLookupClass: Loading java/lang/Exception
[17:45:53.705] Extracting implementation term
extracting ffs_imp
[17:45:53.711] Proving equivalence
[17:45:53.760] Done.
On the other hand, jdeps -v
is very stubborn and only reports one of these classes:
$ jdeps -v FFS.class
FFS.class -> java.base
FFS -> java.lang.Object java.base
I hunted around through jdeps
' command-line options, but I couldn't find anything that made it print more information than that. The question now becomes: how do we feed java.lang.Object
back into jdeps
? The only way I can think of would be to jimage extract
it out, point jdeps
at the location where it was extracted, and repeat. But at that point, we're basically extracting classes more-or-less one at a time—or at the very least, at a much finer granularity than "all at once".
One final point is that, although I think it's worth having support for multiple JAR or JIMAGE dependency files, the vast, vast majority of the cases are likely to involve only the single
modules
file included with the JDK, so optimizing for that case seems to make sense. Although older JDKs included JAR files beyondrt.jar
, we've never done a verification that would depend on them, and I expect we never will. Given that themodules
JIMAGE file contains even more stuff, I'd say that probably will be even more true for JDK versions >=9.
Even if we never do end up using a JAR or JIMAGE file beyond the ones shipped with Java itself, I'll sleep a little easier at night knowing that we could do so in the future if the need arose :)
I spoke with @atomb about this recently. Here are some takeaways:
libjimage
approach, while tempting, is a non-starter due to the fact that it would require linking against C++ code. I'm still curious about the idea of investigating precisely what goes wrong (and if there are perhaps GHC bugs in our way), so I've opened GaloisInc/crucible#621 for this task.libjimage
is out of the question, then we don't have many alternatives besides using the utilities that the JDK ships with. Unfortunately, just about every combination of jdeps
/jlink
/jimage
that I've tried imposes some non-trivial performance overhead. However, that doesn't necessarily mean that using these tools is infeasible—we would just need to figure out a way to amortize the cost of using these tools across multiple SAW invocations.How, indeed, should we amortize the cost? The most expensive part of using jdeps
/jlink
/jimage
is the initial creation/extraction of the .class
files that SAW needs. Once that is done, however, subsequent invocations of SAW could, in theory, re-use the resulting .class
files without needing to go through jdeps
/jlink
/jimage
again. This means that the first time you invoke SAW with a newly encountered JIMAGE file will be slow, but invoking SAW afterwards with the same JIMAGE file should reasonably fast.
Some questions:
jdeps
/jlink
/jimage
? One possibility is to use XDG_CACHE_HOME
(per the XDG Base Directory Specification), which would default to ~/.cache
on non-Windows systems or %LOCALAPPDATA%
on Windows. I'm not sure if SAW is already using XDG directories, but I don't think this is the case.At what granularity should we extract things from JIMAGE files? We've discussed several possibilities above:
I suppose the answer to this question depends on how much space you're willing to devote to a cache. The most likely scenario is that the only JIMAGE file you'll be using is the standard modules
file, which consists of about 447 megabytes of class files when extracted. On the other hand, the java.base
module in particular (which is the only thing that most simple Java programs use) has about 88 megabytes of class files when extracted. Perhaps the granularity we choose should depend on how much space people want to cache.
One correction to my musings above: the module path actually isn't where one specifies the locations of JIMAGE files. Rather, it's the location where Java searches for JARs and .class
files for modularized code. In fact, I'm not exactly sure how you specify the location of JIMAGE files to Java! The answer may very well be "you don't", as the Java developers likely intended for jlink
to completely hide away the details of JIMAGE files from end users.
This is arguably a good thing from SAW's point of view, as it means that we can avoid having a separate command-line flag for JIMAGE files at all. Instead, you point SAW at the directory where java
lives and it will take care of the rest. This does mean that #1030 will now be a requirement in order to make this work, rather than just a quality-of-life improvement that makes it easier to invoke SAW.
I've implemented the caching idea from https://github.com/GaloisInc/saw-script/issues/861#issuecomment-763696924 in #1046, operating at the granularity of extracting one class at a time from modules
. This works well enough for the examples I've tried it on, and once the classes are cached, running SAW with JDK 11 is just as fast as with JDK 8.
The PR isn't ready to be merged yet, as there are some tasks that it is blocked on (#993 and #1030), but I'd welcome any feedback you could provide.
See GaloisInc/crucible#638 for the crucible-jvm
–related changes.
Once again, I have good news and bad news.
The good news is that I've factored out basically all of the necessary functionality for reading from JIMAGE files into crucible-jvm
as part of GaloisInc/crucible#638. This is nice because the crucible-jvm
executable in crucible
(which perhaps ought to be called crux-jvm
instead) can also benefit from caching JIMAGE files—in fact, crucible-jvm
and saw
can quite literally share the same cache on the same machine. Moreover, I've demonstrated in #1046 that the SAW test suite passes with JDK 15, the latest release at the time of writing.
The bad news is that despite all this, GaloisInc/crucible#638 is actually quite broken still. I only realized this after trying to run the crucible-jvm
test suite, which consists of a single .java
file that imports String
. As it turns out, crucible-jvm
chokes really badly on String
with modern JDKs. What follows is a tale of woe:
As observed here, running the crucible-jvm
test suite with a modern JDK will fail like this:
@atomb pointed out that this was because crucible
's jvm-parser
submodule, at the time of writing, doesn't include a fix for GaloisInc/saw-script#920. After bumping the submodule to commit https://github.com/GaloisInc/jvm-parser/commit/d440e6bc91c9ba7d48f04e3a7cbf3086d9720b1e, which contains a fix for that issue, I instead get a different error when running the crucible-jvm
test suite:
This time, the problem lies in the fact that the API for the String
class has changed. In JDK 8, the private value
field in String
has type char[]
, but in recent JDKs, such as JDK11, its type has changed to byte[]
. crucible-jvm
, on the other hand, hard-codes the assumption that value
is of type char[]
here:
As a quick experiment, I tried to see what would happen if I updated this assumption to use the modern JDK's type for value
:
diff --git a/crucible-jvm/src/Lang/Crucible/JVM/Translation/Class.hs b/crucible-jvm/src/Lang/Crucible/JVM/Translation/Class.hs
index 3c09eb72..821ab6cd 100644
--- a/crucible-jvm/src/Lang/Crucible/JVM/Translation/Class.hs
+++ b/crucible-jvm/src/Lang/Crucible/JVM/Translation/Class.hs
@@ -911,7 +911,7 @@ refFromString s = do
-- We'll want want to REVISIT this in the future.
obj1 <- setInstanceFieldValue
obj
- (J.FieldId "java/lang/String" "value" J.charArrayTy)
+ (J.FieldId "java/lang/String" "value" J.byteArrayTy)
(RValue (App (JustValue knownRepr arrRef)))
obj2 <- setInstanceFieldValue
This appeared to fix the previous issue, but it uncovered another issue:
A null dereference! I couldn't tell at a glance what was causing it, but it's quite possible that this arises due to other hard-coded assumptions in crucible-jvm
that no longer hold in modern JDKs.
In light of this, I'm forced to admit that there's still quite a bit of investigation that needs to happen to diagnose why crucible-jvm
topples over here. Unfortunately, I don't know if I'll have the time to do so myself, which raises a question of what to do with the current states of GaloisInc/crucible#638 and #1046.
Here is one proposal: given that this works well enough with the sorts of Java programs that SAW includes in its test suite, we could merge GaloisInc/crucible#638 and #1046 now, but explicitly label support for JDK 9+ as experimental in the SAW/Crucible documentation. (We may even consider printing a warning when a user launches these tools with JDK 9+.) We could then open a separate issue in the crucible
repo to track the remaining issues, including the ones described above.
Does this sound reasonable? Or do you think we should hold off until more effort has been spent debugging these issues?
Thanks for the detailed analysis!
My inclination would be to go ahead and merge this (with appropriate warnings about JDK 9+). This is at least a strict improvement over what's in there right now. Most of the sort of verification we normally do with SAW will then work with newer JDKs, but stuff using more of the standard library still won't. That's better than not being able to do anything with modern JDKs.
In that case, I've:
crucible
side to track the remaining loose ends,crucible-jvm
) as a draft, which is now ready for review. I've also added a warning that prints out if you use JDK 9+ or later, which could serve as a template for a similar warning for SAW in #1046.saw-script
's CI.For the sake of posterity, it's worth recording that my earlier claim that the test suite passes with JDK 15 is not quite true. The test_crucible_jvm
test case fails thusly:
[16:18:15.313] Loading file "/home/rscott/Documents/Hacking/Haskell/saw-script/intTests/test_crucible_jvm/teststr_crucible.saw"
[16:18:15.314] **loading TestStr
[16:18:15.326] **Extracting methods
extracting main
[16:18:15.365] resolveField: Field FieldId {fieldIdClass = ClassName "java/lang/Object", fieldIdName = "value", fieldIdType = char[]} not found
CallStack (from HasCallStack):
error, called at src/Lang/Crucible/JVM/Translation/Monad.hs:61:15 in crucible-jvm-0.2-inplace:Lang.Crucible.JVM.Translation.Monad
jvmFail, called at src/Lang/Crucible/JVM/Translation/Class.hs:836:23 in crucible-jvm-0.2-inplace:Lang.Crucible.JVM.Translation.Class
I discovered, while working on standing up a new computer, that the way the SAW infrastructure interacts with the Java ecosystem is by now quite outdated. Starting with Java 9, the JRE no longer distributes its base libraries in a monolithic
rt.jar
, but instead distributes these classes in a new 'JMOD' file format. This is part of an overall infrastructure philosophy change that introduces a new link phase and utilityjlink
, which gathers together all the transitive dependencies of an executable into a "linked" executable.The following is the best explanation of this system I've found https://stackoverflow.com/questions/44732915/why-did-java-9-introduce-the-jmod-file-format
For our purposes, I think it would actually makes our life simpler. As we do with with LLVM (via
clang
andllvm-link
), we can rely on the JDK utilities likejlink
andjimage
to collect together all the pieces we need into one place, rather than having to do complicated classpath handling, discovering where the system has installed various bits we need, etc.