MPh-py / MPh

Pythonic scripting interface for Comsol Multiphysics
https://mph.readthedocs.io
MIT License

Fatal error `EXCEPTION_ACCESS_VIOLATION` after many model solves #131

Closed AlecEmser closed 1 year ago

AlecEmser commented 1 year ago

I've written some code which uses an SHGO optimizer to feed back on the .csv output of a COMSOL file. The program is fairly simple: a COMSOL model is modified with parameters passed by the SHGO optimizer and then solved. A probe table is saved to a .csv file, which is read and analyzed to return a cost to the SHGO optimizer. This all runs very smoothly and converges without problems for small iteration numbers.
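Roughly, the loop looks like this (a minimal sketch only; the parameter names, file names, and the cost calculation are placeholders rather than my actual model):

```python
import csv

import mph
from scipy.optimize import shgo

client = mph.start()
model = client.load('model.mph')               # placeholder file name

def cost(x):
    # Pass the optimizer's parameters to the Comsol model.
    model.parameter('param1', str(x[0]))       # placeholder parameter names
    model.parameter('param2', str(x[1]))
    model.solve()
    # Export the probe table to a .csv file and read it back.
    model.export('probe_table', 'probes.csv')  # placeholder export node
    with open('probes.csv') as file:
        rows = list(csv.reader(file))
    # Reduce the table to a single scalar cost (placeholder).
    return float(rows[-1][-1])

result = shgo(cost, bounds=[(0, 1), (0, 1)])
```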

However, after several hours of running the program (~several hundred calls to model.solve()), I receive the following error, which kills the Python kernel:

```
A fatal error has been detected by the Java Runtime Environment:

  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00007ffe3a5b3a71, pid=22828, tid=4812

JRE version: OpenJDK Runtime Environment Temurin-11.0.13+8 (11.0.13+8) (build 11.0.13+8)
Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.13+8 (11.0.13+8, mixed mode, tiered, compressed oops, g1 gc, windows-amd64)
Problematic frame:
C  0x00007ffe3a5b3a71
```

The full output of this log is attached: hs_err_pid22828.log. I suspect that the issue is related to Java running out of non-class metaspace, as the allocations are always full when failure occurs:

```
Metaspace:

Usage:
  Non-class: 139.98 MB capacity, 138.16 MB (99%) used, 1.37 MB (<1%) free+waste, 453.19 KB (<1%) overhead.
  Class:     14.00 MB capacity, 12.93 MB (92%) used, 939.59 KB (7%) free+waste, 163.56 KB (1%) overhead.
  Both:      153.99 MB capacity, 151.09 MB (98%) used, 2.29 MB (1%) free+waste, 616.75 KB (<1%) overhead.

Virtual space:
  Non-class space: 142.00 MB reserved, 140.25 MB (99%) committed
  Class space:     1.00 GB reserved, 14.13 MB (1%) committed
  Both:            1.14 GB reserved, 154.38 MB (13%) committed
```

I have ensured everything in my Python distribution is up to date and have run the code in both Spyder and VSCode Jupyter, yet the issue persists. Similarly, I have tried both COMSOL 6.0 and 6.1. Assuming that the issue might be some memory leak associated with JPype, I've also attempted resetting the client (and hopefully the JPype connection?) each iteration, with both client.clear() and a more aggressive sequence of delete, garbage collect, and client restart. Neither has worked.

I have also attempted to tweak the -XX:MaxMetaspaceSize, -Xss, -Xms, and -Xmx parameters in the comsol.ini file under COMSOL\COMSOL61\Multiphysics\bin\win64\, but these values do not seem to correspond to the virtual space described in the log. The machine I am running on has 128 GB of RAM, so I am not afraid of allocating significantly more memory to the metaspace that appears to be filling up, if only I knew how.

Anyway, thank you for any insight into this problem. I am happy to share any more information which may be helpful.

john-hen commented 1 year ago

Hi. Thanks for the detailed report. I'm afraid though the insight I can provide will be rather limited.

First off, I hope you have a way to resume the optimization runs. When I did this in the past, also with hundreds of iterations, I would "cache" the input parameters and output results (like the .csv files in your case). So when it crashed, I wouldn't have to rerun the whole thing again.
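Something along these lines, as a sketch (the cache key and file format are just examples, and I'm assuming a cost(x) function that wraps the Comsol solve):

```python
import json
from pathlib import Path

cache_file = Path('results.json')
cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}

def cached_cost(x):
    # The rounded parameter vector serves as the cache key.
    key = json.dumps([round(float(value), 12) for value in x])
    if key not in cache:
        cache[key] = cost(x)                       # the expensive Comsol solve
        cache_file.write_text(json.dumps(cache))   # persist after every new result
    return cache[key]
```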

And there were crashes. I don't remember the specifics, maybe these access violations also happened occasionally. But there are many ways in which Comsol could crash. Like, if the disk fills up with temporary files.

Memory leaks are not necessarily JPype's fault; the cause could also be Comsol. In fact, in past issues here, Comsol was the culprit more often than not. Keep in mind that what you're doing is a rare use case: running that many solves within the same session. Comsol is arguably not battle-tested for this scenario. Most people do a few solves, whether that's in the GUI or in their custom Java-based application that uses Comsol as a back-end.

I've also attempted resetting the client (and hopefully the JPype connection?) each iteration with both client.clear() and a more aggressive sequence of delete, garbage collect, and client restart. Neither has worked.

How do you restart the client? I don't think that's possible. We cannot restart the Java VM within the Python process. So the only way to restart the client is to terminate the session, i.e. Python must exit.

I have also attempted to tweak the -XX:MaxMetaspaceSize -Xss -Xms -Xmx parameters located in the comsol.ini

That's a good point, I've never thought about this. (Also because these details are not well documented on Comsol's side.) But since we start the Java VM ourselves (via JPype), the settings in comsol.ini do not have any effect. I just tested that. We could of course parse the file and set the parameters. It's an option I will consider. But there are also settings in there which we should ignore, so I'm not sure how robust and future-proof that would be.

If you want to play around with the memory settings, you can start the session in client-server mode. That's something you should try anyway, maybe that alone solves it. Since you're on Windows, the default mode of operation is a stand-alone client. You can switch to client-server mode with mph.option('session', 'client-server'). Then the relevant configuration file is comsolmphserver.ini. And since we start the server as an external process, it should just read that file like it always does when started from the command line.
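In code, the switch is one extra line before the session starts (the core count and file name here are only examples):

```python
import mph

mph.option('session', 'client-server')   # must be set before mph.start()
client = mph.start(cores=4)              # starts the Comsol server, then connects to it
model = client.load('model.mph')
model.solve()
```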

I'm no expert on this, but I don't think memory is the issue here. Note that the metaspace is not nearly maxed out. What you see there is the "capacity", but that capacity should be dynamically increased if more is needed. The maximum metaspace size is also in the crash log: an astronomical 17179869184.00 GB.

I don't know what "virtual space" is. Again, not an expert on the Java VM. And yeah, the 99% (in both cases) is strange. Looks like the garbage collector should do something about that. But I don't see how that leads to an access violation. Which seems to occur in native code, and thus outside the Java VM.

AlecEmser commented 1 year ago

Hi, thanks so much for your expedient and detailed reply.

To respond to your points:

I hope you have a way to resume the optimization runs.

Indeed I'm caching my intermediate results.

I'm no expert on this, but I don't think memory is the issue here.

I suspect you are correct and that the seemingly-full memory space was only a red herring. Given the dynamic memory allocation, this must be the intended functionality.

You can start the session in client-server mode. That's something you should try anyway, maybe that alone solves it.

It seems like calling mph.option('session', 'client-server') does help: I'm no longer getting an exception access violation, but I'm now stochastically getting a java.lang.NullPointerException in java.run(), called from node.run(), in turn called from model.solve(). This doesn't kill the kernel, so it feels like an improvement. I suspect that this error is not the cause but rather a symptom of the client dying: after the error occurs, any function calls to the client like client.clear() or client.load() return com.comsol.util.exceptions.FlException: Not connected to a server. It seems like I can only fix the connection by terminating the session and re-initializing the client. If there were a way to restore the connection without restarting the kernel, I could add error handling to do this automatically, but I'm not sure if that's possible.

I should add that I've attempted to run this same code on a colleague's computer (also Windows) and it runs without issue. I've re-installed and updated every relevant piece of software on my machine, so my concern is that this is related to something deeper that I may not be able to solve.

Thanks again for your help, and thanks for your work on this package. It's really quite wonderful.

john-hen commented 1 year ago

I'm now stochastically getting a java.lang.NullPointerException in java.run() called from node.run() in turn called from model.solve(). This doesn't kill the kernel, so it feels like an improvement. I suspect that this error is not the cause but rather a symptom of the client dying -- after the error occurs any function calls to the client like client.clear() or client.load() return com.comsol.util.exceptions.FlException: Not connected to a server.

I think it means that the server died, not the client. Most likely for the same reason though. That wouldn't take the kernel down, which is the Python process that the Comsol client runs in. The Comsol server, on the other hand, runs in an external process, so the Python kernel wouldn't know if that other process is gone. The client then just reports that it can no longer connect to the server.

If there were a way to restore the connection without restarting the kernel then I could add error handling to do this automatically? I'm not sure if this is possible.

It's not. Sorry. This is a limitation of JPype. We cannot restart the Java VM, and thus Comsol client, without restarting the kernel, i.e. Python process. It's probably somehow technically possible that JPype could accomplish that, but there are also performance trade-offs to consider. (It is also a limitation of Comsol itself. It could just let us shut down the client without having to restart Java.)
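That said, if the goal is just to keep the optimization alive, the usual workaround (nothing MPh provides, just a sketch of the general idea with placeholder names) is to run each solve, or each batch of solves, in a separate Python process, so that every batch gets its own Java VM:

```python
from multiprocessing import get_context

def cost(x):
    # Placeholder: start a fresh client inside the child process, load the
    # model, set parameters, solve, and return the scalar cost.
    import mph
    client = mph.start()
    model = client.load('model.mph')
    model.parameter('param1', str(x[0]))
    model.solve()
    return 0.0

def isolated_cost(x):
    # Each call spawns a clean Python interpreter, and with it a fresh Java VM.
    context = get_context('spawn')
    with context.Pool(1) as pool:
        return pool.apply(cost, (x,))

if __name__ == '__main__':
    print(isolated_cost([0.5]))
```

The downside is that the model is loaded from scratch on every call, so in practice you would batch several evaluations per process. But it keeps the optimizer alive even when an individual solve takes the Java VM down.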

I should add that I've attempted to run this same code on a colleague's computer (also Windows) and it runs without issue. I've re-installed and updated every relevant piece of software on my machine, so my deep concern is that this is related to something deeper I may not solve.

That's great for your colleague. But you're not alone in facing these kinds of issues. I don't think we can solve them without some Comsol developer digging in. Unless it's related to resource allocation, as configured by the comsol.ini file mentioned above (which, again, we are thus far ignoring), there's nothing we can do on the Python side. The only way to bring this to Comsol's attention would be to replicate the issue using either Matlab or Java, and then report it to Comsol Support. Which is a lot of work. (Comsol should really publish their own "Python LiveLink". I'm all for it.)

Thanks again for your help, and thanks for your work on this package. It's really quite wonderful.

Thanks. Always great to hear. If you haven't already, consider starring the repo here on GitHub. It adds visibility. And... also feels rewarding.

john-hen commented 1 year ago

Closed, as this does not seem to be a problem with the Python code of MPh. As Java exceptions go, it is most likely a Comsol problem.