MPh-py / MPh

Pythonic scripting interface for Comsol Multiphysics
https://mph.readthedocs.io
MIT License

`OutOfMemoryError` when evaluating particle trajectories #132

Closed wacht02 closed 1 year ago

wacht02 commented 1 year ago

Hello! First let me thank you for your help with #128. I think that I can now run any model on the cluster as long as there is enough RAM available. I have now encountered a different problem which might be connected to #128.

Here is some general information: The cluster is running CentOS 7 and COMSOL 6.1.0.252 is installed. I am using Python 3.10.4 with MPh 1.2.3. The Windows machine I use has Windows 10 and also uses COMSOL 6.1.0.252, with Python 3.9.7 and MPh 1.2.3. I am doing particle tracing simulations and am interested in the trajectory of the particles. Therefore I want to export all time steps of the time-dependent study I am performing.

After I solve the model I want to evaluate some expressions using the model.evaluate() function, but it crashes with an error message similar to the one in #128:

Traceback (most recent call last):
  File "java.lang.Thread.java", line -1, in java.lang.Thread.run
com.comsol.util.exceptions.com.comsol.util.exceptions.UnexpectedServerException: com.comsol.util.exceptions.UnexpectedServerException: Out of memory on server.
java.lang.OutOfMemoryError: Out of memory.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "SourceFile", line 288, in com.comsol.clientapi.impl.NumericalFeatureClient.getReal
Exception: Java Exception

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/scripts/python/Read_COMSOL_to_Python.py", line 222, in <module>
    data = eval_basics(model = model, dataset = f'Particle {sys.argv[6]}', short = False)
  File "/home/scripts/python/Read_COMSOL_to_Python.py", line 114, in eval_basics
    px = model.evaluate(expression = 'qx', unit = 'mm', dataset = dataset, inner = inner, outer = outer)
  File "/home/test_env/lib/python3.10/site-packages/mph/model.py", line 591, in evaluate
    results = array(java.getReal())
com.comsol.util.exceptions.com.comsol.util.exceptions.FlException: Exception:
    com.comsol.util.exceptions.UnexpectedServerException: Out of memory on server.
java.lang.OutOfMemoryError: Out of memory.
    (rethrown as com.comsol.util.exceptions.FlException)
Messages:
    Out of memory on server.
java.lang.OutOfMemoryError: Out of memory.

As you can see, it again is a memory error. I implemented the changes discussed in #128 (changing the directory for temporary files and disabling saving of recovery files). I requested more than enough memory for the evaluation and can also see in the cluster's monitoring software that there is enough space available. On the Windows machine the script runs without a problem.

I also tried to lower the amount of memory needed by decreasing the number of particles in the simulation. This did eventually work for really small numbers, so it seems not to be a general issue with the program.

I also tried to first save the solved model and then evaluate it in a different job but that second job then crashes with the same error message.

If you can think of anything I should try next or need further information please let me know. I also asked the cluster support and am waiting for an answer.

Also a side question about the model.evaluate() method: I thought about splitting the evaluation up into smaller "batches" using the inner parameter. I looked into the source code and it seems to me that inner still runs the evaluation for all time steps and only returns the ones specified. Is that correct?

john-hen commented 1 year ago

Hi. This strikes me as a different problem though. In the other issue, it was lack of disk space. But here you're running out of RAM. So when you say that there is enough space available, which one do you mean?

If you do have enough RAM, then another possible bottleneck is Java's own memory management. When you evaluate the solution, it has to transfer that data into the "Java heap space". Personally, I've never seen an issue with that. Though I never ran simulations with more than 100,000 particles, I don't think.

The Java memory settings for the Comsol server instance can be configured in comsolmphserver.ini in the Comsol installation folder. Though on the cluster, you cannot change that file. I don't know if there is another way to configure these settings. But according to the Comsol documentation, that's rarely ever needed. If you only get it to work "for really small numbers" then I don't really think that's the issue. The default value is 2 GB, that fits many millions of floating-point numbers.
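The "many millions" claim is easy to check with back-of-envelope arithmetic (a quick sketch, assuming 8-byte double-precision values and ignoring the per-object overhead the JVM adds in practice):

```python
# Number of 8-byte doubles that nominally fit in a 2 GB Java heap.
# Real capacity is lower due to JVM array/object overhead.
heap_bytes = 2 * 1024**3        # 2 GiB
doubles = heap_bytes // 8
print(f'{doubles:,} doubles')   # 268,435,456 doubles
```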

I looked into the source code and it seems to me that inner still runs the evaluation for all time steps and only returns the ones specified. Is that correct?

Yes. That's Comsol terminology: "inner" solutions refer to the time parameter, "outer" solutions to any other parameter, i.e. in an actual parameter sweep. So the time parametrization is special.

wacht02 commented 1 year ago

Thanks for the reply.

So when you say that there is enough space available, which one do you mean?

By that I mean both disk space (to store temporary files and such) as well as RAM. I requested 20 GB for the job and it peaked at a usage of about 7 GB (seen in the cluster's monitoring software and also stated in the job report). So I don't think it is an actual memory issue.

I'll look into the "Java heap space", but to give you some numbers: the evaluation runs with 1,000 particles but already crashes with 2,000. I already evaluate one variable at a time, like

x = model.evaluate('cpt.px', 'mm')
y = model.evaluate('cpt.py', 'mm')

and so on. The solver does about 35,000 time steps. So that's already 35,000,000 values from one evaluation. Could this already exhaust the Java heap space? I find that weird since on the Windows machine it is set to 2 GB and there I can evaluate the simulation with 2,000 particles.

Yes. That's Comsol terminology: "inner" solutions refer to the time parameter, "outer" solutions to any other parameter, i.e. in an actual parameter sweep. So the time parametrization is special.

Maybe my question wasn't clear. I was thinking that model.evaluate() gets the values for all time steps regardless of whether an inner is specified. If you give a value for inner, it will then return only those specified values. So from a memory point of view it doesn't make a difference whether you give an inner or not.

john-hen commented 1 year ago

Ah, okay, that's indeed a lot of time steps. So that does get you into gigabyte territory. For the three coordinates of 1000 particles you'd then have 840 MB of floating-point arrays alone. Plus whatever memory overhead that may come with it. Maybe it is limited by the Java heap space then.
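That estimate checks out numerically (assuming 8-byte doubles and the decimal megabyte):

```python
# Size of the trajectory data from the discussion above: 1,000 particles,
# 35,000 time steps, 3 coordinates, double precision (8 bytes per value).
particles, time_steps, coordinates = 1000, 35_000, 3
values = particles * time_steps * coordinates
megabytes = values * 8 / 10**6          # decimal MB
print(f'{values:,} values = {megabytes:.0f} MB')   # 105,000,000 values = 840 MB
```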

If I read the implementation of model.evaluate() correctly, when we specify inner, it should actually only transfer the data for those select time steps. So it should make a difference from a memory point of view. Or rather: it could. I don't think I've ever profiled that. What Comsol really does in the background is never so clear. But on our side, we're already trying to keep the memory footprint low. That is, we set the innerinput property of the EvalPoint feature to manual and then pass only the solnum indices we want.

(For particles, to be clear. For other datasets, we filter according to inner after retrieving the data. Probably because the evaluation feature we use there, Eval, does not have that innerinput property.)

I find that weird since on the Windows machine it is set to 2 GB and there I can evaluate the simulation with 2,000 particles.

I believe it's 4 GB on Windows, when using the stand-alone client (i.e. the default). That's just the default value of the Java VM, and it's currently not configurable. (Though in principle it could be.) Maybe that explains the difference. The server has only 2 GB. I don't know yet how to increase that limit without changing comsolmphserver.ini. Maybe the -Xmx parameter could also be passed on the command line. Though then I don't really know how to test that. The Java commands we can issue via JPype are running inside the client, so we can't really inspect what's going on with the separate Java VM of the server.

wacht02 commented 1 year ago

I am now almost certain that it is the Java heap space.

I have copied the comsolmphserver.ini file from the installation directory on the cluster to my /home/ directory and modified the -Xmx2g entry to -Xmx4g. I now tell COMSOL to use this .ini file when starting by passing '-comsolinifile', '/home/../comsolmphserver.ini' as arguments to MPh. This seems to do the trick.
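For anyone finding this later: per the steps above, the only line that needs changing in the copied comsolmphserver.ini is the Java heap option (shown here raised to 4 GB; pick a value that fits the memory you request for the job):

```
-Xmx4g
```

The copied file is then handed to Comsol via the -comsolinifile argument, as described above.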

The question now is how much Java heap space I will be needing since I want to run the simulation with a lot more particles.

I believe it's 4 GB on Windows, when using the stand-alone client (i.e. the default). That's just the default value of the Java VM, and it's currently not configurable. (Though in principle it could be.) Maybe that explains the difference. The server has only 2 GB. I don't know yet how to increase that limit without changing comsolmphserver.ini. Maybe the -Xmx parameter could also be passed on the command line. Though then I don't really know how to test that. The Java commands we can issue via JPype are running inside the client, so we can't really inspect what's going on with the separate Java VM of the server.

I checked the .ini files on the Windows machine and there it was also set to -Xmx2g, so "only" 2 GB of Java heap space. I might be missing something here though.

If I read the implementation of model.evaluate() correctly, when we specify inner, it should actually only transfer the data for those select time steps. So it should make a difference from a memory point of view. Or rather: it could. I don't think I've ever profiled that. What Comsol really does in the background is never so clear. But on our side, we're already trying to keep the memory footprint low. That is, we set the innerinput property of the EvalPoint feature to manual and then pass only the solnum indices we want.

This is good news and I probably misunderstood the code there. I was thinking of splitting the evaluation up into smaller parts. So the first evaluation gets the first 10% of all time steps, the second evaluation the second 10%, and so on. If the memory footprint is actually smaller when specifying an inner, then this may be beneficial.
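A sketch of what that chunked evaluation could look like. Note that batched_indices and evaluate_in_batches are hypothetical helpers, not part of MPh, and the exact array shape Comsol returns (and hence the concatenation axis) may need adjusting:

```python
import numpy as np

def batched_indices(n_steps, batch_size):
    """Yield 1-based solution indices in consecutive batches.

    Comsol numbers its inner (time) solutions starting at 1.
    """
    for start in range(1, n_steps + 1, batch_size):
        yield list(range(start, min(start + batch_size, n_steps + 1)))

def evaluate_in_batches(model, expression, unit, n_steps, batch_size=1000):
    """Evaluate an expression batch-wise over the time steps and stitch
    the pieces together along the (assumed) last, i.e. time, axis."""
    chunks = []
    for indices in batched_indices(n_steps, batch_size):
        # Passing a list of solution indices as `inner` should limit how
        # much data has to be transferred per call.
        chunks.append(np.asarray(model.evaluate(expression, unit, inner=indices)))
    return np.concatenate(chunks, axis=-1)
```

With 35,000 time steps and a batch size of 1,000, this would issue 35 evaluate calls, each transferring only one batch. The full result still accumulates on the Python side, but that lives in NumPy's native memory rather than the Java heap.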

Lastly, I think I may be wrong in #128 by stating that -autosave off does not write recovery files anymore. COMSOL still writes them but deletes them after the solving is done (I think). I found this out as I was trying to set up a Python environment and couldn't do much because my /home/ directory was filled with 470 GB worth of recovery files. I think they don't get deleted when the program crashes during solving. So you might want to watch out for that and possibly point -recoverydir to somewhere with more capacity.

john-hen commented 1 year ago

Great, so we can just pass the entire ini file. Makes it quite a bit more inconvenient to start the session, but at least there is a way then.

Yes, these recovery files are a pain. I've tried everything I could think of to get rid of them. They are useless when scripting things and just fill up the hard drive. But whatever I did, maybe there are less of them, but they are never completely gone. I don't know what's up with that. Looks like a bug in Comsol.

As I've learned just yesterday, in issue #131, the way we start the client, it ignores what's in comsol.ini. So that 2 GB setting has no effect, as the following script demonstrates:

import mph
import jpype

mph.start()

MB = 1024 * 1024
max_memory = jpype.java.lang.Runtime.getRuntime().maxMemory()
print(f'maximum memory: {max_memory//MB} MB')

It outputs 4020 MB when run on Windows.

john-hen commented 1 year ago

To elaborate a little more... The difference between Windows and Linux (on the cluster) makes sense if it's just that factor of 2. Like, if you go to something like 4000 particles on Windows, it should also run out of memory.

More generally, Comsol tacitly assumes that "the typical user" wouldn't evaluate results to the point where it would require this kind of Java heap space. Otherwise you'd have to fiddle with the Java memory settings. As you've already figured out how to do.

So this use case is assumed to be rare. Because usually, the memory requirements are determined by the matrix solver. Not the solution vectors. Which is a fair assumption to make. But with this fine a time resolution, that assumption is questionable.

What really happens then, is that Comsol has to transfer those vectors from "native memory" (think back-end math libraries implemented in Fortran or something) into the Java virtual machine. (This is an educated guess. I don't actually know how Comsol works internally.) Once it's there, in the Java VM, we let JPype "represent" that memory space to us as NumPy arrays. As far as I understand (and that understanding also only goes so far), JPype uses "memory views" to accomplish that. That part shouldn't introduce (much) extra memory overhead.

So when you retrieve the data "in chunks", you have to make sure that the allocated memory blocks can be garbage-collected in the mean time. Like, if the variable goes out of scope. For example, you assign x = model.evaluate('cpt.px', 'mm'). This will stay in memory until you return from the function. Or until you assign x to something else. Either the next "chunk", like in a for loop, or even just None. But as long as x points to that array, the array won't be garbage-collected. Because then its "reference count" would remain above zero.
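The rebinding behaviour is easy to demonstrate in plain Python (a minimal sketch: a weakref-able list subclass stands in for the array an evaluation would return, and immediate collection relies on CPython's reference counting):

```python
import weakref

class Chunk(list):
    """Stand-in for the array-like result of one evaluation.
    (Plain lists do not support weak references, hence the subclass.)"""

def fake_evaluate(n):
    return Chunk([0.0] * n)

chunk = fake_evaluate(1000)
tracker = weakref.ref(chunk)    # watch the first chunk without owning it
print(tracker() is not None)    # True: `chunk` still references it

chunk = fake_evaluate(1000)     # rebinding drops the last strong reference
print(tracker() is None)        # True: CPython collected the old chunk
```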

john-hen commented 1 year ago

Closed as this appears to be an issue with Java memory management. If "we" (MPh) can do anything about that on the Python side, feel free to open a new issue to request that feature.