OpenModelica / OpenModelica

OpenModelica is an open-source Modelica-based modeling and simulation environment intended for industrial and academic usage.
https://openmodelica.org

SimCode scales badly when generating models with over a million states #12627

Open casella opened 1 month ago

casella commented 1 month ago

Please compare LargeTestSuite.Mechanical.HarmonicOscillator.HarmonicOscillator_N_409600 and LargeTestSuite.Mechanical.HarmonicOscillator.HarmonicOscillator_N_819200. The first model has 0.8 M states and spends 60.6 s during the SimCode phase, the second has 1.6 M states and spends 151.4 s during the SimCode phase.

Now, this model is sparse and has an extremely regular structure, so there is no reason why the code generation phase should not scale as O(N). This could also be related to the abnormally high memory usage of this phase, see #10231.
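As a quick back-of-the-envelope check using the timings above, the effective scaling exponent between the two runs can be estimated like this (a sketch, not part of the compiler):

```python
import math

# Timings reported above for the SimCode phase:
# 0.8 M states -> 60.6 s, 1.6 M states -> 151.4 s.
t1, t2 = 60.6, 151.4

# Doubling N multiplies the time by t2/t1, so the effective
# exponent k in time ~ O(N^k) is log2(t2/t1).
k = math.log2(t2 / t1)
print(round(k, 2))  # ~1.32, clearly worse than the expected O(N)
```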

@perost could you do some profiling and identify the part(s) of SimCode whose execution time is not scaling linearly between those two examples? Then, we can pass on this information to @kabdelhak and @phannebohm.

Thanks!


perost commented 1 month ago

I ran the 204800 model through valgrind, but it's hard to draw any conclusions from that. The majority of the time is spent in NFComponentRef.toString; it looks like the NB scalarizes the x and y arrays and does a huge amount of hashing of variables.

For example, UnorderedMap.getSafe gets called about 7 million times from NSimVar.getIndex (~2.9 million), NBJacobian.SparsityColoring.PartialD2ColoringAlgC (~2.4 million), and NBJacobian.SparsityPattern.create (~1.2 million).

The slightly worse-than-linear scaling is probably just because the hash tables get larger and insertion/lookup takes a bit longer (I don't know how well our usual hashing function deals with e.g. x[1], x[2], ..., x[204800]). One optimization would be to cache the hash for each variable so it doesn't have to be recomputed all the time, but reducing the number of variables and/or the amount of hashing would of course be best.
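The hash-caching idea can be sketched as follows. This is a conceptual illustration in Python, not MetaModelica; the class and field names are hypothetical, not OpenModelica's actual data structures:

```python
class ComponentRef:
    """Hypothetical scalarized variable reference, e.g. x[42].
    Caches its hash so repeated map lookups don't rehash it."""

    __slots__ = ("name", "index", "_hash")

    def __init__(self, name, index):
        self.name = name
        self.index = index
        self._hash = None  # computed lazily on first use

    def __hash__(self):
        if self._hash is None:
            # Hash the components directly instead of building "x[42]" first.
            self._hash = hash((self.name, self.index))
        return self._hash

    def __eq__(self, other):
        return self.name == other.name and self.index == other.index


refs = [ComponentRef("x", i) for i in range(5)]
table = {r: i for i, r in enumerate(refs)}  # each ref hashed once, then cached
```

With millions of lookups per variable, paying the hashing cost once per variable instead of once per lookup is the point of the cache.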

casella commented 1 month ago

Thanks @perost for the analysis!

The majority of the time is spent in NFComponentRef.toString

I don't know the details of what is done in SimCode, but to me a conversion from a reference to a string really shouldn't be the bottleneck of the code generation process. Even if we do it several million times, we are running on processors clocked at 4 GHz or more.

@kabdelhak, @phannebohm, are these calls to NFComponentRef.toString necessary? If so, we should try to make that function efficient. But maybe, as @perost suggests, the algorithms that call it should be made a bit more efficient.

phannebohm commented 1 month ago

Throughout the NB we assume UnorderedMap and UnorderedSet to be efficient. We also assume toString to be the "best" way to hash stuff (I think it was based on this blog post). So I guess it's not a bad idea to try to make NFComponentRef.toString as efficient as possible.

For example, UnorderedMap.getSafe gets called about 7 million times from NSimVar.getIndex (~2.9 million), NBJacobian.SparsityColoring.PartialD2ColoringAlgC (~2.4 million), and NBJacobian.SparsityPattern.create (~1.2 million).

A while ago I spent a bit of time trying to reduce the number of hash calls in some places. I can look at these functions and see if that can be reduced as well.

perost commented 1 month ago

For reference, the 409600 model took about 105s to compile for me.

#12632 fixes a missing initial bucket count in NBJacobian.SparsityColoring.PartialD2ColoringAlgC, which shaves off about 1s. There might be other places where an initial bucket count could be added, but I didn't see any obvious ones like this one.

#12634 rewrites ComponentRef.hash to not call ComponentRef.toString and instead compute the hash directly. The hash algorithm we use can be continued by just using the computed hash instead of the initial one, so we can build the hash string by string instead of first concatenating them into one large string. This shaves off about 5s or so (the timing of each run differs by several seconds, so it's hard to say exactly).
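The "continue the hash" trick can be illustrated with FNV-1a, a common string hash whose state is a single integer (whether OpenModelica's hash is exactly FNV-1a is an assumption here; the chaining property is what matters). Hashing can resume from a previous hash value used as the seed, so the parts of a cref can be hashed one by one without concatenating them into one large string first:

```python
FNV_OFFSET, FNV_PRIME = 0x811C9DC5, 0x01000193

def fnv1a_32(s, seed=FNV_OFFSET):
    """32-bit FNV-1a over a string; seed defaults to the FNV offset basis,
    but passing a previously computed hash continues that hash."""
    h = seed
    for byte in s.encode():
        h = ((h ^ byte) * FNV_PRIME) & 0xFFFFFFFF
    return h

# One big concatenated string, hashed in one go:
whole = fnv1a_32("oscillator.x[42]")

# The same string hashed piecewise, feeding each result in as the seed:
piecewise = fnv1a_32("[42]", seed=fnv1a_32(".x", seed=fnv1a_32("oscillator")))

assert whole == piecewise  # no intermediate concatenation needed
```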

The hashes won't be the same as before, since the component reference is now hashed "backwards" for the sake of efficiency. On the whole this shouldn't matter, but some test cases with debug output that depends on the order of the elements need to be updated, so the PR won't pass on the first try.

perost commented 1 month ago

Both of my PRs are now merged, and I can't really see any other obvious ways to significantly improve performance. It might be possible to make small optimizations here and there, but the main reason for the bad scaling is the huge amount of scalarized variables rather than inefficient data structures.

casella commented 1 month ago

@phannebohm on further reflection, it is probably worth making this hashing more efficient. However, I understand the important scaling factor here is the number of states (-> the number of rows and columns in the coloured Jacobian of the ODE), not the number of equations. The HarmonicOscillator example is a bit extreme, since all of its variables are states. Normal models have 2-10% state variables, so even a 5M-equation model could have only a few hundred thousand states, which leads to reasonable figures.

Bottom line: this optimization is definitely worth considering, but it is not top priority. Getting sparse solvers to work (#10034) and implementing optimizations such as #8271 and #8026 in the NB have higher priority.