rwest closed this issue 1 year ago.
I just ran the examples/rmg/methylformate case under the profiler. It ran for ~17 hours on my iMac before crashing (with issue #77). Profiling data is at https://gist.github.com/2910440, but in summary:

- Running time: 17 hours (61,955 sec)
- 68% (41,975 sec): model updateUnimolecularReactionNetworks (pressure-dependent stuff)
- 23% (14,360 sec): solver.base.ReactionSystem.simulate()
Looking into the first of these:
```
Function                                           called...
                                                       ncalls   tottime    cumtime
model.py:1250(updateUnimolecularReactionNetworks) ->
                                                           11     0.000      0.002  __init__.py:1584(warning)
                                                          225     0.001      0.056  __init__.py:1594(info)
                                                           44     0.000      0.000  __init__.py:1602(debug)
                                                           65     0.004      0.004  pdep.py:231(merge)
                                                      3289804    40.901   1361.483  pdep.py:352(update)
                                                    214354576    41.754     41.754  {isinstance}
                                                  43678774188  2486.316   2486.316  {len}
                                                          280     0.002      0.002  {method 'format' of 'str' objects}
                                                         1172     0.011      0.011  {method 'getEquilibriumConstant' of 'rmgpy.reaction.Reaction' objects}
                                                         2344     0.165      0.165  {method 'getRateCoefficient' of 'rmgpy.reaction.Reaction' objects}
                                                          282     0.000      0.000  {method 'insert' of 'list' objects}
                                                          933     1.681      1.681  {method 'remove' of 'list' objects}
                                                          160     0.000      0.000  {sum}
```
The actual updating is done in pdep.py:352(update), which accounts for only ~2% of the total time (1,361 seconds). The rest is presumably the loops that remove reverse reactions from the core (for example, `len` is called 43,678,774,188 times!). This seems like an area to optimize.
I wonder if there's an entirely different way to do this, or else if we can cythonize it.
I've tried to make updateUnimolecularReactionNetworks() go faster by reducing both the merging of partial pdep networks and the removal of reverse pdep reactions to O(N) instead of O(N^2). It's hard to tell which of these was causing it to be slow - maybe both? - so I tried to address each of them. Either way I've removed many of the calls to len() so hopefully it won't be called 43 billion times any more. I've tested the changes on the 1,3-hexadiene model, and the generated model doesn't appear to have changed.
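To illustrate the kind of change involved, here is a hedged sketch (the function names and `Reaction` stand-ins are hypothetical, not RMG-Py's actual code) of why replacing repeated `list.remove` calls with a single set-based filtering pass turns the reverse-reaction removal from O(N^2) into O(N):

```python
# Hypothetical illustration of the O(N^2) -> O(N) change; not RMG-Py's real code.

def remove_reverse_reactions_quadratic(core_reactions, reverse_reactions):
    # O(N^2): the `in` test scans the list, and list.remove() scans it again
    for rxn in reverse_reactions:
        if rxn in core_reactions:
            core_reactions.remove(rxn)
    return core_reactions

def remove_reverse_reactions_linear(core_reactions, reverse_reactions):
    # O(N): build a set of identities once, then filter in a single pass.
    # Using id() avoids requiring Reaction objects to be hashable.
    to_remove = {id(r) for r in reverse_reactions}
    return [r for r in core_reactions if id(r) not in to_remove]
```

Both versions return the same surviving reactions in the same order; only the scaling differs.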
See my new pdep branch if you want to try it.
Excellent. Look forward to trying it.
I ran it again on your pdep branch. Both runs crashed in exactly the same way (issue #77 on network 41,809), suggesting the models are the same. But whereas the first took 47,278,589,233 function calls (47,165,643,959 primitive calls) in 61,955.711 seconds (17 h), the new one took 3,411,174,244 function calls (3,298,217,272 primitive calls) in 20,996.773 seconds (6 h). That's a 3-fold speed-up overall! :-)
I suggest we merge that branch to master and look for some more fruit to pick!
New profile data is here
Excellent. Looks like "pdep is slow" may no longer be true. Not bad for ~2 hours of work.
The most obvious thing in the current profile is the solver. I'm starting to consider making this a pure Fortran code (very much like dasslAUTO in RMG-Java) and wrapping it with f2py or similar. The thinking is that any Python overhead becomes magnified a lot in the ODE solve because of all the function calls required. An ancillary benefit is that we could build DASSL right into the solver and thereby eliminate the separate PyDAS requirement. One downside is that this would be a step away from using something like Cantera, and could make it harder to add multiple reactor types in the future if that is ever something we want. I don't think the current approach would scale well to multiple reactor types anyway.
There's probably more low-hanging fruit in the model enlargement code, but since all the boxes are currently blue it would be hard to see with this example until we make the solver faster. Parallelization is probably not too far away, but since we could in principle do that to RMG-Java as well, it probably makes sense to do a bit more single-process optimization first.
Perhaps now is a good time to reconsider using Cantera, as we did in early versions of RMG-Py? I think Ray has done lots of work on it recently; perhaps ask him what he thinks.
Parallelizing the ODE solves should be relatively simple, as each reaction system is independent. Getting a generally robust kinetic model requires many reaction systems, so this would be a decent help.
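A minimal sketch of the idea (all names here are hypothetical stand-ins, not RMG-Py's API): since each reaction system's simulation is independent, the simulations can simply be mapped over a worker pool. A thread pool is shown for brevity; CPU-bound solves in CPython would want a process pool (e.g. `ProcessPoolExecutor` behind an `if __name__ == "__main__"` guard) instead.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate_system(conditions):
    # hypothetical stand-in for ReactionSystem.simulate()
    temperature, pressure = conditions
    return temperature * pressure

def simulate_all(systems, max_workers=4):
    # Each reaction system is independent, so they can run concurrently;
    # executor.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(simulate_system, systems))
```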
I recently ran a Python vs. Java speed comparison for the graph isomorphism algorithm. To do this I ported the current isomorphism algorithm from Python/Cython to Java. The speed test involves 3000 randomly-generated graphs of the same size, with no semantic information. The timing results (fastest of 4 runs in each case):
```
12-vertex graphs:
    Java:               9000000 graph isomorphisms in   6.3 s
    Cython, .pyx file:  9000000 graph isomorphisms in   7.1 s
    Cython, .py file:   9000000 graph isomorphisms in  10.6 s
    Python:             9000000 graph isomorphisms in 402   s

24-vertex graphs:
    Java:               9000000 graph isomorphisms in   7.9 s
    Cython, .pyx file:  9000000 graph isomorphisms in  12.1 s
    Cython, .py file:   9000000 graph isomorphisms in  16.6 s
    Python:             9000000 graph isomorphisms in 695   s
```
You can immediately see why I was originally excited about Cython, as the speed gain it affords is tremendous (~50x!). Interestingly, there appears to be a noticeable time penalty of ~25-35% for keeping the implementation in a pure Python (.py) file rather than a .pyx file.
Unfortunately, the Java implementation is consistently ~2x faster than the Cython implementation. Note that I designed the new graph isomorphism algorithm to minimize the most expensive calls into the Python/C API, which means that it is basically tailored to making the Cython version as fast as possible. However, when you look at the generated C code from Cython, there are still a lot of places with significant overhead. For example, the code for iterating over the vertices
```python
for vertex in vertices:
    pass
```
expands to (in rough pseudocode)
```
if vertices is None
    raise Exception
end if
get a reference to vertices and increment its reference counter
set i = 0
do while true
    set s to size of list
    if i >= s
        break
    end if
    get item at index i in list and increment its reference counter
    if item is None or item is not a Vertex
        raise Exception
    end if
    set vertex = item
    # Here's where you would use vertex for something...
    decrement the reference counter of vertex
end do
decrement the reference counter of vertices
```
In general we would want all of these extra checks, but in specific cases such as this we know that they are not necessary. For instance, the above could be simplified greatly by assuming that the list does not change within the for loop, that all items are valid vertices, and that the list items do not get deleted within the for loop. Cython cannot do this for us, unfortunately, so we would probably have to hand-code the critical loops (or maybe use typed memoryviews?) to make up the difference between Cython and Java. I don't think this would need to be done much, just in the absolutely critical bits of code.
A summary of some thoughts:
P.S. For the record, I am running Python 2.7.3, Cython 0.16, and Java 1.7 on a 64-bit machine running Arch Linux.
Latest profiling data for a ~8 hour run on methylformate in the neighbourhood of aeb076a25f266d793068ff8e513dd3e9e321400c. 18% of time spent checking duplicate reactions when writing chemkin output! (29% running simulations) Full data here
It reached:
```
The model core has 152 species and 7407 reactions
The model edge has 34644 species and 73022 reactions
Updating RMG execution statistics...
Execution time (DD:HH:MM:SS): 00:07:38:46
Memory used: 3468.67 MB
PBS mem: 3848352kb
PBS vmem: 4547860kb
PBS cput: 07:27:24
PBS walltime: 07:37:56
```
Hopefully 44f742bf43ef715c9e287f0464dfcb03192f3970 speeds up the chemkin duplicate checking.
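For context, a hedged sketch of the general technique for making duplicate checking roughly O(N) instead of O(N^2) (this is illustrative only; the tuple-based "reactions" here are hypothetical stand-ins for Reaction objects, not what the commit actually does): group reactions by a key built from their species, so only reactions sharing the same key ever need comparing.

```python
from collections import defaultdict

def find_duplicates(reactions):
    """Return groups of indices of reactions with the same reactants and products.

    Each reaction is a (reactants, products) pair of species collections.
    """
    groups = defaultdict(list)
    for i, (reactants, products) in enumerate(reactions):
        # frozensets make the key independent of species ordering
        key = (frozenset(reactants), frozenset(products))
        groups[key].append(i)
    # only keys with more than one reaction are potential duplicates
    return [idxs for idxs in groups.values() if len(idxs) > 1]
```

A real implementation would then do a more careful pairwise comparison within each (small) group, rather than across the whole list.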
I can't see how running the simulations can be trivially parallelized without multiplying the memory requirements by the number of processors (currently each simulation needs the entire core and edge), and as we're currently constrained almost as much by memory as by CPU, that's a problem. I have some ideas, but they'll take a little more effort.
The Chemkin and HTML outputting is still a little slow, but we're already talking single-digit percentage improvements.
Now able to fill up my iMac's memory within 2.5 hours with a methyl-butanoate job using PDep and 4 reaction systems.
Now only 5% is spent saving things ("Try `pip install MarkupSafe` to speed up the Jinja2 rendering!"). The rest breaks down as:

- 26% reaction simulations
- 22% PDep networks
- 18% thermo data estimation (without QM)
- 15% generating reactions
NB. profiling tutorial for cython http://docs.cython.org/src/tutorial/profiling_tutorial.html
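As a quick reminder of the workflow from that tutorial: for Cython modules you add `# cython: profile=True` at the top of the .pyx file so its functions show up in the profile; the profiling itself is plain `cProfile`/`pstats`. A minimal self-contained example (the `hot_loop` function is just a placeholder workload):

```python
import cProfile
import io
import pstats

def hot_loop(n):
    # placeholder for some expensive inner routine
    total = 0
    for i in range(n):
        total += i * i
    return total

def profile_it():
    # profile a single call and return (result, text report)
    pr = cProfile.Profile()
    pr.enable()
    result = hot_loop(100000)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
    return result, buf.getvalue()
```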
Job for Cineole finally completed. This job took 3 days (after starting from a restart file, so we're cheating a bit with the saved QM thermo files).
Final model enlargement:

```
The model core has 218 species and 1522 reactions
The model edge has 12135 species and 807649 reactions
```
It's a bit surprising that we're spending quite a bit of time (20%) on saving the pickle file. This is troubling: can we make this faster?
We are spending 36% on copying molecule objects as well.
Do we know how fast Java would be for the Cineole job? Based on my experience with JP10, I would expect less than a day, maybe even less than 6 hours. It might be good to run it and see.
If you can't make things faster, you can try doing them less often. Saving restart files is a good example of this type of "optimization". It may be possible to cythonize some of the pickle helper functions, and/or save fewer derived properties and recalculate more when unpickling (which we do less often).
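The "save fewer derived properties" idea can be sketched with the standard `__getstate__`/`__setstate__` pickle hooks (the `Species` class below is a toy stand-in, not RMG-Py's): drop a cached derived attribute from the pickled state and rebuild it on load.

```python
import pickle

class Species:
    """Toy class illustrating 'pickle less, recompute on unpickle'."""

    def __init__(self, name, atoms):
        self.name = name
        self.atoms = atoms
        # derived, (hypothetically) expensive to compute
        self.fingerprint = self._compute_fingerprint()

    def _compute_fingerprint(self):
        return hash((self.name, tuple(sorted(self.atoms))))

    def __getstate__(self):
        # omit the derived value from the pickle, shrinking the file
        state = self.__dict__.copy()
        state.pop("fingerprint", None)
        return state

    def __setstate__(self, state):
        # recompute the derived value when unpickling
        self.__dict__.update(state)
        self.fingerprint = self._compute_fingerprint()
```

This trades save size (and write time) for a recomputation cost that is only paid on the rarer unpickle path.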
I had a look at Molecule.copy(deep=True) the other day. It seemed pretty well cythonized, so not a lot of low hanging fruit there. May be able to squeeze a little out. Or try to think of ways to avoid needing it.
This "issue" is to collect thoughts and efforts to speed up RMG-Py.