rwest closed this issue 1 year ago.
I just ran the examples/rmg/methylformate case under the profiler. It ran for ~17 hours on my iMac before crashing (with issue #77). Profiling data is at https://gist.github.com/2910440, but in summary:

- Running time: 17 hours (61,955 sec)
- 68% (41,975 sec): model updateUnimolecularReactionNetworks (pressure-dependent stuff)
- 23% (14,360 sec): solver.base.ReactionSystem.simulate()
Looking into the first of these:
```
Function                                           called...
                                                       ncalls   tottime    cumtime
model.py:1250(updateUnimolecularReactionNetworks) ->
                                                           11     0.000      0.002  __init__.py:1584(warning)
                                                          225     0.001      0.056  __init__.py:1594(info)
                                                           44     0.000      0.000  __init__.py:1602(debug)
                                                           65     0.004      0.004  pdep.py:231(merge)
                                                      3289804    40.901   1361.483  pdep.py:352(update)
                                                    214354576    41.754     41.754  {isinstance}
                                                  43678774188  2486.316   2486.316  {len}
                                                          280     0.002      0.002  {method 'format' of 'str' objects}
                                                         1172     0.011      0.011  {method 'getEquilibriumConstant' of 'rmgpy.reaction.Reaction' objects}
                                                         2344     0.165      0.165  {method 'getRateCoefficient' of 'rmgpy.reaction.Reaction' objects}
                                                          282     0.000      0.000  {method 'insert' of 'list' objects}
                                                          933     1.681      1.681  {method 'remove' of 'list' objects}
                                                          160     0.000      0.000  {sum}
```
The actual updating is done in pdep.py:352(update), which accounts for only ~2% of the total time (1,361 seconds). The rest is presumably the loops that remove reverse reactions from the core (for example, `len` is called 43,678,774,188 times!). This seems like an area to optimize.
I wonder if there's an entirely different way to do this, or else if we can cythonize it.
I've tried to make updateUnimolecularReactionNetworks() go faster by reducing both the merging of partial pdep networks and the removal of reverse pdep reactions to O(N) instead of O(N^2). It's hard to tell which of these was causing it to be slow - maybe both? - so I tried to address each of them. Either way I've removed many of the calls to len() so hopefully it won't be called 43 billion times any more. I've tested the changes on the 1,3-hexadiene model, and the generated model doesn't appear to have changed.
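To illustrate the kind of change involved, here is a hedged sketch (the function names and `Reaction` stand-ins are hypothetical, not RMG-Py's actual code) of why replacing repeated `list.remove` calls with a single set-based filtering pass turns the reverse-reaction removal from O(N^2) into O(N):

```python
# Hypothetical illustration of the O(N^2) -> O(N) change; not RMG-Py's real code.

def remove_reverse_reactions_quadratic(core_reactions, reverse_reactions):
    # O(N^2): the `in` test scans the list, and list.remove() scans it again
    for rxn in reverse_reactions:
        if rxn in core_reactions:
            core_reactions.remove(rxn)
    return core_reactions

def remove_reverse_reactions_linear(core_reactions, reverse_reactions):
    # O(N): build a set of identities once, then filter in a single pass.
    # Using id() avoids requiring Reaction objects to be hashable.
    to_remove = {id(r) for r in reverse_reactions}
    return [r for r in core_reactions if id(r) not in to_remove]
```

Both versions return the same surviving reactions in the same order; only the scaling differs.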
See my new pdep branch if you want to try it.
Excellent. Look forward to trying it.
I ran it again on your pdep branch. Both runs crashed in exactly the same way (issue #77 on network 41,809), suggesting the models are the same. But whereas the first took 47,278,589,233 function calls (47,165,643,959 primitive calls) in 61,955.711 seconds (17 h), the new one took 3,411,174,244 function calls (3,298,217,272 primitive calls) in 20,996.773 seconds (6 h). That's a 3-fold speed-up overall! :-)
I suggest we merge that branch to master and look for some more fruit to pick!
New profile data is here
Excellent. Looks like "pdep is slow" may no longer be true. Not bad for ~2 hours of work.
The most obvious thing in the current profile is the solver. I'm starting to consider making this a pure Fortran code (very much like dasslAUTO in RMG-Java) and wrapping it with f2py or similar. The thinking is that any Python overhead becomes magnified a lot in the ODE solve because of all the function calls required. An ancillary benefit is that we could build DASSL right into the solver and thereby eliminate the separate PyDAS requirement. One downside is that this would be a step away from using something like Cantera, and could make it harder to add multiple reactor types in the future if that is ever something we want. I don't think the current approach would scale well to multiple reactor types anyway.
There's probably more low-hanging fruit in the model enlargement code, but since all the boxes are currently blue it would be hard to see with this example until we make the solver faster. Parallelization is probably not too far away, but since we could in principle do that to RMG-Java as well, it probably makes sense to do a bit more single-process optimization first.
Perhaps now is a good time to reconsider using Cantera, as we did in early versions of RMG-Py? I think Ray has done lots of work on it recently; perhaps ask him what he thinks.
Parallelizing the ODE solves should be relatively simple, as each reaction system is independent. Getting a generally robust kinetic model requires many reaction systems, so this would be a decent help.
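A minimal sketch of the idea (all names here are hypothetical stand-ins, not RMG-Py's API): since each reaction system's simulation is independent, the simulations can simply be mapped over a worker pool. A thread pool is shown for brevity; CPU-bound solves in CPython would want a process pool (e.g. `ProcessPoolExecutor` behind an `if __name__ == "__main__"` guard) instead.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate_system(conditions):
    # hypothetical stand-in for ReactionSystem.simulate()
    temperature, pressure = conditions
    return temperature * pressure

def simulate_all(systems, max_workers=4):
    # Each reaction system is independent, so they can run concurrently;
    # executor.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(simulate_system, systems))
```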
I recently ran a Python vs. Java speed comparison for the graph isomorphism algorithm. To do this I ported the current isomorphism algorithm from Python/Cython to Java. The speed test involves 3000 randomly-generated graphs of the same size, with no semantic information. The timing results (fastest of 4 runs in each case):
```
12-vertex graphs:
    Java:               9000000 graph isomorphisms in   6.3 s
    Cython, .pyx file:  9000000 graph isomorphisms in   7.1 s
    Cython, .py file:   9000000 graph isomorphisms in  10.6 s
    Python:             9000000 graph isomorphisms in 402   s

24-vertex graphs:
    Java:               9000000 graph isomorphisms in   7.9 s
    Cython, .pyx file:  9000000 graph isomorphisms in  12.1 s
    Cython, .py file:   9000000 graph isomorphisms in  16.6 s
    Python:             9000000 graph isomorphisms in 695   s
```
You can immediately see why I was originally excited about Cython, as the speed gain it affords is tremendous (~50x!). Interestingly, there appears to be a noticeable time penalty of ~25-35% for keeping the implementation in a pure Python (.py) file rather than a .pyx file.
Unfortunately, the Java implementation is consistently ~2x faster than the Cython implementation. Note that I designed the new graph isomorphism algorithm to minimize the most expensive calls into the Python/C API, which means that it is basically tailored to making the Cython version as fast as possible. However, when you look at the generated C code from Cython, there are still a lot of places with significant overhead. For example, the code for iterating over the vertices
```python
for vertex in vertices:
    pass
```
expands to (in rough pseudocode)
```
if vertices is None
    raise Exception
end if
get a reference to vertices and increment its reference counter
set i = 0
do while true
    set s to size of list
    if i >= s
        break
    end if
    get item at index i in list and increment its reference counter
    if item is None or item is not a Vertex
        raise Exception
    end if
    set vertex = item
    # Here's where you would use vertex for something...
    decrement the reference counter of vertex
end do
decrement the reference counter of vertices
```
In general we would want all of these extra checks, but in specific cases such as this we know that they are not necessary. For instance, the above could be simplified greatly by assuming that the list does not change within the for loop, that all items are valid vertices, and that the list items do not get deleted within the for loop. Cython cannot do this for us, unfortunately, so we would probably have to hand-code the critical loops (or maybe use typed memoryviews?) to make up the difference between Cython and Java. I don't think this would need to be done much, just in the absolutely critical bits of code.
A summary of some thoughts:
P.S. For the record, I am running Python 2.7.3, Cython 0.16, and Java 1.7 on a 64-bit machine running Arch Linux.
Latest profiling data for a ~8 hour run on methylformate in the neighbourhood of aeb076a25f266d793068ff8e513dd3e9e321400c. 18% of time spent checking duplicate reactions when writing chemkin output! (29% running simulations) Full data here
It reached:
```
The model core has 152 species and 7407 reactions
The model edge has 34644 species and 73022 reactions
Updating RMG execution statistics...
Execution time (DD:HH:MM:SS): 00:07:38:46
Memory used: 3468.67 MB
PBS mem: 3848352kb
PBS vmem: 4547860kb
PBS cput: 07:27:24
PBS walltime: 07:37:56
```
Hopefully 44f742bf43ef715c9e287f0464dfcb03192f3970 speeds up the chemkin duplicate checking.
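For context, a hedged sketch of the general technique for making duplicate checking roughly O(N) instead of O(N^2) (this is illustrative only; the tuple-based "reactions" here are hypothetical stand-ins for Reaction objects, not what the commit actually does): group reactions by a key built from their species, so only reactions sharing the same key ever need comparing.

```python
from collections import defaultdict

def find_duplicates(reactions):
    """Return groups of indices of reactions with the same reactants and products.

    Each reaction is a (reactants, products) pair of species collections.
    """
    groups = defaultdict(list)
    for i, (reactants, products) in enumerate(reactions):
        # frozensets make the key independent of species ordering
        key = (frozenset(reactants), frozenset(products))
        groups[key].append(i)
    # only keys with more than one reaction are potential duplicates
    return [idxs for idxs in groups.values() if len(idxs) > 1]
```

A real implementation would then do a more careful pairwise comparison within each (small) group, rather than across the whole list.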
I can't see how running the simulations can be trivially parallelized without multiplying the memory requirements by the number of processors (currently each simulation needs the entire core and edge), and as we're currently constrained almost as much by memory as by CPU, that's a problem. I have some ideas, but they'll take a little more effort.
The Chemkin and HTML outputting is still a little slow, but we're already talking single-digit percentage improvements.
Now able to fill up my iMac's memory within 2.5 hours with a methyl-butanoate job using PDep and 4 reaction systems.
Now only 5% is spent saving things ("Try `pip install MarkupSafe` to speed up the Jinja2 rendering!"). The rest breaks down as:

- 26% reaction simulations
- 22% PDep networks
- 18% thermo data estimation (without QM)
- 15% generating reactions
NB. profiling tutorial for cython http://docs.cython.org/src/tutorial/profiling_tutorial.html
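As a quick reminder of the workflow from that tutorial: for Cython modules you add `# cython: profile=True` at the top of the .pyx file so its functions show up in the profile; the profiling itself is plain `cProfile`/`pstats`. A minimal self-contained example (the `hot_loop` function is just a placeholder workload):

```python
import cProfile
import io
import pstats

def hot_loop(n):
    # placeholder for some expensive inner routine
    total = 0
    for i in range(n):
        total += i * i
    return total

def profile_it():
    # profile a single call and return (result, text report)
    pr = cProfile.Profile()
    pr.enable()
    result = hot_loop(100000)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
    return result, buf.getvalue()
```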
Job for Cineole finally completed. This job took 3 days (after starting from a restart file, so we're cheating a bit with the saved QM thermo files).
Final model enlargement:

```
The model core has 218 species and 1522 reactions
The model edge has 12135 species and 807649 reactions
```
It's a bit surprising that we're spending quite a bit of time (20%) on saving the pickle file. This is troubling: can we make this faster?
We are spending 36% on copying molecule objects as well.
Do we know how fast Java would be for the Cineole job? Based on my experience with JP10, I would expect less than a day, maybe even less than 6 hours. It might be good to run it and see.
If you can't make things faster, you can try doing them less often. Saving restart files is a good example of this type of "optimization". It may be possible to cythonize some of the pickle helper functions, and/or save fewer derived properties and recalculate more when unpickling (which we do less often).
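The "save fewer derived properties" idea can be sketched with the standard `__getstate__`/`__setstate__` pickle hooks (the `Species` class below is a toy stand-in, not RMG-Py's): drop a cached derived attribute from the pickled state and rebuild it on load.

```python
import pickle

class Species:
    """Toy class illustrating 'pickle less, recompute on unpickle'."""

    def __init__(self, name, atoms):
        self.name = name
        self.atoms = atoms
        # derived, (hypothetically) expensive to compute
        self.fingerprint = self._compute_fingerprint()

    def _compute_fingerprint(self):
        return hash((self.name, tuple(sorted(self.atoms))))

    def __getstate__(self):
        # omit the derived value from the pickle, shrinking the file
        state = self.__dict__.copy()
        state.pop("fingerprint", None)
        return state

    def __setstate__(self, state):
        # recompute the derived value when unpickling
        self.__dict__.update(state)
        self.fingerprint = self._compute_fingerprint()
```

This trades save size (and write time) for a recomputation cost that is only paid on the rarer unpickle path.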
I had a look at Molecule.copy(deep=True) the other day. It seemed pretty well cythonized, so not a lot of low hanging fruit there. May be able to squeeze a little out. Or try to think of ways to avoid needing it.
This "issue" is to collect thoughts and efforts to speed up RMG-Py.