yaml writer slows down execution

ChrisBNEU commented 1 year ago

Bug Description

It appears that the yaml writer uses a lot of overhead during the execution of an rmg run. I ran a modified ethane pyrolysis input file with cProfile. The only thing I changed in rmg was I did not write a yaml file for rms every execution. I did that by commenting out this line in main.py:

self.attach(RMSWriter(self.output_directory))

the difference in execution times were:

with RMSWriter listener: 15 min 15 sec
without RMSWriter listener: 7min 35 sec

I attached the profiles and the rmg logs below. It is possible this becomes insignificant at longer execution times, I have not tried it. The Profiler graph shows it executing every time the model is enlarged, so it is also possible it gets worse with larger mechanisms.

How To Reproduce

Comment out the line specified above, then run the profiler on the input file I attached below (python-jl rmg.py -p input.py)

Installation Information

Describe your installation method and system information.

macOS 10.15.7
Installed from source via conda
RMG version information:
- RMG-Py: 3.0.0-1437-gef83a1c0e
- RMG-database: 3.1.0-616-g75fabcb6c

Additional Context

input file used: input.py.zip

profiles: profile_with_listener.pdf profile_without_listener.pdf

rmg logs: with_listener.log without_listener.log

JacksonBurns commented 1 year ago

Thanks for the thorough report @ChrisBNEU. I think a short term solution would be to add a way to optionally disable this writer from the input, but long term would be to use multiprocessing to spawn a task that runs the yaml writing parallel to the simulation itself. We could also look at a different output format that writes faster, like JSON.

rwest commented 1 year ago

I found this an interesting read. (One of those stack overflow posts where the best answer is in fact the one with the fewest upvotes). "How is it that json serialization is so much faster than yaml serialization in Python?" I'm sure we could make it much faster with a little effort. Especially as we're outputting the same few things every time and could hard-code the templates. Worth looking into as we switch to Cantera yaml output (#2078)

mjohnson541 commented 1 year ago

I think at least a significant part of this is due to the fact that we're generating smiles twice for every species when generating the yaml: https://github.com/ReactionMechanismGenerator/RMG-Py/blob/ef83a1c0e33854aaf749484755cc02a943c1ab01/rmgpy/yml.py#L101C5-L101C5 and https://github.com/ReactionMechanismGenerator/RMG-Py/blob/ef83a1c0e33854aaf749484755cc02a943c1ab01/rmgpy/yml.py#L115C58-L115C58.

I think if we use Molecule.smiles instead this will speed up yaml generation quite significantly.

ChrisBNEU commented 1 year ago

I linked a pr for this, but it is the temporary fix we mentioned. I just made an input file arg. Should it be linked to this issue or do we want to leave it open/move it to a discussion after it's merged?

JacksonBurns commented 1 year ago

I linked a pr for this, but it is the temporary fix we mentioned. I just made an input file arg. Should it be linked to this issue or do we want to leave it open/move it to a discussion after it's merged?

I think with a191fbd we can allow #2508 to close this (assuming that the patch works)

ChrisBNEU commented 1 year ago

I profiled the same run, with the updated code in yml.py that Jackson added. I think the use of molecule.smiles change helps, the % of the overhead time used by write_yml went from 46% to 30%. Unfortunately, it still appears that the yaml write takes a lot longer than writing 1 file at the end. I included the profiles and logs if people are curious. I think with #2508 we are covered, if people want it to run quicker they can just disable the yaml write for now, and people can refer back to this issue if they want to make the yaml write faster.

Profile_with_yaml_each_iter.pdf profile_yaml_at_end.pdf

log_with_yaml_each_iter.log log_yaml_at_end.log

mjohnson541 commented 1 year ago

Okay, I took a little bit closer look at the logs. The edge in this ethane pyrolysis system is incredibly small (since ethane is a small molecule) with edge core ratios of only ~3 for species and ~1.5 for reactions. I think this significantly reduces the reaction generation costs which is putting them on the same order as the simulation and the yaml file generation as this stage in mechanism generation.

I'm still surprised to see yaml still costs us so much to generate, but my suspicion is that this won't matter in performance critical runs as simulation costs and reaction generation costs will be much higher in those cases while the yaml generation will scale linearly with core size. I wonder if there's a clever way instead of generating an entirely new yaml just copy the last yaml file and "append" only the new species/reactions.

JacksonBurns commented 1 year ago

We can cache the results of obj_to_dict which should speed things up - going to make another quick commit.

github-actions[bot] commented 1 year ago

This issue is being automatically marked as stale because it has not received any interaction in the last 90 days. Please leave a comment if this is still a relevant issue, otherwise it will automatically be closed in 30 days.

JacksonBurns commented 1 year ago

Given the significance of the slowdown and the open PR which (nearly) fixes it, I'm going to label this as a bug.

ReactionMechanismGenerator / RMG-Py