Thanks @yoelcortes for letting us know about this change, it seems very helpful for improving speed if it works. My main concerns (kind of related, and having to do with how QSDsan works):
One question that's not entirely clear to me - if sparse arrays don't store 0 values, how do they deal with values that turn into 0/change from 0 to non-0s during simulation?
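A minimal sketch of how a dict-backed sparse structure can handle this (hypothetical, for illustration only; not thermosteam's actual implementation): assigning a nonzero value stores a key, and assigning zero deletes it, so values can move freely between zero and nonzero during simulation.

```python
# Hypothetical dict-backed sparse vector, for illustration only --
# not thermosteam's actual class.
class TinySparseVector:
    def __init__(self, size):
        self.size = size
        self.dct = {}  # index -> nonzero value

    def __setitem__(self, index, value):
        if value:
            self.dct[index] = value    # nonzero values are stored
        else:
            self.dct.pop(index, None)  # setting to 0 removes the key

    def __getitem__(self, index):
        return self.dct.get(index, 0.)  # missing keys read as 0

v = TinySparseVector(5)
v[2] = 3.0           # 0 -> nonzero: key is added
v[2] = 0.0           # nonzero -> 0: key is removed
print(v[2], v.dct)   # 0.0 {}
```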
Then going back to the reason that made you make the switch - the oilcane biorefinery is becoming a beast in itself, and I'm not sure if splitting some of the modules would make it easier simulation-wise (e.g., split into sugarcane-only, oilcane-only, and oilcane-oilsorghum combined). Maybe this would break your analysis, so just a thought.
Finally, it would be nice to talk about this during the workshop, but unfortunately my entire next week is pretty much all booked (and I'll be away Jan 19/20). I should have some relief between Jan 20 and early Feb (if the comments on the WWT manuscript aren't terrible) to have a look at this. If it's an easy transition, then it'll probably be OK; but if it means reworking the fundamentals of dynamic simulation, I (and very likely @joyxyz1994) will take a much longer time to look at this and might try to find a way for both property arrays and sparse arrays to work.
Thanks!
@yalinli2,
Thanks so much for the quick reply and noting your concerns. We can definitely hold the pull request until February (once you get less busy). The transition to the new sparse arrays should be seamless. I made a couple of small compatibility changes so that you don't need to use the memory-speed branch of biosteam or make any updates to QSDsan to get tests passing. I think the following points should help clarify how this is possible:
Here is a pic to show that all tests are passing on QSDsan and Exposan:
The error in test_exposan is not related to changes in thermosteam (maybe something wrong with my version):
I'll use docs made from the memory-speed branch for the workshop. Really no worries if you don't have time to review. Honestly, it's better to have a detailed review from you later than a quick one earlier.
Thanks!
@yalinli2, great idea on keeping both numpy arrays and sparse arrays as options. I'll keep numpy arrays as the default, but the default can be saved/changed locally with biosteam.settings. I am changing the pull request to a draft to reflect that it still needs more work.
Thanks!
Update: I tried supporting both sparse and numpy arrays, but it leads to a lot of code complexity and overhead, so I do not think it is worth the investment or future maintenance work. The reason is that sparse arrays have methods optimized for the places where thermosteam/biosteam needs nonzero chemicals (for mixture properties and equilibrium), so we would need two sets of code (one for sparse arrays and one for normal arrays). I benchmarked the code for getting mixture properties and running VLE when all defined chemicals are nonzero (the case where numpy should work best) and found that performance was about equal.
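For intuition on why those nonzero-optimized paths matter, here is a small, hypothetical sketch (plain Python/numpy, not thermosteam's actual code) of how a dict-of-nonzeros view lets property loops touch only the chemicals that are present, instead of scanning every defined chemical:

```python
import numpy as np

# Hypothetical illustration: with ~80 defined chemicals but only 3 present,
# a dict of nonzeros iterates exactly 3 entries, while a dense array must
# be scanned in full to find them.
n_chemicals = 80
dense = np.zeros(n_chemicals)
dense[[3, 10, 42]] = [2.0, 1.5, 0.5]       # only 3 nonzero chemicals
sparse_dct = {3: 2.0, 10: 1.5, 42: 0.5}    # dict-of-nonzeros view

# Dense: scan all 80 entries to find the nonzero ones.
nonzero_dense = [(i, x) for i, x in enumerate(dense) if x]
# Sparse: iterate exactly the 3 stored entries.
nonzero_sparse = sorted(sparse_dct.items())

assert nonzero_dense == nonzero_sparse
print(nonzero_sparse)  # [(3, 2.0), (10, 1.5), (42, 0.5)]
```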
I think I convinced myself further that this change is better for biosteam. I'm reopening the pull request since I do not plan on supporting flows in numpy arrays.
Thanks,
Thanks @yoelcortes for the update and for trying out making both work - as of right now it seems there's only one minor thing that needs to be updated on the QSDsan side - but I'll follow up here after a closer look!
OK, if there are any compatibility issues or missing methods with sparse arrays, just let me know and I can fix them and add tests.
Thanks!
@yalinli2,
Thanks so much for testing this out. Regarding property_arrays, yes, it should still work without any updates. However, in thermosteam we switched to DictionaryView objects to take advantage of the structure of SparseArray objects. This saves the trouble of creating a FreeProperty object for each chemical, which can be time consuming. Here is an example implementation of how it would look for the ComponentConcentrationIndexer in QSDsan:
```python
from thermosteam import indexer
from thermosteam.base import DictionaryView, SparseVector

@property
def group_conc_compositions(self):
    raise AttributeError('cannot set group by concentration')

ComponentConcentrationIndexer, ConcentrationIndexer = \
    indexer._new_Indexer('Concentration', 'mg/L', group_conc_compositions)
ChemicalMolarFlowIndexer = indexer.ChemicalMolarFlowIndexer

class ConcentrationDict(DictionaryView):
    __slots__ = ('dct', 'F_vol', 'MW', 'phase', 'phase_container')

    def __init__(self, dct, F_vol, MW, phase, phase_container):
        self.dct = dct
        self.F_vol = F_vol
        self.MW = MW
        self.phase = phase
        self.phase_container = phase_container

    def output(self, index, value):
        '''Concentration flows, in mg/L (g/m3).'''
        f_mass = value * self.MW[index]
        phase = self.phase or self.phase_container.phase
        if phase != 'l':
            raise AttributeError('Concentration only valid for liquid phase.')
        V_sum = self.F_vol
        return 1000. * f_mass / V_sum

    def input(self, index, value):
        raise AttributeError('Cannot set flow rate by concentration.')

def by_conc(self, TP):
    '''
    Return a ComponentConcentrationIndexer that references this object's
    molar and volume data (volume relies on molar).

    Parameters
    ----------
    TP : ThermalCondition

    '''
    chemicals = self._chemicals
    phase = self._phase
    F_vol = self.by_volume(TP).data.sum()
    self._data_cache['conc', TP] = conc = ComponentConcentrationIndexer.from_data(
        SparseVector.from_dict(
            ConcentrationDict(self.data.dct, F_vol, self.chemicals.MW, None, phase),
            chemicals.size
        ),
        phase, chemicals,
        False
    )
    return conc

from biosteam import Stream, settings
settings.set_thermo(['Water', 'Ethanol'])
s = Stream(Water=2, Ethanol=3)
iconc = by_conc(s.imol, s.thermal_condition)
iconc.show()
# ChemicalConcentrationIndexer (mg/L):
# (l) Water    1.7e+05
#     Ethanol  6.52e+05
```
Regarding `stream.imass[group]`, this now gives a Python float instead of a `numpy.float64` object, which has methods like `sum` and `mean` (so it always returned a number, but it seemed like it could be an array):
```python
>>> import numpy as np
>>> arr = np.ones(3)
>>> a = arr[0]
>>> print(a)
1.0
>>> a.sum()
1.0
>>> a.sum().sum()
1.0
```
Python floats are faster when not working with arrays and are not parasitic like numpy floats (which get propagated in calculations).
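To illustrate the "parasitic" behavior, a minimal sketch using plain numpy: any arithmetic involving a numpy scalar yields another numpy scalar, so one stray `numpy.float64` taints all downstream results unless it is converted once.

```python
import numpy as np

arr = np.ones(3)
a = arr[0]        # a numpy.float64, not a Python float
b = a + 1         # arithmetic with a numpy scalar yields another numpy scalar
c = float(a) + 1  # converting once stops the propagation
print(type(a).__name__, type(b).__name__, type(c).__name__)
# float64 float64 float
```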
To get each chemical value in a group, you can use either one of these:

```python
stream.imol[[i.ID for i in chemicals.group]]
stream.mol[chemicals.get_index(group)]
```
Let me know if you have any more questions! Thanks,
Thanks so much @yoelcortes! It all makes sense and I've updated the concentration indexer as you suggested. The changes look good to me now. I also tried @GaYeongKim's and @joyxyz1994's systems and they all seem fine to me, though @joyxyz1994 has some hard deadlines this week so she needs some time to confirm the changes are all good.
Sounds good. @joyxyz1994, looking forward to your review, but no rush.
Merged! Thank you all :)
@yalinli2, @joyxyz1994, @sarangbhagwat
Hope you are doing well! In my simulations for oilcane, I came across a MemoryError in Python when I have two or more biorefineries loaded. ~~I found the main culprit to be the arrays within stream and reaction objects~~ [edit: the main culprit was stream objects accumulating, but the changes made here are still impactful in the long run], which can be super long because of all the chemicals that are defined (about 80). Property arrays, like `mass` and `vol`, also don't help since each element in them has multiple values/refs. Considering that the streams in the oilcane biorefinery have, on average, 9 chemicals, I considered using scipy sparse arrays (which only store non-zero data). However, these are extremely slow... Instead, I made my own sparse array class that is optimized for BioSTEAM's needs. These work just like numpy arrays, but simply do not store 0-values. Tests are added to check the math and assert compatibility with special numpy methods and indexing. We also see some speed-ups when many chemicals are defined (more scalable!):
Click me for an example
```python
import thermosteam as tmo
tmo.settings.set_thermo(
    ['O2', 'N2', 'Water', 'Ethanol', 'NH3', 'Methanol', 'Glucose',
     'Sucrose', 'Fructose', 'Hexane', 'Methane', 'Octanol', 'Heptane',
     'Octane', 'Propanol', 'Hexanol', 'Ethane', 'Propane', 'LacticAcid']
)
s1 = tmo.Stream('s1', Water=2, Ethanol=3)
s2 = tmo.Stream('s2', Water=2, Ethanol=1, Methanol=3, Propanol=2)

def Hs():
    for s in (s1, s2):
        s.T = 300
        s.H
        s.T = 310
        s.H

def Hs_cache():
    for s in (s1, s2):
        s.T = 310
        s.H

def vle():
    for s in (s1, s2):
        s.vle(V=0.5, P=101325)
        s.vle(T=350, P=101325)

def vle_cache():
    for s in (s1, s2):
        s.vle(T=350, P=101325)

# %% With Sparse Arrays

# Time stream creation
# %timeit tmo.Stream(None, Water=2, Ethanol=3)
# 11.8 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# Time vapor-liquid equilibrium
# %timeit vle()
# 24.1 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
# %timeit vle_cache()
# 3.41 ms ± 7.66 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Time property retrieval
# %timeit Hs()
# 139 µs ± 406 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# %timeit Hs_cache()
# 13.2 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# %% With Dense (NumPy) Arrays

# Time stream creation
# %timeit tmo.Stream(None, Water=2, Ethanol=3)
# 13.7 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# Time vapor-liquid equilibrium
# %timeit vle()
# 24.1 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
# %timeit vle_cache()
# 3.52 ms ± 40.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Time property retrieval
# %timeit Hs()
# 147 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# %timeit Hs_cache()
# 15.9 µs ± 919 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

One example of how the new sparse array is optimized is that indexing elements of a 1-d array always gives a dense (numpy) array, because the assumption is that those values will be used for calculations and not for storing data.
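That design choice can be sketched as follows (a hypothetical, minimal dict-backed class for illustration, not thermosteam's actual implementation): multi-element indexing materializes a plain dense numpy array, zeros included, ready for fast vectorized math.

```python
import numpy as np

# Hypothetical sketch: multi-element indexing of a sparse 1-d vector
# returns a dense numpy array, assuming the result feeds calculations
# rather than long-term storage.
class DemoSparseVector:
    def __init__(self, dct, size):
        self.dct = dct    # index -> nonzero value
        self.size = size

    def __getitem__(self, index):
        if hasattr(index, '__iter__'):
            # Materialize a dense numpy array (zeros included) for math.
            return np.array([self.dct.get(i, 0.) for i in index])
        return self.dct.get(index, 0.)

v = DemoSparseVector({2: 2.0, 3: 3.0}, size=19)
dense = v[[2, 3, 5]]
print(type(dense).__name__, dense)  # ndarray [2. 3. 0.]
```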
As we continue to build bigger biorefineries and make more complex analyses, I think moving to sparse arrays is a must. The only drawback is that getting/setting values is now slower, but still much faster than with scipy sparse arrays.
Click me for an example
```python
>>> from scipy.sparse import lil_array
>>> import numpy as np
>>> scipy_arr = lil_array(np.zeros([1, 100]))  # Most efficient for changing data (other options include csr and dok arrays)
>>> %timeit scipy_arr[:] = 0
50.6 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> from thermosteam.base import sparse_array
>>> biosteam_arr = sparse_array(np.zeros([1, 100]))
>>> %timeit biosteam_arr[:] = 0
1.51 µs ± 2.49 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

>>> numpy_arr = np.zeros([1, 100])
>>> %timeit numpy_arr[:] = 0
399 ns ± 3.16 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

All tests (oilcane biorefineries included) are passing locally on the bioindustrial-park cane branch and the biosteam/thermosteam memory-speed branches. I hope to merge before the workshop on Jan 20, if you have a chance to review by then. Any comments/concerns are greatly appreciated. Thanks!