BioSTEAMDevelopmentGroup / thermosteam

BioSTEAM's Premier Thermodynamic Engine

Use sparse arrays instead of dense (NumPy) arrays and deprecate property arrays #87

Closed · yoelcortes closed this 1 year ago

yoelcortes commented 1 year ago

@yalinli2, @joyxyz1994, @sarangbhagwat

Hope you are doing well! In my simulations for oilcane, I came across a MemoryError in Python when I have two or more biorefineries loaded. ~I found the main culprit to be the arrays within stream and reaction objects~ [edit: the main culprit was stream objects accumulating, but the changes made here are still impactful in the long run]. These arrays can be quite long because of all the chemicals that are defined (about 80). Property arrays, like mass and vol, also don't help, since each element in them holds multiple values/references. Considering that the streams in the oilcane biorefinery have, on average, 9 chemicals, I considered using scipy sparse arrays (which only store nonzero data). However, these are extremely slow...

Instead, I made my own sparse array class that is optimized for BioSTEAM's needs. These work just like numpy arrays but simply do not store 0-values. Tests were added to check the math and to assert compatibility with special numpy methods and indexing. We also see some speed-ups when many chemicals are defined (more scalable!):

Click me for an example

```python
import thermosteam as tmo
tmo.settings.set_thermo(
    ['O2', 'N2', 'Water', 'Ethanol', 'NH3', 'Methanol', 'Glucose', 'Sucrose',
     'Fructose', 'Hexane', 'Methane', 'Octanol', 'Heptane', 'Octane',
     'Propanol', 'Hexanol', 'Ethane', 'Propane', 'LacticAcid']
)
s1 = tmo.Stream('s1', Water=2, Ethanol=3)
s2 = tmo.Stream('s2', Water=2, Ethanol=1, Methanol=3, Propanol=2)

def Hs():
    for s in (s1, s2):
        s.T = 300
        s.H
        s.T = 310
        s.H

def Hs_cache():
    for s in (s1, s2):
        s.T = 310
        s.H

def vle():
    for s in (s1, s2):
        s.vle(V=0.5, P=101325)
        s.vle(T=350, P=101325)

def vle_cache():
    for s in (s1, s2):
        s.vle(T=350, P=101325)

# %% With Sparse Arrays

# Time stream creation
# %timeit tmo.Stream(None, Water=2, Ethanol=3)
# 11.8 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# Time vapor-liquid equilibrium
# %timeit vle()
# 24.1 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
# %timeit vle_cache()
# 3.41 ms ± 7.66 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Time property retrieval
# %timeit Hs()
# 139 µs ± 406 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# %timeit Hs_cache()
# 13.2 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# %% With Dense (NumPy) Arrays

# Time stream creation
# %timeit tmo.Stream(None, Water=2, Ethanol=3)
# 13.7 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# Time vapor-liquid equilibrium
# %timeit vle()
# 24.1 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
# %timeit vle_cache()
# 3.52 ms ± 40.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Time property retrieval
# %timeit Hs()
# 147 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# %timeit Hs_cache()
# 15.9 µs ± 919 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

One example of how the new sparse array is optimized: indexing multiple elements of a 1-d array always gives a dense (numpy) array, on the assumption that the result will be used for calculations rather than for storing data.
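For concreteness, here is a hypothetical snippet of that behavior, using the sparse_array constructor that also appears in the Scipy comparison below; the exact return types depend on the implementation, so treat this as a sketch rather than the definitive API:

```python
import numpy as np
from thermosteam.base import sparse_array

sa = sparse_array(np.array([[0., 2., 0., 3.]]))  # 2-d sparse array with one row
row = sa[0]            # the row itself only stores its nonzero entries
subset = row[[1, 3]]   # per the description above, indexing elements of a 1-d
print(subset)          # array should give a dense numpy array, e.g. array([2., 3.])
```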

As we continue to build bigger biorefineries and make more complex analyses, I think moving to sparse arrays is a must. The only drawback is that getting/setting values is now slower, though still much faster than with Scipy sparse arrays.

Click me for an example

```python
>>> from scipy.sparse import lil_array
>>> import numpy as np
>>> scipy_arr = lil_array(np.zeros([1, 100]))  # Most efficient for changing data (other options include csr and dok arrays)
>>> %timeit scipy_arr[:] = 0
50.6 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> from thermosteam.base import sparse_array
>>> import numpy as np
>>> biosteam_arr = sparse_array(np.zeros([1, 100]))
>>> %timeit biosteam_arr[:] = 0
1.51 µs ± 2.49 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

>>> import numpy as np
>>> numpy_arr = np.zeros([1, 100])
>>> %timeit numpy_arr[:] = 0
399 ns ± 3.16 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

All tests (oilcane biorefineries included) are passing locally on the bioindustrial-park cane branch and the biosteam/thermosteam memory-speed branches.

I hope to merge before the workshop on Jan 20, if you have a chance to review by then. Any comments/concerns are greatly appreciated. Thanks!

yalinli2 commented 1 year ago

Thanks @yoelcortes for letting us know about this change; it seems like it would be very helpful for improving speed if it works. My main concerns (kind of related, and both have to do with how QSDsan works):

  1. QSDsan uses property arrays (we made a new class for concentration), so I'm not sure whether the transition to sparse arrays would be smooth.
  2. The whole dynamic simulation relies on re-constructing the state of units and streams based on the # of chemicals loaded. Using sparse arrays (I haven't really tried) seems like it could break the whole deal; I guess we can see whether the tests on QSDsan pass (and how easy/hard the fix would be).

One question that's not entirely clear to me - if sparse arrays don't store 0 values, how do they deal with values that turn into 0/change from 0 to non-0s during simulation?

Then going back to the reason that made you do the switch: the oilcane biorefinery is becoming a beast in itself. I'm not sure if splitting some of the modules would make it easier simulation-wise (e.g., into sugarcane-only, oilcane-only, and oilcane-oilsorghum combined); maybe this would break your analysis, so it's just a thought.

Finally, it would be nice to talk about this during the workshop, but unfortunately my entire next week is pretty much all booked (and I'll be away Jan 19/20). I should have some relief between Jan 20 and early Feb (if the comments on the WWT manuscript aren't terrible) to have a look at this. If it's an easy transition, then it'll probably be OK, but if it means reworking the fundamentals of dynamic simulation, I (and very likely @joyxyz1994) will need much longer to look at this and might try to find a way for both property and sparse arrays to work.

Thanks!

yoelcortes commented 1 year ago

@yalinli2,

Thanks so much for the quick reply and noting your concerns. We can definitely hold the pull request until February (once you get less busy). The transition to the new sparse arrays should be seamless. I made a couple of small compatibility changes so that you don't need to use the memory-speed branch of biosteam or make any updates to QSDsan to get tests passing. I think the following points should help clarify how this is possible:

  1. SparseArray objects encapsulate a list of SparseVector objects to represent a 2-d array. A SparseVector represents a 1-d array with a dictionary of nonzero items (index and value) and an integer for the size. Retrieving a value that is not in the dictionary returns 0, and setting a value to 0 deletes the item (if any); see the sketch after this list.
  2. You can still use property_arrays without any issues. However, thermosteam now uses a new DictionaryView class to access data directly from the dictionaries in sparse vectors (instead of using a property_array).
  3. The analyses I'm doing use two biorefineries at a time to solve what technological performance is required to compete with conventional technologies. I'm pretty happy with how the code is set up and have not had any issues. Each biorefinery system is already split into different modules and can be created independently of the others.
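To make point 1 concrete (and to answer the earlier question about values changing to and from zero during simulation), here is a minimal pure-Python sketch of the dictionary-backed idea. It is illustrative only; the actual SparseVector in thermosteam.base is more featured and optimized:

```python
class TinySparseVector:
    """Toy 1-d sparse vector: a dict of nonzero items plus a size (illustrative only)."""
    def __init__(self, size):
        self.dct = {}    # index -> nonzero value
        self.size = size

    def __getitem__(self, index):
        return self.dct.get(index, 0.)   # missing keys read as 0

    def __setitem__(self, index, value):
        if value:
            self.dct[index] = value      # nonzero values are stored
        else:
            self.dct.pop(index, None)    # setting to 0 deletes the item (if any)

v = TinySparseVector(80)
v[3] = 2.5           # a value becomes nonzero during simulation
print(v[3], v.dct)   # 2.5 {3: 2.5}
v[3] = 0.            # it goes back to zero; the entry is removed, so nothing is stored
print(v[3], v.dct)   # 0.0 {}
```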

Here is a pic to show that all tests are passing on QSDsan and Exposan:

[image: screenshot of QSDsan and Exposan test results]

The error in test_exposan is not related to changes in thermosteam (maybe something wrong with my version):

[image: screenshot of the test_exposan error]

I'll use docs made from the memory-speed branch for the workshop. Really no worries if you don't have time to review. Honestly, it's better to have a detailed review from you later than a quick one earlier.

Thanks!

yoelcortes commented 1 year ago

@yalinli2, great idea on keeping both numpy arrays and sparse arrays as options. I'll keep numpy arrays as the default, but the default can be saved/changed locally through biosteam.settings. I am changing the pull request to a draft to reflect that it still needs more work.

Thanks!

yoelcortes commented 1 year ago

Update: I tried the idea of supporting both sparse and numpy arrays, but it leads to a lot of code complexity and overhead, so I do not think it is worth the investment or the future maintenance work. The reason is that sparse arrays have methods optimized for the places where thermosteam/biosteam needs nonzero chemicals (for mixture properties and equilibrium), and we would need two sets of code (one for sparse arrays and one for dense arrays). I benchmarked code for getting mixture properties and VLE when all defined chemicals are nonzero (the case where numpy should work best) and found that performance was about equal.

Click me for an example

```python
import thermosteam as tmo
tmo.settings.set_thermo(
    ['Water', 'Ethanol', 'Methanol', 'Propanol']
)
s1 = tmo.Stream(None, Water=2, Ethanol=3)
s2 = tmo.Stream(None, Water=2, Ethanol=1, Methanol=3, Propanol=2)

def create_streams():
    tmo.Stream(None, Water=2, Ethanol=3)
    tmo.Stream(None, Water=2, Ethanol=1, Methanol=3, Propanol=2)

def Cs():
    for s in (s1, s2):
        s.T = 300
        s.C
        s.T = 310
        s.C

def vle():
    for s in (s1, s2):
        s.vle(V=0.5, P=101325)
        H = s.H
        s.vle(T=350, P=101325)
        s.vle(H=H, P=101325)

Cs(); vle()

# %% With Sparse Arrays

# Time stream creation
# %timeit create_streams()
# 27 µs ± 467 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# Time vapor-liquid equilibrium
# %timeit vle()
# 49 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Time property retrieval
# %timeit Cs()
# 98.2 µs ± 408 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# %% With Dense (NumPy) Arrays

# Time stream creation
# %timeit create_streams()
# 32 µs ± 793 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# Time vapor-liquid equilibrium
# %timeit vle()
# 49.1 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Time property retrieval
# %timeit Cs()
# 105 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

I think I convinced myself further that this change is better for biosteam. I'm reopening the pull request since I do not plan on supporting flows in numpy arrays.

Thanks,

yalinli2 commented 1 year ago

Thanks @yoelcortes for the update and for trying out making both work. As of right now, it seems there's only one minor thing that needs to be updated on the QSDsan side, but I'll follow up here after a closer look!

yoelcortes commented 1 year ago

OK, if there are any compatibility issues or missing methods with sparse arrays, just let me know and I can fix them and add tests.

Thanks!

yoelcortes commented 1 year ago

@yalinli2,

Thanks so much for testing this out. Regarding property_arrays, yes, they should still work without any updates. However, in thermosteam we switched to DictionaryView objects to take advantage of the structure of SparseArray objects. This saves the trouble of creating a FreeProperty object for each chemical, which can be time consuming. Here is an example implementation of how it would look for the ComponentConcentrationIndexer in QSDsan:

```python
from thermosteam import indexer
from thermosteam.base import DictionaryView, SparseVector

@property
def group_conc_compositions(self):
    raise AttributeError('cannot set group by concentration')

ComponentConcentrationIndexer, ConcentrationIndexer = \
    indexer._new_Indexer('Concentration', 'mg/L', group_conc_compositions)

ChemicalMolarFlowIndexer = indexer.ChemicalMolarFlowIndexer

class ConcentrationDict(DictionaryView):
    __slots__ = ('dct', 'F_vol', 'MW', 'phase', 'phase_container')

    def __init__(self, dct, F_vol, MW, phase, phase_container):
        self.dct = dct
        self.F_vol = F_vol
        self.MW = MW
        self.phase = phase
        self.phase_container = phase_container

    def output(self, index, value):
        '''Concentration flows, in mg/L (g/m3).'''
        f_mass = value * self.MW[index]
        phase = self.phase or self.phase_container.phase
        if phase != 'l':
            raise AttributeError('Concentration only valid for liquid phase.')
        V_sum = self.F_vol
        return 1000. * f_mass / V_sum

    def input(self, index, value):
        raise AttributeError('Cannot set flow rate by concentration.')

def by_conc(self, TP):
    '''
    Return a ComponentConcentrationIndexer that references this object's
    molar and volume data (volume relies on molar).
    Parameters
    ----------
    TP : ThermalCondition
    '''
    chemicals = self._chemicals
    phase = self._phase
    F_vol = self.by_volume(TP).data.sum()
    self._data_cache['conc', TP] = \
    conc = ComponentConcentrationIndexer.from_data(
        SparseVector.from_dict(
            ConcentrationDict(self.data.dct, F_vol, self.chemicals.MW, None, phase),
            chemicals.size
        ),
        phase, chemicals,
        False
    )
    return conc

from biosteam import Stream, settings
settings.set_thermo(['Water', 'Ethanol'])
s = Stream(Water=2, Ethanol=3)
iconc = by_conc(s.imol, s.thermal_condition)
iconc.show()
#  ChemicalConcentrationIndexer (mg/L):
#   (l) Water    1.7e+05
#       Ethanol  6.52e+05
```

Regarding stream.imass[group]: this now gives a Python float instead of a numpy.float64 object. The numpy.float64 has methods like sum and mean, so it always returned a number, even though it could seem like an array:

```python
>>> import numpy as np
>>> arr = np.ones(3)
>>> a = arr[0]
>>> print(a)
1.0
>>> a.sum()
1.0
>>> a.sum().sum()
1.0
```

Python floats are faster when not working with arrays and are not parasitic like numpy floats, whose type propagates through subsequent calculations.
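A quick illustration of that "parasitic" behavior (plain numpy here, nothing thermosteam-specific):

```python
import numpy as np

a = np.float64(1.0)
print(type(a + 2.0))    # <class 'numpy.float64'> -- the numpy scalar type sticks
print(type(1.0 + 2.0))  # <class 'float'> -- plain Python floats stay plain
```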

To get each chemical value in a group, you can use either of these:

```python
stream.imol[[i.ID for i in chemicals.group]]
stream.mol[chemicals.get_index(group)]
```

Let me know if you have any more questions! Thanks,

yalinli2 commented 1 year ago

Thanks so much @yoelcortes! It all makes sense and I've updated the concentration indexer as you suggested. The changes look good to me now. I also tried @GaYeongKim and @joyxyz1994's systems and they all seem fine to me, though @joyxyz1994 has some hard deadlines this week, so she needs some time to confirm the changes are all good.

yoelcortes commented 1 year ago

Sounds good. @joyxyz1994, looking forward to your review, but no rush.

yoelcortes commented 1 year ago

Merged! Thank you all :)