CalebBell / chemicals

chemicals: Chemical database of Chemical Engineering Design Library (ChEDL)
MIT License

Simplifying constant property search algorithms #2

Closed yoelcortes closed 4 years ago

yoelcortes commented 4 years ago

Some functions that retrieve constant properties, such as critical points, have redundancies. For example:

The algorithm-improvements branch is taking care of this.

yoelcortes commented 4 years ago

Although most of these functions are definitely not bottlenecks, the changes should be helpful in the long term:

>>> from chemicals import critical_point_temperature
>>> Tc = critical_point_temperature('64-17-5') # ignore loading dfs
>>> %timeit critical_point_temperature('64-17-5')
7.3 µs ± 49.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> from thermo import Tc
>>> %timeit Tc('64-17-5')
30.9 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
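For reference, the kind of change involved can be sketched as building a plain dict from the DataFrame once, so each lookup is a single hash hit instead of going through pandas indexing every call. This is a minimal illustration with hypothetical data, not the actual chemicals internals:

```python
import timeit

import pandas as pd

# Hypothetical mini property table indexed by CAS number
df = pd.DataFrame({'Tc': [514.0, 647.14]}, index=['64-17-5', '7732-18-5'])

# Build a plain dict once at load time
Tc_cache = df['Tc'].to_dict()

def lookup_pandas(CASRN):
    # every call goes through the pandas indexing machinery
    return df.at[CASRN, 'Tc']

def lookup_dict(CASRN):
    # a dict hit is a single hash lookup
    return Tc_cache[CASRN]

print(timeit.timeit(lambda: lookup_pandas('64-17-5'), number=100000))
print(timeit.timeit(lambda: lookup_dict('64-17-5'), number=100000))
```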
CalebBell commented 4 years ago

Hi Yoel,

That's an impressive speedup! The algorithmic change had occurred to me before but never seemed worth the complexity. However, you have minimized that risk by writing new helper functions instead of doing everything inside each function. Tc and friends were among the very first functions ever written in thermo; they certainly have room for improvement.

I am hesitant to use a custom exception for signaling a bad method. ValueError is already built in and doesn't have to be imported or created. I'm not opposed to it fundamentally - it's just that ValueError provides a fair error message, and the best code is often code that doesn't need to be written. I see ValueError is not raised by similar functions elsewhere, so there is definitely work either way. An exceptions file is a good idea; I have some exceptions in utils that could be moved over.
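As a sketch of the plain-ValueError style described above (hypothetical data and names, not the real chemicals code):

```python
# Hypothetical sketch of the plain-ValueError style; the real function
# and source names in chemicals differ.
TC_DATA = {'64-17-5': {'IUPAC': 514.0, 'YAWS': 516.25}}

def Tc_lookup(CASRN, method=None):
    """Return a critical temperature from a chosen data source."""
    sources = TC_DATA.get(CASRN, {})
    if method is None:
        # fall back to the first available source in priority order
        return next(iter(sources.values()), None)
    if method not in sources:
        # the built-in ValueError already gives a fair message; no
        # custom exception class is needed
        raise ValueError("Unrecognized method '%s' for CAS %s" % (method, CASRN))
    return sources[method]
```

For example, Tc_lookup('64-17-5', 'YAWS') returns the second source's value, while an unknown method name raises a readable ValueError.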

Have you figured out running the tests? Tests can seem like a pain because they break, but they have been crucial to my development efforts. With good line coverage, you know a function is written correctly. These changes break 5 of the tests.

For example the command for me is: py.test-3 --cov-report html --cov=chemicals --doctest-modules

However, if you are running Windows, pytest will be installed under a different name. Which distribution are you on?

yoelcortes commented 4 years ago

Good suggestion! My rule of thumb is to create a new exception class only when outer-level functions may need to handle that exception. For example, in this case "retrieve_from_df_dict" may also raise a ValueError when the wrong key is given, which would make exception handling difficult for callers of this function... In the end, I opted for some refactoring so that a new exception class isn't needed. I think it made the algorithm better too, haha.
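The ambiguity being described can be sketched as follows (hypothetical simplified code; the real retrieve_from_df_dict signature differs):

```python
# Hypothetical sketch: both failure modes raise the same ValueError,
# so a caller cannot tell them apart cleanly.
def retrieve_from_df_dict(df_dict, CASRN, key):
    if key not in df_dict:
        raise ValueError('invalid key %r' % key)        # programmer error
    value = df_dict[key].get(CASRN)
    if value is None:
        raise ValueError('no data for CAS %r' % CASRN)  # missing data
    return value

df_dict = {'Tc': {'64-17-5': 514.0}}

try:
    retrieve_from_df_dict(df_dict, '64-17-5', 'Vc')
except ValueError:
    # Wrong key, or missing chemical? A caller that only wants to fall
    # back on missing data cannot tell without parsing the message --
    # the motivation for refactoring rather than adding an exception class.
    pass
```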

Also, I think you may have seen flexsolve's InfeasibilityRegion error. That would be another special case, where a new exception can be used to interface with other methods.

This is the first time I've used pytest (I usually use doctests). I think an unrelated error may have stopped the tests, but since the test_critical.py module runs first, it looks like all my changes passed:

[screenshot: pytest output for test_critical.py]

Please check out my branch and let me know if you agree with the changes. I'd like to merge it :)

CalebBell commented 4 years ago

Hi Yoel,

To summarize the things I like about this branch:

This is really great work.

About the missing file - my apologies, it seems my global .gitignore file was ignoring some .zip and .gz files which were supposed to have been added to the repository. That explains the error you received running pytest; I believe your pytest didn't actually run any tests, but you should be able to run them now.

There are still currently 5 failing tests. Some of them test a feature you took out - SURF from Tc and Pc. That can be seen here: https://github.com/CalebBell/chemicals/blob/master/tests/test_critical.py#L206 . I am fine with taking SURF out; I would be fine taking it out of Vc also. The test for Vc is failing because of a copy/paste typo. test_relationships is failing because NONE has been removed from the return value of the available methods, which is not consistent with the other functions. I am fine with taking that out; it was kind of stupid, but it will need to be removed in a LOT of places if that's where you want to go. There are a couple more failures, I think from bugs in retrieve_from_df_dict - calls like Tc('108-88-3', Method='Yaws') are failing.

It is also possible to copy the code from the tests file into a Jupyter notebook and run it there; I do that sometimes too, as it can be faster than re-running pytest. I hope you can see why having those tests is valuable. I normally run the tests every couple of minutes while developing; that way, you usually know exactly which change caused the breakage and it is easier to fix. One more item before we merge this branch is to update the documentation for the changed methods; if SURF is removed from the methods, it needs to go from the documentation too.

Other than these minor issues, I look forward to merging the branch once the fixes are in!

yoelcortes commented 4 years ago

Sounds good, thanks for merging into the algorithm-improvements branch! I'll merge later tonight once I get all the tests passing.

yoelcortes commented 4 years ago

@CalebBell What do you think about adding the missing Zc columns (by definition) and the missing Vc column (using Ihmels) to the csv files?

I can add a comment about this where I load the files, and add these columns to the tests (following your style).

CalebBell commented 4 years ago

Hi Yoel,

Adding the missing Zc column to some datafiles by definition is fine. I would prefer whichever solution loads faster, although storing the column will also make editing those files harder if mistakes are ever found. I have no particular opinion and have only ever needed to edit the IUPACOrganicCriticalProps file.
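For reference, filling a missing Zc by definition is just Zc = Pc*Vc/(R*Tc); a minimal sketch (the inputs below are approximate critical constants for water, used only as an illustration):

```python
R = 8.31446261815324  # universal gas constant, J/(mol*K)

def Zc_from_definition(Tc, Pc, Vc):
    """Critical compressibility factor from its definition.

    Tc in K, Pc in Pa, Vc in m^3/mol; Zc is dimensionless.
    """
    return Pc*Vc/(R*Tc)

# Approximate critical constants of water give Zc around 0.23
Zc_water = Zc_from_definition(647.14, 22048320.0, 5.6e-05)
```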

I wouldn't want to mix literature-reported data with estimated data, however. I would be happy if the critical_surface method were removed from Vc. Removing it would make it clearer that these functions are for returning literature-reported data, not calculating it.

yoelcortes commented 4 years ago

Thanks for the quick reply. I'm adding the generalized methods to the other modules too, and found many functions that don't have an "IgnoreMethods" argument. This argument is not explicitly tested in the docs anymore either (since we removed SURF and COMBINED), and personally I have never needed to use it. I'd like to propose removing it from all functions. Any thoughts?

CalebBell commented 4 years ago

Hi Yoel, I concur with your assessment. IgnoreMethods was, I guess, a workaround for the fact that I didn't want those methods used anymore. Removing it sounds great!

yoelcortes commented 4 years ago

Cool, since I'm modifying all these functions, I have one final proposal for them. Although there is no real problem with the AvailableMethods and Method names, they are capitalized, which suggests they are classes. The word "method" also implies the function is calculating something rather than retrieving data, and in the Notes section the docs refer to the options as "sources". To address these points, I'd like to make the following name changes:

Once I get all these functions looking good, I'll merge the branch and focus on the next big thing :)

CalebBell commented 4 years ago

Hi Yoel,

I have regretted the capitalization of those variables for a long time, so lowercase/underscore gets a +1 in my book. I agree that for the current setup of chemical constants, "source" is a more accurate description than "method". I would prefer not to change to "source", however, for two reasons. The first is that there may someday be a need to add calculation-based methods, which are not published sources, to these functions. The second is consistency with all the other functions I have written using the "Method" and "AvailableMethods" variables, not just in chemicals but in fluids and ht, which are doing fine without any need for a rewrite. I could introduce a nice long deprecation cycle to replace those variables in those libraries with the pep8-style names, though. So I suggest: AvailableMethods -> get_methods, Method -> method
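Such a deprecation cycle could be sketched like this (a purely hypothetical wrapper with placeholder return values, not the actual fluids/ht code):

```python
import warnings

# Hypothetical sketch of a deprecation cycle: the old capitalized
# keywords still work, but warn and forward to the pep8-style names.
def Tc(CASRN, method=None, get_methods=False,
       AvailableMethods=False, Method=None):
    if Method is not None:
        warnings.warn("'Method' is deprecated; use 'method'",
                      DeprecationWarning)
        method = Method
    if AvailableMethods:
        warnings.warn("'AvailableMethods' is deprecated; use 'get_methods'",
                      DeprecationWarning)
        get_methods = True
    if get_methods:
        return ['IUPAC']  # placeholder source list
    return 514.0  # placeholder value
```

Old call sites keep working through the cycle while emitting a DeprecationWarning, so downstream code has time to migrate.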

yoelcortes commented 4 years ago

Hi Caleb,

So I fixed up all the constant property functions that are currently uploaded and merged my changes into the master branch (all tests passing). Not sure if you've looked into adding the reactions.py module, but I'm guessing you have, since you've included the reference phase and heat of formation of the elements in elements.py. So I've done some work on this in thermosteam that I'd like to share with you. The idea is to retrieve the heat of formation at a given phase (regardless of whether the chemical is at that phase at 298 K and 1 atm), removing some ambiguity and allowing users to use any datafile (if set in Hf_sources). Also, I noticed that the "API TDB Albahri Hf.tsv" file is all gases (I actually checked every single one of them, haha), so a name change to "API TDB Albahri Hf (g).tsv" might be appropriate:

The lookup algorithm could look like this:

def Hf(CASRN, phase_ref, Hvap_298K=None, Hfus=None, method=None):
    # Elements are zero by definition in their standard-state phase
    if phase_ref == 'l':
        if CASRN in elemental_liquid_standard_states: return 0.
    elif phase_ref == 'g':
        if CASRN in elemental_gas_standard_states: return 0.
    elif phase_ref == 's':
        if CASRN in elemental_solid_standard_states: return 0.
    else:
        raise ValueError('invalid reference phase ' + repr(phase_ref))
    Hf = None
    if not method or method in Hf_gas_methods:
        Hf = heat_of_formation_gas(CASRN, method)
        if Hf is not None:
            Hf = Hf_at_phase_ref(Hf, 'g', phase_ref, Hvap_298K, Hfus)
            if Hf is not None: return Hf
    if not (method or Hf) or method in Hf_liquid_methods:
        Hf = heat_of_formation_liquid(CASRN, method)
        if Hf is not None:
            Hf = Hf_at_phase_ref(Hf, 'l', phase_ref, Hvap_298K, Hfus)
            if Hf is not None: return Hf
    if method:
        Hf, phase = retrieve_from_df_dict(heat_of_formation_sources, CASRN,
                                          ['Hf_298K', 'phase'], method)
    else:
        Hf, phase = retrieve_any_from_df_dict(heat_of_formation_sources, CASRN,
                                              ['Hf_298K', 'phase'])
    if Hf is not None:
        Hf = Hf_at_phase_ref(Hf, phase, phase_ref, Hvap_298K, Hfus)
        if Hf is not None: return Hf

def Hf_at_phase_ref(Hf, phase, phase_ref, Hvap_298K, Hfus):
    if phase == phase_ref: return Hf
    elif phase == 'g' and phase_ref == 'l':
        if Hvap_298K: return Hf - Hvap_298K
    elif phase == 'g' and phase_ref == 's':
        if Hvap_298K and Hfus: return Hf - Hvap_298K - Hfus
    elif phase == 'l' and phase_ref == 'g':
        if Hvap_298K: return Hf + Hvap_298K
    elif phase == 'l' and phase_ref == 's':
        if Hfus: return Hf - Hfus
    elif phase == 's' and phase_ref == 'l':
        if Hfus: return Hf + Hfus
    elif phase == 's' and phase_ref == 'g':
        if Hvap_298K and Hfus: return Hf + Hvap_298K + Hfus
    else: raise ValueError('invalid phase ' + repr(phase))

But this would only work if the databases are already at 298 K (not at STP). Is this the case? If not, we could also return both the temperature and the phase and let the user manage it (I can already compute the enthalpy difference of a chemical, so this may actually be preferred for me).

CalebBell commented 4 years ago

Hi Yoel,

Some of this, like checking the elements, looks great. I know the reactions.py file is messy and unfinished; I have rarely used it, and the data sources are not very numerous or comprehensive either.

One thing I am trying to do is untangle the methods in chemicals: I would like one database lookup call to not perform another lookup internally. Because of this, I would not like to try to reconcile heats of formation between different phases in this function; in the non-ideal world there are other contributions besides the heats of vaporization and fusion as well. I am hoping for the design of this type of function not to require further changes in the future, only more data sources.

I think there is a place for a function that can do that conversion outside of the main lookup function, which will not look up any values itself, however! Here is the one I had, but it didn't handle solids.


def Hf_basis_converter(Hvapm, Hf_liq=None, Hf_gas=None):
    r'''This function converts a liquid or gas enthalpy of formation to the
    other. This is useful, as thermodynamic packages often work with ideal-
    gas as the reference state and require ideal-gas enthalpies of formation.

    Parameters
    ----------
    Hvapm : float
        Molar enthalpy of vaporization of compound at 298.15 K or (unlikely)
        the reference temperature, [J/mol]
    Hf_liq : float, optional
        Enthalpy of formation of the compound in its liquid state, [J/mol]
    Hf_gas : float, optional
        Enthalpy of formation of the compound in its ideal-gas state, [J/mol]

    Returns
    -------
    Hf_calc : float, optional
        Enthalpy of formation of the compound in the other state to the one
        provided, [J/mol]

    Notes
    -----

    Examples
    --------
    Calculate the ideal-gas enthalpy of formation for water, from its standard-
    state (liquid) value:

    >>> Hf_basis_converter(44018, Hf_liq=-285830)
    -241812

    Calculate the standard-state (liquid) enthalpy of formation for water, from
    its ideal-gas value:

    >>> Hf_basis_converter(44018, Hf_gas=-241812)
    -285830
    '''
    if Hf_liq is None and Hf_gas is None:
        raise ValueError("Provide either a liquid or a gas enthalpy of formation")
    if Hvapm is None or Hvapm < 0.0:
        raise ValueError("Enthalpy of vaporization unknown or negative")
    if Hf_liq is None:
        return Hf_gas - Hvapm
    else:
        return Hf_liq + Hvapm
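A possible extension covering solids, following the same sign convention as Hf_at_phase_ref above (Hf(liq) = Hf(sol) + Hfus); this is only a hedged sketch, not code in chemicals:

```python
# Hypothetical companion converter for solids; not part of chemicals.
# Sign convention matches the liquid/gas converter: Hf(liq) = Hf(sol) + Hfusm.
def Hf_basis_converter_solid(Hfusm, Hf_sol=None, Hf_liq=None):
    if Hf_sol is None and Hf_liq is None:
        raise ValueError("Provide either a solid or a liquid enthalpy of formation")
    if Hfusm is None or Hfusm < 0.0:
        raise ValueError("Enthalpy of fusion unknown or negative")
    if Hf_sol is None:
        # liquid value given: subtract fusion to reach the solid basis
        return Hf_liq - Hfusm
    # solid value given: add fusion to reach the liquid basis
    return Hf_sol + Hfusm
```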
CalebBell commented 4 years ago

As a side note, a really nasty issue came up.

I am not sure if you are aware, but numpy has its own float classes. Pandas, built on numpy, ends up returning these numpy data types. The numpy datatypes have a pretty big drawback: they are slower. See the timings below for an example.

five = 5.0
%timeit five*five
69.4 ns ± 0.851 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit five/five
72.4 ns ± 2.01 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit five**five
128 ns ± 2.33 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

five = np.float64(5.0)
%timeit five*five
156 ns ± 2.05 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit five/five
185 ns ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit five**five
216 ns ± 5.77 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

What makes this intolerable is that they are parasitic - they never go away. Put even one numpy number into a function and it makes the return value of that function a numpy value; before you know it, everything you are doing is a numpy number and speed has halved or worse.
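The propagation is easy to demonstrate, along with the usual remedy of an explicit float() cast at the lookup boundary (a generic sketch, not necessarily the exact fix in master):

```python
import numpy as np

def add_tax(price):
    # ordinary arithmetic: the result type follows the input type
    return price * 1.13

plain = add_tax(5.0)
contaminated = add_tax(np.float64(5.0))

assert type(plain) is float
assert type(contaminated) is np.float64  # the numpy type propagates through

# The remedy, applied once where values leave pandas/numpy: cast to a
# native float so everything downstream stays native.
clean = add_tax(float(np.float64(5.0)))
assert type(clean) is float
```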

I believe I have fixed that issue in master.

CalebBell commented 4 years ago

Numpy is really incredible - don't get me wrong! It's just that there are applications it's perfect for and others it's not. No worries or anything about this.

yoelcortes commented 4 years ago

I knew they used different data types, but I had no clue they were this slow! Thanks for fixing this issue. I'll go ahead and update your changes to my branch. As for the heat of formation, let's continue that thread in the new issue you made. Thanks!