CalebBell / chemicals

chemicals: Chemical database of Chemical Engineering Design Library (ChEDL)
MIT License
186 stars 36 forks source link

Investigate storing much of the data in a database, to reduce load time and memory use #42

Closed CalebBell closed 2 years ago

CalebBell commented 2 years ago

This pull request adds an alternative option to looking up constant properties in Pandas dataframes. Essentially, a common path is to use a function like Tb without a method specified; and to not use either the Tb_methods or the Tb_sources function and variable respectively. For this one common application, the only needed information is the Tb value itself.

Therefore, a script dev/generate_sqlite_database.py was created to obtain the default values for each constant property in chemicals. The resulting database is 7.3 MB, and compresses to 3.3 MB in a zip archive.

The sqlite database is always stored on disk, so there is substantial memory savings from this approach. The query time is acceptable in general. When using something like %timeit, one search is 10-15 us. However, in this case the used database pages being cached in memory. On my machine when the disk has to be accessed (as measured by doing a single read), the time is about 800 us. I wrote the database lookup to retrieve all of the pure component properties for a compound at once, and cache them in memory once accessed. The layout of sqlite is such that this is as fast as only retrieving one property.

Loading the sqlite3 module takes ~3 ms. It is delayed until needed.

There is a setting chemicals.data_reader.USE_CONSTANTS_DATABASE which defaults to True. The sqlite database can't be committed to git as it will be updated regularly and therefore bloat up the repository size unreasonably.

Included with this change is further work on the int_CAS setting earlier discussed. The files with CAS numbers using int_CAS are now expected to have the CAS number without the dashes included, i.e. 50000 instead of 50-00-0. Doing the conversion in Python turned out to be quite slow, and this way the Pandas csv reader can read them directly as integers. To make this work, all of the data files had their header corrected to use CAS as the index variable.

There are a few more minor data corrections as well.

This PR does add considerably complexity to the project, but I believe it will be appreciated and will make the project better. In the future, it may be possible to provide something for temperature dependent properties as well.

yoelcortes commented 2 years ago

@CalebBell, this is awesome. I don't think there is much complexity added. I made some minor edits here and there under the assumption that the index must be either a string or an integer. The database_constant_lookup fails in my machine every time with the following exception:

File "...\chemicals\data_reader.py", line 247, in cached_constant_lookup
    CONSTANTS_CURSOR.execute("SELECT * FROM constants WHERE `index`=?", (str(CASi),))

OperationalError: no such table: constants

I have not worked with sql before so I am not able to debug this quickly, but maybe you could help me? I added some code that prevent database lookup after the first failure related to sql searches.

Other than getting the search to work. I think this branch can be merged whenever. Feel free to revert any of my changes that aren't helpful.

Thanks!

CalebBell commented 2 years ago

Hi Yoel! I hope you are well and thank you for the feedback! It is always nice to not be alone. Thank you for taking the time to take a look! I hope this branch can improve your libraries as well as mine. I believe the error you are getting is that the database isn't on your machine. It needs to be generated first: python dev/generate_sqlite_database.py Please let me know if this doesn't work for you.

The database is easy to distribute as part of the pip release. It will take a little more work to get the conda build to ship with it. This is a big part of my reluctance to the approach; but as more and more data is added to the library this type of approach seems like the only one that doesn't unduly increase load time.

CalebBell commented 2 years ago

@yoelcortes have you been able to check if the database generation script works for you?

yoelcortes commented 2 years ago

Hi @CalebBell, sorry about that, I meant to get back to you sooner but was doing some traveling for memorial weekend. Yes, it works well on my computer. I also fixed some bugs I introduced in my past commits (which required the sql file to run). As always, your work on enhancing your ChemE libraries is much appreciated by me and tons others. I am happy I can help and contribute too :)

By the way, I like the new int_CAS parameter. Would it be a good idea to load other files in chemicals using this? I'm not sure if it would add too much overhead, what do you think? Here are some of the files that whose index are not currently converted to iCASs:

Physical Constants of Inorganic Compounds.csv Physical Constants of Organic Compounds.csv IUPACOrganicCriticalProps.tsv CRCCriticalOrganics.tsv Mathews1972InorganicCriticalProps.tsv Appendix to PSRK Revision 4.tsv PassutDanner1973.tsv Yaws Collection.tsv DIPPRPinaMartines.tsv Antoine Collection Poling.tsv Table 2-8 Vapor Pressure of Inorganic and Organic Liquids.tsv Alcock_Itkin_Horrigan_metalic_elements.tsv Wagner Original McGarry.tsv Wagner Collection Poling.tsv Antoine Extended Collection Poling.tsv VDI PPDS Boiling temperatures at different pressures.tsv Florian_Liming_RON_experimental.tsv Florian_Liming_MON_experimental.tsv Florian_Liming_RON_MON_ANN.tsv Travis_Kessler_Combustdb_RON.tsv Travis_Kessler_Combustdb_MON.tsv Travis_Kessler_Combustdb_predictions.tsv Dahmen_Marquardt_ignition_delay_IQT.tsv Poling Dipole.csv cccbdb.nist.gov Dipoles.csv Muller Supporting Info Dipoles.csv psi4_dipoles.tsv Official Global Warming Potentials 2007.tsv Official Global Warming Potentials 2014.tsv Ozone Depletion Potentials.tsv CRC logP table.tsv Syrres logP data.csv.gz PolingDatabank.tsv TRC Thermodynamics of Organic Compounds in the Gas State.tsv CRC Standard Thermodynamic Properties of Chemical Substances.tsv MuleroCachadinaParameters.tsv Jasper-Lange.tsv Somayajulu.tsv SomayajuluRevised.tsv VDI PPDS surface tensions.tsv MagalhaesLJ.tsv PolingLJ.tsv psi4_radius_of_gyrations.tsv chemsep_radius_of_gyrations.tsv psi4_linear.tsv Permittivity (Dielectric Constant) of Liquids.tsv Yaws Boiling Points.tsv OpenNotebook Melting Points.tsv Ghazerati Appendix Vaporization Enthalpy.tsv CRC Handbook Heat of Vaporization.tsv CRC Handbook Heat of Fusion.tsv Ghazerati Appendix Sublimation Enthalpy.tsv Table 2-150 Heats of Vaporization of Inorganic and Organic Liquids.tsv VDI PPDS Enthalpies of vaporization.tsv Alibakhshi one-coefficient enthalpy of vaporization.tsv API TDB Albahri Hf (g).tsv ATcT 1.112 (g).tsv ATcT 1.112 (l).tsv Yaws Hf S0 (g).tsv CRC Handbook Organic RI.csv NFPA 497 2008.tsv IS IEC 60079-20-1 2010.tsv DIPPR T_flash Serat.csv National Toxicology Program Carcinogens.tsv IARC Carcinogen Database.tsv Dutt Prasad 3 term.tsv Viswanath Natarajan Dynamic 3 term.tsv Viswanath Natarajan Dynamic 2 term.tsv Viswanath Natarajan Dynamic 2 term Exponential.tsv Table 2-313 Viscosity of Inorganic and Organic Liquids.tsv Table 2-312 Vapor Viscosity of Inorganic and Organic Substances.tsv VDI PPDS Dynamic viscosity of saturated liquids polynomials.tsv VDI PPDS Dynamic viscosity of gases polynomials.tsv Table 2-314 Vapor Thermal Conductivity of Inorganic and Organic Substances.tsv Table 2-315 Thermal Conductivity of Inorganic and Organic Liquids.tsv VDI PPDS Thermal conductivity of saturated liquids.tsv VDI PPDS Thermal conductivity of gases.tsv Staveley 1981.tsv COSTALD Parameters.tsv Mchaweh SN0 deltas.tsv Perry Parameters 105.tsv CRC Liquid Inorganic Constant Densities.tsv CRC Solid Inorganic Constant Densities.tsv VDI PPDS Density of Saturated Liquids.tsv CRC Inorganics densties of molten compounds and salts.tsv CRC Virial polynomials.tsv

Please feel free to merge whenever, Thanks!

yoelcortes commented 2 years ago

Forgot to mention, I think adding the sql database to the pip installation would be a good idea. I'm not sure how to add it to the conda installation either.

CalebBell commented 2 years ago

Hi Yoel!

We are very lucky - the conda package is built right from the package uploaded to PyPi. Because of that, the default.sqlite database needs to be generated on the machine which uploads the package to pypi. I added a script dev/prerelease.py which creates the database and will do other things related to the release as required. I am very pleased with this development; I didn't expect such a large win. The maintenance overhead will very hopefully be minimal.

The new release is version 1.1.1, as this feels like a major change. I have tested and confirmed both releases ship with the sqlite database.

I am not sure the idea of converting older .tsv files to use integer CAS numbers is worthwhile. In particular, I know of some people indexing into those files using CAS numbers as strings, so it would not be good to break that functionality them. I think pursuing expanding the database with temperature dependent property fits is a larger win, and hope to focus potential future memory reduction efforts into that.

Thank you for your excellent review!

Caleb