CalebBell / chemicals

chemicals: Chemical database of Chemical Engineering Design Library (ChEDL)
MIT License
186 stars 36 forks source link

Add four more sources of constants data - Wikidata, NIST Webbook, Common Chemistry, and the Joback group contribution method #39

Closed CalebBell closed 2 years ago

CalebBell commented 2 years ago

Creating a PR to observe if tests pass on other platforms; not ready yet.

CalebBell commented 2 years ago

Hi Yoel, This PR adds the following data sources:

It makes the following technical changes:

Also included are the following scripts:

A script for NIST WebBook is not yet available, and I hope to add further data for non-constant properties from them as well.

Could you give this PR a quick look and let me know what you think? I am trying to expand the number of chemicals with available data.

Sincerely, Caleb

yoelcortes commented 2 years ago

Hi Caleb,

I see you have been busy! These updates all look very exciting. I'll have a detailed look within the next day.

Thanks!

yoelcortes commented 2 years ago

Hi Caleb! The updates look great overall!

I made some minor changes to prevent attempting CAS_to_int if the user passes an integer (which is an update hope to make in thermo during chemical creation). Let me know if there are any issues with my implementation.

I added a "low memory" run to the github workflow, which shows some issues with the test. A few of the issues are trivial (like dataframe shapes not matching), but others may need a little more time to solve. But all the tests are passing without "low memory".

I'll be flying back to the US and will be a little busy until January 7, so I won't get a chance to address the test issues until then.

Wish you happy holidays and a happy new year! Thanks,

CalebBell commented 2 years ago

Hello Yoel!

Thank you very much for the multitude of improvements. If I understand what you are after, you would like to be able to do "Tc(CASRN=64175)" and get the critical temperature? I think that is pretty doable with a few more changes if so. Updating the documentation and adding a few more tests is probably the hardest part to making that happen. I'm glad you noticed the CHEDL_LOW_MEMORY environmental variable I was playing around with. I was curious how much memory could be saved if more strings were removed, so I added a flag. It wasn't my intention to be providing this as a feature to users. The flag saves about 1/3 of the memory used, or 6 MB. Longer term, I hope to be able to reduce memory use for most users in the following two ways:

Your point about the tests not passing with that variable set was expected for me. The flag essentially removes some of the functionality of the library, so some of the tests should be expected to fail. I was able to make them pass through a few fixes and replacing the previously deleted column with an empty sparse column. I like the pytest option to run the tests with a changed environmental variable; nice! I definitely could have communicated better what I was trying to do with the flag though. With my hopefully-better explanation, do you think it is still necessary to add the CI for the low memory mode? I wouldn't want to update documentation and point users towards this flag at this point.

I wish you a happy holidays and new year as well! Caleb

yoelcortes commented 2 years ago

Hi Caleb,

OK! I like the flag and may use it myself in the near future. The points you mentioned on reducing memory use would be wonderful too.

It's great that you were able to get the tests passing with low memory! I agree with not updating documentation and pointing users towards the new flag yet. If we intend to keep the CHEDL_LOW_MEMORY variable, I think keeping the flag in the CI may be better in the long run. But if there is a good chance we remove it in favor of something else, we can remove it from the CI (your call).

Thinking for BioSTEAM, I might include a function to delete all the dataframes in chemicals after loading the thermodynamic property packages. The function would also set "_loaded" attributes to false. This would save almost 100% of the memory. But using the flag may be useful for more dynamic efforts to keep the dataframes.

Thanks!