anthony-wang / CrabNet

Predict materials properties using only the composition information!
https://doi.org/10.1038/s41524-021-00545-1
MIT License

Passing DataFrames as args, no architecture format (i.e. non-platform specific), black code style, removal of large output data, backwards compatible #16

Closed sgbaird closed 2 years ago

sgbaird commented 3 years ago

Hey Anthony,

I'm opening this huge pull request because I've been sitting on a bunch of changes to CrabNet and figured it would be better to get some discussion started sooner rather than later. To give some background, the fork exists as a submodule of mat_discover. As the title describes, the core functionality no longer depends on relative or absolute file paths from the user's perspective (#11) and instead follows an automatminer-style format that I think is more portable (while still maintaining backwards compatibility). I replaced a number of things that were platform-specific (usually related to file paths), reformatted the repo in black code style, and removed a large portion of the output data, which had pushed the combined, compressed file size of mat_discover, CrabNet, and ElM2D over the Test PyPI upload limit (100 MB). I refactored mat_discover to the point of being pip-installable and am almost done with the CI workflow, so CrabNet should also be ready to deploy standalone after some minimal changes to import statements (i.e. replacing mat_discover.CrabNet with CrabNet). I've been using flit to deal with PyPI, and it's been fairly user-friendly.

As I mentioned, I made a large effort to keep it backwards compatible with the original CrabNet instructions. In other words, you should still be able to use a materials_data/data/<property> folder with train.csv, val.csv, and test.csv per the README instructions. Right now, the fit() and predict() style methods (#14) look like:

crabnet_model = get_model(mat_prop=mat_prop_name, train_df=train_df)  # fit
val_true, val_pred, _, val_sigma = crabnet_model.predict(val_df)  # predict

I think it would still be nice to get these into a more standard sklearn-esque format, e.g. wrap get_model with a mdl = Model.fit(train_df) method and a val_pred = mdl.predict(val_df) method, with an optional kwarg for outputting the uncertainty.
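To make the sklearn-esque idea concrete, here is a rough sketch of what such a wrapper could look like. Everything in it is hypothetical: the class name CrabNetRegressor, the return_uncertainty kwarg, and the stand-in fake_get_model are illustration only and are not part of CrabNet. The stand-in only mimics the call shape shown above (get_model(mat_prop=..., train_df=...) returning an object whose predict() gives a 4-tuple); it assumes "formula" and "target" columns and accepts any mapping with those keys, such as a DataFrame.

```python
import statistics


class CrabNetRegressor:
    """Hypothetical sklearn-style wrapper around a get_model-like function."""

    def __init__(self, get_model, mat_prop="property"):
        # get_model is injected so the sketch does not depend on CrabNet itself.
        self.get_model = get_model
        self.mat_prop = mat_prop
        self.model_ = None

    def fit(self, train_df):
        # Training happens inside get_model, mirroring the snippet above.
        self.model_ = self.get_model(mat_prop=self.mat_prop, train_df=train_df)
        return self  # returning self allows CrabNetRegressor(...).fit(df)

    def predict(self, df, return_uncertainty=False):
        if self.model_ is None:
            raise RuntimeError("Call fit() before predict().")
        # The underlying predict returns (true, pred, <unused>, sigma).
        _, pred, _, sigma = self.model_.predict(df)
        return (pred, sigma) if return_uncertainty else pred


class _MeanModel:
    """Stand-in model (assumption): predicts the training mean for every row."""

    def __init__(self, train_df):
        targets = list(train_df["target"])
        self.mean = statistics.mean(targets)
        self.std = statistics.stdev(targets)

    def predict(self, df):
        true = list(df["target"])
        n = len(true)
        # Third element is a placeholder for whatever the real model returns there.
        return true, [self.mean] * n, list(df["formula"]), [self.std] * n


def fake_get_model(mat_prop, train_df):
    """Stand-in for get_model with the same keyword arguments."""
    return _MeanModel(train_df)
```

With the real get_model passed in place of fake_get_model, usage would read mdl = CrabNetRegressor(get_model, mat_prop=mat_prop_name).fit(train_df), followed by val_pred, val_sigma = mdl.predict(val_df, return_uncertainty=True).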

Replacing the output csv files might bring it below the 100 MB compressed limit for Test PyPI, but the CrabNet repo is pretty large as is, so I think it might be better to put this output data on figshare and give instructions or a script for grabbing the data. I also implemented a function in mat_discover such that you can grab CrabNet data from wherever using open_text even after it's been installed via pip with some optional sugar for dividing into train/val or train/val/test splits.
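For reference, the splitting sugar could look roughly like the sketch below. This is not the mat_discover implementation: load_rows, split_rows, and their parameters are hypothetical names, and I'm assuming open_text here means importlib.resources.open_text, which returns a text handle for a file bundled inside an installed package. The sketch takes any text handle and sticks to the standard library so it stands alone; passing the same handle to pandas.read_csv would give a DataFrame instead.

```python
import csv
import io
import random


def load_rows(handle):
    """Read CSV rows from any text handle into a list of dicts.

    In an installed package, the handle could come from
    importlib.resources.open_text(<package>, "train.csv") (assumption).
    """
    return list(csv.DictReader(handle))


def split_rows(rows, val_frac=0.1, test_frac=0.0, seed=0):
    """Shuffle rows and split them into (train, val, test) lists.

    test_frac=0.0 gives a train/val split with an empty test list.
    """
    rows = list(rows)  # copy so the caller's list is not shuffled in place
    random.Random(seed).shuffle(rows)
    n_val = int(val_frac * len(rows))
    n_test = int(test_frac * len(rows))
    val = rows[:n_val]
    test = rows[n_val:n_val + n_test]
    train = rows[n_val + n_test:]
    return train, val, test


# Example with an in-memory CSV standing in for a packaged data file.
example_csv = "formula,target\n" + "".join(f"X{i},{i}\n" for i in range(10))
example_rows = load_rows(io.StringIO(example_csv))
train_rows, val_rows, test_rows = split_rows(example_rows, val_frac=0.1, test_frac=0.1)
```

Fixing the seed keeps the split reproducible between runs, which matters if the validation set is used for model selection.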

Feel free to push back on any of these changes. I mainly made these changes for mat_discover and the things that come next. Interested to hear your thoughts and let me know via Slack if you want to meet over Zoom to discuss or see some of the functionality in action.

Sterling