IntelLabs / matsciml

Open MatSci ML Toolkit is a framework for prototyping and scaling out deep learning models for materials discovery supporting widely used materials science datasets, and built on top of PyTorch Lightning, the Deep Graph Library, and PyTorch Geometric.
MIT License

[Feature request]: Refactor `MaterialsProjectDataset` to not serialize `pymatgen` `Structures` in LMDB #267

Open laserkelvin opened 2 months ago

laserkelvin commented 2 months ago

Feature/behavior summary

Currently, the workflow implemented for MaterialsProjectDataset saves and reloads pymatgen.Structure objects. The problem is that the serialized objects are tightly coupled to the installed version of pymatgen: even small API changes can make it difficult to reload the dataset in later versions.

Request attributes

Related issues

No response

Solution description

If we refactor it so that Structures are created at load time, in line with other dataset implementations, it would remove this dependency on the pymatgen version and stop the stored datasets from breaking.

We would have to re-process the existing LMDBs being distributed, and make sure that the data is stored as just plain coordinates, atoms, and lattice parameters.
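
As a rough illustration of the idea (not the toolkit's actual implementation), the sketch below flattens a structure into plain Python types before it is written out, and rebuilds the pymatgen.Structure only when a sample is loaded. The function names and record keys (`structure_to_record`, `encode_record`, `decode_record`, `record_to_structure`, `atomic_numbers`, `frac_coords`, `lattice_abc`, `lattice_angles`) are hypothetical. JSON is used only to keep the example dependency-free, and the serialization used in the distributed LMDBs may differ. Fractional coordinates are assumed here so that the six lattice parameters are enough to rebuild the cell.

```python
import json
from typing import Any


def structure_to_record(structure: Any) -> dict:
    """Flatten a pymatgen Structure into plain lists and numbers.

    Only long-standing attributes are read (atomic_numbers, frac_coords,
    lattice.abc, lattice.angles), so nothing pymatgen-specific is stored.
    """
    lattice = structure.lattice
    return {
        "atomic_numbers": [int(z) for z in structure.atomic_numbers],
        "frac_coords": [[float(x) for x in row] for row in structure.frac_coords],
        "lattice_abc": [float(x) for x in lattice.abc],
        "lattice_angles": [float(x) for x in lattice.angles],
    }


def encode_record(record: dict) -> bytes:
    """Serialize a plain record to bytes, e.g. for use as an LMDB value."""
    return json.dumps(record).encode("utf-8")


def decode_record(blob: bytes) -> dict:
    """Inverse of encode_record; needs no pymatgen installation."""
    return json.loads(blob.decode("utf-8"))


def record_to_structure(record: dict):
    """Rebuild a pymatgen Structure at load time from a plain record."""
    # Imported lazily so the stored bytes never depend on the pymatgen version.
    from pymatgen.core import Lattice, Structure

    a, b, c = record["lattice_abc"]
    alpha, beta, gamma = record["lattice_angles"]
    lattice = Lattice.from_parameters(a, b, c, alpha, beta, gamma)
    # Coordinates are fractional, which is the default for Structure.
    return Structure(lattice, record["atomic_numbers"], record["frac_coords"])
```

With this layout, re-processing an existing LMDB would amount to calling `structure_to_record` once per entry in an environment that can still load the old data, then storing the result of `encode_record`. The dataset's load path would call `decode_record` followed by `record_to_structure`.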

Additional notes

No response