IntelLabs / matsciml

Open MatSci ML Toolkit is a framework for prototyping and scaling out deep learning models for materials discovery supporting widely used materials science datasets, and built on top of PyTorch Lightning, the Deep Graph Library, and PyTorch Geometric.
MIT License

[Feature request]: Homogenization of data structures and physical representations #104

Open laserkelvin opened 9 months ago

laserkelvin commented 9 months ago

Feature/behavior summary

To ensure consistency in modeling, each dataset in Open MatSciML Toolkit should provide uniform (or near-uniform) kinds of data. For example, it should be clear whether the provided coordinates are fractional or Cartesian, and every dataset should carry enough information to represent each data sample in a physically meaningful way, such as periodic boundary conditions (for use in e.g. shift vectors).

Request attributes

Related issues

No response

Solution description

A good place to start would be to run the following checks on each devset, and subsequently on any serialized datasets we have:

  1. Check whether the coordinates are fractional or not (if there are values outside of 0 and 1, they're likely Cartesian).
  2. Check that we have enough information to create a Lattice object: this can be just a cell key, or the lattice parameters, as in Materials Project.
  3. Print and list the keys in each sample and construct a table of them, so that we can help contribute to #97.
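
As a rough illustration of the three checks above, here is a minimal sketch that assumes each sample is a plain dictionary. The key names (`cell`, `lattice`, `a`, `b`, `c`, `alpha`, `beta`, `gamma`) and the helper functions are hypothetical and not part of the toolkit's API; the real datasets may name things differently. The fractional check is only a heuristic: Cartesian coordinates of a small structure can fall entirely within 0 and 1, and unwrapped fractional coordinates can fall outside that range.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping, Sequence

# Hypothetical key names; adjust to whatever each dataset actually uses.
CELL_KEYS = ("cell", "lattice")
LATTICE_PARAM_KEYS = ("a", "b", "c", "alpha", "beta", "gamma")


def looks_fractional(coords: Sequence[Sequence[float]], tol: float = 1e-6) -> bool:
    """Heuristic: True if every coordinate value lies within [0, 1] (up to tol)."""
    return all(-tol <= x <= 1.0 + tol for row in coords for x in row)


def has_lattice_info(sample: Mapping[str, Any]) -> bool:
    """True if the sample has a cell-like key or a full set of lattice parameters."""
    if any(key in sample for key in CELL_KEYS):
        return True
    return all(key in sample for key in LATTICE_PARAM_KEYS)


def key_table(samples: Iterable[Mapping[str, Any]]) -> Dict[str, int]:
    """Count how many samples contain each key, as raw material for a key table."""
    counts: Counter = Counter()
    for sample in samples:
        counts.update(sample.keys())
    return dict(counts)
```

Running `key_table` over a handful of samples from each devset would give per-dataset key counts that can be merged into one comparison table.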

We should also check other projects, like ColabFit, to see to what extent we can conform to community standards, too.

Additional notes

Can't assign Bin yet, but it would be good for Bin to aggregate information, and for him and @melo-gonzo to help craft PRs to address things after the survey is done.

bmuaz commented 9 months ago

I had the same thoughts about the data structures and will be happy to work on it with Carmelo.