Open MatSci ML Toolkit is a framework for prototyping and scaling out deep learning models for materials discovery. It supports widely used materials science datasets and is built on top of PyTorch Lightning, the Deep Graph Library, and PyTorch Geometric.
[Feature request]: Homogenization of data structures and physical representations #104
To ensure consistency in modeling, each dataset in Open MatSciML Toolkit should provide uniform (or near-uniform) kinds of data: for example, stating whether the coordinates provided are fractional or Cartesian, and ensuring every dataset carries enough information to represent each sample in a physically meaningful way, such as periodic boundary conditions (for use in e.g. shift vectors).
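To illustrate why these conventions matter, here is a minimal numpy sketch of the fractional/Cartesian distinction and of periodic shift vectors. The lattice matrix and coordinates below are made up for illustration and do not reflect any toolkit API:

```python
import numpy as np

# Hypothetical 3x3 lattice matrix (rows are lattice vectors, in angstroms)
lattice = np.array([
    [4.0, 0.0, 0.0],
    [0.0, 4.0, 0.0],
    [0.0, 0.0, 4.0],
])

# Fractional coordinates of two atoms in the unit cell
frac_coords = np.array([
    [0.0, 0.0, 0.0],
    [0.5, 0.5, 0.5],
])

# Fractional -> Cartesian: right-multiply by the lattice matrix
cart_coords = frac_coords @ lattice

# Under periodic boundary conditions, a shift vector is an integer
# combination of lattice vectors; here, the periodic image of atom 1
# one cell over along the first lattice vector:
image = np.array([1, 0, 0])
shift_vector = image @ lattice
shifted = cart_coords[1] + shift_vector
```

Without knowing whether stored coordinates are fractional or Cartesian, the first conversion step is ambiguous, and without the cell matrix the shift vectors cannot be computed at all.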
Request attributes
[X] Would this be a refactor of existing code?
[ ] Does this proposal require new package dependencies?
[ ] Would this change break backwards compatibility?
[ ] Does this proposal include a new model?
[ ] Does this proposal include a new dataset?
[ ] Does this proposal include a new task/workflow?
Related issues
No response
Solution description
A good place to start would be to make sure each devset, and subsequently any serialized dataset we have, conforms to the following:
- Check whether the coordinates are fractional or Cartesian (values outside the range [0, 1] suggest Cartesian).
- Check that we have enough information to construct a Lattice object; this can be just a cell key, or the lattice parameters as in Materials Project.
- Generally, print and list out the keys in each sample and construct a table of them, so that we can help contribute to #97.

We should also check other projects, like Colabfit, to see to what extent we can conform to community standards, too.
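The survey steps above could be sketched roughly as follows. This is a heuristic sketch only: the key names (`pos`, `cell`, `lattice_params`) are assumptions for illustration, not the toolkit's actual sample schema:

```python
import numpy as np


def survey_sample(sample: dict) -> dict:
    """Summarize one data sample for the homogenization survey.

    Heuristics only: coordinates lying entirely inside [0, 1] are
    *likely* fractional; anything else is treated as Cartesian.
    """
    report = {"keys": sorted(sample.keys())}

    # Step 1: guess whether coordinates are fractional or Cartesian
    coords = np.asarray(sample.get("pos", []))
    if coords.size:
        in_unit_box = bool((coords >= 0.0).all() and (coords <= 1.0).all())
        report["likely_fractional"] = in_unit_box

    # Step 2: do we have enough information for a Lattice object?
    # Either a 3x3 cell matrix, or the six lattice parameters
    # (a, b, c, alpha, beta, gamma) as in Materials Project.
    report["has_lattice_info"] = "cell" in sample or "lattice_params" in sample

    return report
```

Running this over each devset and tabulating the resulting reports would give the per-dataset key table the survey calls for.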
Additional notes
Can't assign Bin yet, but it would be good for him to aggregate this information and, together with @melo-gonzo, help craft PRs to address things after the survey is done.