choderalab / modelforge

Infrastructure to implement and train NNPs
https://modelforge.readthedocs.io/en/latest/
MIT License

dataset curation update #11

Closed chrisiacovella closed 9 months ago

chrisiacovella commented 10 months ago

Description

Changes to hdf5 file formats.

For efficiency in reading/writing, conformers will remain grouped, e.g., geometry will be an m x n x 3 array, where m is the number of conformations and n is the number of atoms. Parsing into individual conformers will be done when we read the curated datafile. Each dataset is tagged with an attribute denoting whether it is a series or not, to ease parsing (this allows the reading routines to be more general).
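To make the layout concrete, here is a minimal sketch using `h5py`. The attribute name `format`, its values `"series"` and `"single"`, the property names, and both helper functions are assumptions made for illustration; they are not necessarily what this PR uses.

```python
import h5py
import numpy as np


def write_molecule(h5file, name, geometry, energies, atomic_numbers):
    """Store one molecule with all of its conformers grouped together.

    geometry has shape (m, n, 3): m conformers, n atoms.
    The "format" attribute is a hypothetical tag, not the PR's actual name.
    """
    group = h5file.create_group(name)

    ds = group.create_dataset("geometry", data=geometry)
    ds.attrs["format"] = "series"  # one entry per conformer

    ds = group.create_dataset("energies", data=energies)
    ds.attrs["format"] = "series"  # one entry per conformer

    ds = group.create_dataset("atomic_numbers", data=atomic_numbers)
    ds.attrs["format"] = "single"  # shared by every conformer


def iter_conformers(group):
    """Yield one dict per conformer, splitting only the 'series' datasets."""
    series, single = {}, {}
    for key, ds in group.items():
        target = series if ds.attrs["format"] == "series" else single
        target[key] = ds[()]  # read the whole dataset into memory once

    n_conformers = len(next(iter(series.values())))
    for i in range(n_conformers):
        record = dict(single)
        record.update({key: values[i] for key, values in series.items()})
        yield record
```

Because the reader branches only on the attribute, it needs no per-property logic: any new property is handled as long as it carries the tag.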

Curation has also been switched to use a base class, to standardize the input and save some effort on unit tests.
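As a rough sketch of the idea (not the actual class in this PR), a base class can own the shared constructor arguments and the public entry point, so each dataset only implements its own parsing. All names here (`CurationBase`, `_parse_raw`, `process`, `ToyCuration`) are hypothetical.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class CurationBase(ABC):
    """Hypothetical base class: shared inputs and workflow for all curators."""

    def __init__(self, hdf5_file_name, output_file_dir):
        # Every dataset-specific curator receives the same standardized input.
        self.hdf5_file_name = hdf5_file_name
        self.output_file_dir = Path(output_file_dir)
        self.data = []

    @abstractmethod
    def _parse_raw(self):
        """Return a list of per-molecule dicts; implemented per dataset."""

    def process(self):
        # Shared entry point, so unit tests can exercise any subclass the same way.
        self.data = self._parse_raw()
        return self.data


class ToyCuration(CurationBase):
    """Illustrative subclass that returns a fixed record."""

    def _parse_raw(self):
        return [{"name": "water", "n_conformers": 2}]
```

With this shape, tests for the shared behavior can be written once against the base interface and reused for each dataset.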

I still need to update the base hdf5 reader in the dataset class to check for NaN (the code still needs to be merged into the module). The desired behavior is to simply not add a conformer if any of the desired properties are NaN.
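The intended NaN handling could look something like the sketch below. It assumes every property passed in is a series with the conformer index as the first axis; the function name and the dict-of-arrays input are illustrative, not the reader's actual interface.

```python
import numpy as np


def drop_nan_conformers(properties, required):
    """Drop every conformer that has a NaN in any required property.

    properties: dict mapping property name -> array with shape (m, ...),
                where m is the number of conformers (assumed layout).
    required:   names of the properties that must be NaN-free.
    Returns the filtered dict and the boolean keep-mask.
    """
    n_conformers = len(properties[required[0]])
    keep = np.ones(n_conformers, dtype=bool)

    for key in required:
        # Flatten each conformer's values so one NaN anywhere rejects it.
        values = np.asarray(properties[key], dtype=float).reshape(n_conformers, -1)
        keep &= ~np.isnan(values).any(axis=1)

    filtered = {key: np.asarray(val)[keep] for key, val in properties.items()}
    return filtered, keep
```

Only the properties listed in `required` are checked, so a NaN in a property the model does not use would not discard the conformer.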


codecov-commenter commented 10 months ago

Codecov Report

Merging #11 (3b5d03c) into main (ac94836) will decrease coverage by 5.94%. The diff coverage is 78.20%.

chrisiacovella commented 10 months ago

Weird Mamba issue on macOS causing failures. Not sure what is going on. CI is successful on Linux.

chrisiacovella commented 10 months ago

I think we can merge this once it is reviewed. The QCArchive SPICE dataset will be a separate PR. As I noted, this changes the structure of the hdf5 datasets a little (and the process for loading them), so it would be good to finalize this part so that we don't have to retrain a lot of models.