Closed chrisiacovella closed 9 months ago
A weird Mamba issue on macOS is causing CI failures there; I'm not sure what is going on. CI is successful on Linux.
I think we can merge this once it is reviewed. The QCArchive SPICE dataset will be a separate PR. As I noted, this changes the structure of the hdf5 datasets a little (and the process for loading them), so it would be good to finalize this part now so that we don't have to retrain a lot of models later.
Description
Changes to hdf5 file formats.
For efficiency in reading/writing, conformers will remain grouped, e.g., geometry will be an m x n x 3 array, where m is the number of conformations and n is the number of atoms. Splitting into individual conformers will be done when we read the curated datafile. Each dataset is tagged with an attribute denoting whether it is a series or not, to ease parsing (this allows the reading routines to be more general).
Curation has also been switched to use a base class, to standardize the input and save some effort on unit tests.
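To make the layout concrete, here is a minimal sketch of writing grouped conformers and splitting them at read time. It is only an illustration: the attribute name `is_series`, the group name `mol_0`, the property names, and the helpers `write_record` and `split_conformers` are hypothetical and not necessarily what this PR uses. The file is kept in memory so nothing is written to disk.

```python
import h5py
import numpy as np


def write_record(group, name, value, is_series):
    """Store one property and tag it as per-conformer (series) or not."""
    ds = group.create_dataset(name, data=value)
    # Hypothetical attribute name; the PR only says such a tag exists.
    ds.attrs["is_series"] = is_series


def split_conformers(group):
    """Return one dict per conformer, using the tag to decide how to slice."""
    series_keys = [k for k in group if group[k].attrs["is_series"]]
    n_conf = group[series_keys[0]].shape[0]
    conformers = []
    for i in range(n_conf):
        entry = {}
        for key in group:
            ds = group[key]
            # Series data is indexed by conformer; other data is shared as-is.
            entry[key] = ds[i] if ds.attrs["is_series"] else ds[()]
        conformers.append(entry)
    return conformers


# In-memory HDF5 file (core driver, no backing store).
with h5py.File("example.hdf5", "w", driver="core", backing_store=False) as f:
    mol = f.create_group("mol_0")
    write_record(mol, "atomic_numbers", np.array([8, 1, 1]), is_series=False)
    write_record(mol, "geometry", np.zeros((4, 3, 3)), is_series=True)  # m x n x 3
    write_record(mol, "energy", np.arange(4.0), is_series=True)
    conformers = split_conformers(mol)
```

Because the reader only looks at the tag, it does not need to know in advance which properties are per-conformer, which is what lets the routine stay general across datasets.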
I still need to update the base hdf5 reader in the dataset class to check for NaN values (the code still needs to be merged into the module). The desired behavior is simply to skip a conformer if any of the desired properties is NaN.
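A rough sketch of that NaN check, assuming the series properties for one molecule have already been loaded as NumPy arrays whose first axis is the conformer index. The function name `drop_nan_conformers` and the property names are hypothetical; this is not the code that will be merged.

```python
import numpy as np


def drop_nan_conformers(properties):
    """Keep only conformers for which no property contains NaN.

    `properties` maps a property name to an array whose first axis
    indexes conformers. Returns the filtered dict and the boolean mask.
    """
    n_conf = len(next(iter(properties.values())))
    keep = np.ones(n_conf, dtype=bool)
    for values in properties.values():
        flat = np.asarray(values, dtype=float).reshape(n_conf, -1)
        # A conformer is dropped if any element of any property is NaN.
        keep &= ~np.isnan(flat).any(axis=1)
    filtered = {k: np.asarray(v)[keep] for k, v in properties.items()}
    return filtered, keep


geometry = np.zeros((3, 2, 3))
geometry[1, 0, 2] = np.nan           # conformer 1 has a bad coordinate
energy = np.array([1.0, 2.0, np.nan])  # conformer 2 has a missing energy

filtered, keep = drop_nan_conformers({"geometry": geometry, "energy": energy})
```

Building a single mask across all properties keeps the remaining arrays aligned, so geometry and energy still refer to the same conformers after filtering.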
Todos
Notable points that this PR has either accomplished or will accomplish.
Status