choderalab / modelforge

Infrastructure to implement and train NNPs
https://modelforge.readthedocs.io/en/latest/
MIT License

Curation #6

Closed chrisiacovella closed 10 months ago

chrisiacovella commented 11 months ago

Description

This software will rely on datasets that we have curated and stored as hdf5 files. To generate these files, we must parse the original sources. This PR provides a first pass, creating a class for the curation of qm9.

Different datasets will be stored in different formats, with different variable names, different measures, etc., so it will be challenging to define a single reusable class. Since we do not expect a user to ever need to call these functions (they are included for transparency and to capture provenance), we need not worry too much about the generality of this part of the code.

The hdf5 writing class should be sufficiently general and work for all datasets.

A few notes:

All parameters get tagged with units (via openff-units). This will allow us to easily write to the hdf5 file with a desired set of units. Note that in the hdf5 files the units are added as attributes (stored as strings).
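As a rough illustration of the "units as string attributes" idea, here is a minimal sketch using h5py and numpy only. The helper names `write_with_units` and `read_with_units`, the dataset name, and the attribute key "units" are hypothetical and not taken from the PR; in the PR the unit string would presumably come from the openff-units quantity rather than being passed in by hand.

```python
import h5py
import numpy as np


def write_with_units(path, name, values, units):
    """Write `values` as dataset `name` and record `units` as a string attribute.

    Hypothetical helper: the attribute key "units" is an assumption.
    """
    with h5py.File(path, "w") as hf:
        dset = hf.create_dataset(name, data=np.asarray(values, dtype=float))
        dset.attrs["units"] = units  # stored as a plain string


def read_with_units(path, name):
    """Return the stored array together with its unit string."""
    with h5py.File(path, "r") as hf:
        dset = hf[name]
        return dset[()], dset.attrs["units"]
```

A reader can then re-attach units to the raw array from the returned string, for example by parsing it with a units library.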

The hdf5 reader in dataset.py has also been revised for improved efficiency. Given the size of the files, gzipping them provides substantial savings (important for speeding up downloads); however, reading the gzipped files directly was proving to be a substantial bottleneck. Uncompressed files could be loaded in a few minutes, whereas the gzipped files took an hour.

The initial approach was as follows:

with gzip.open(self.raw_data_file, "rb") as gz_file, h5py.File( gz_file, "r") as hf:

This has been replaced with the following, which writes an uncompressed copy of the file alongside the .gz file (similar to running gzip -d file.gz from the command line or via a system call):

with gzip.open(self.raw_data_file, "rb") as gz_file:
    with open(self.raw_data_file.replace(".gz", ""), "wb") as out_file:
        shutil.copyfileobj(gz_file, out_file)
        with h5py.File(self.raw_data_file.replace(".gz", ""), "r") as hf:

This provides nearly identical performance to using the uncompressed file.
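The decompress-then-open step described above can be sketched as a small standalone helper. This is a sketch under assumptions, not the code in dataset.py: the function name `decompress_gz` is hypothetical, and unlike the snippet quoted above it closes the output file before returning, so the copy is fully flushed before anything reads it.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz(gz_path):
    """Write an uncompressed copy of `gz_path` next to it and return the new path.

    Hypothetical helper; the original .gz file is left in place.
    """
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path.name}")
    out_path = gz_path.with_suffix("")  # e.g. "qm9.hdf5.gz" -> "qm9.hdf5"
    with gzip.open(gz_path, "rb") as gz_file, open(out_path, "wb") as out_file:
        # Stream in chunks so the whole file never has to fit in memory.
        shutil.copyfileobj(gz_file, out_file)
    # Both handles are closed here, so the copy on disk is complete.
    return out_path
```

The returned path can then be passed to h5py.File(..., "r"), which reads the uncompressed copy with ordinary random access instead of seeking through a gzip stream.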

Todos

Note: do ANI curation in a separate PR. After this PR is merged, open a new PR fixing dataset to handle the slight changes to the QM9 file structure.

Status

chrisiacovella commented 10 months ago

Unit conversion is handled on a per-quantity basis; the output units are hardcoded in the constructor so that the hdf5 files are written in our desired units. The unit conversion step can be skipped via a simple flag in the constructor (i.e., keeping parameters in their original units, which may be useful for validation).
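A minimal sketch of that pattern, for illustration only: the class name `CurationSketch`, the `convert_units` flag name, the quantity names, and the choice of source and target units are all assumptions, and the conversion factors are written out by hand here instead of going through openff-units as the PR does.

```python
# Hand-written conversion factors, used only in this sketch.
ANGSTROM_TO_NANOMETER = 0.1
HARTREE_TO_KJ_PER_MOL = 2625.4996


class CurationSketch:
    """Hypothetical stand-in for a curation class with hardcoded output units."""

    def __init__(self, convert_units=True):
        # Flag to skip conversion and keep values in their original units.
        self.convert_units = convert_units
        # Assumed units of the original source, per quantity.
        self.source_units = {"geometry": "angstrom", "energy": "hartree"}
        # Assumed target units, per quantity: (unit label, factor from source).
        self.output_units = {
            "geometry": ("nanometer", ANGSTROM_TO_NANOMETER),
            "energy": ("kilojoule / mole", HARTREE_TO_KJ_PER_MOL),
        }

    def process(self, name, value):
        """Return (value, unit label), converted unless the flag disables it."""
        if not self.convert_units:
            return value, self.source_units[name]
        label, factor = self.output_units[name]
        return value * factor, label
```

With the flag left at its default, every quantity comes out in the hardcoded target units; constructing the object with the flag off returns the values untouched, which is the mode that would be used to check them against the original source.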