Curation

Description

This software will rely on datasets that we have curated and stored as hdf5 files. To generate these files, we must parse the original sources. This PR provides a first pass, creating a class for the curation of qm9.

Because different datasets are stored in different formats, with different variable names, different measures, etc., it will be challenging to define a single reusable class. Since we do not expect a user to ever need to call these functions (they are included for transparency and to capture provenance), we need not worry too much about the generality of this part of the code.

The hdf5 writing class should be sufficiently general and work for all datasets.

A few notes:

All parameters get tagged with units (via openff-units). This will allow us to easily write to the hdf5 file with a desired set of units. Note, hdf5 files get the units added as attributes (stored as strings).

The hdf5 reader in dataset.py has also been revised for improved efficiency. Given the size of the files, gzipping them provides substantial savings (important for speeding up downloads). However, reading the gzipped files directly was proving to be a substantial bottleneck: uncompressed files could be loaded in a few minutes, whereas the gzipped files took an hour.

The initial approach was as follows:

with gzip.open(self.raw_data_file, "rb") as gz_file, h5py.File(gz_file, "r") as hf:

This has been replaced with the following, which writes an uncompressed copy of the file alongside the .gz file (similar to running gzip -d file.gz from the command line or via a system call):

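# Decompress to a sibling file first.
with gzip.open(self.raw_data_file, "rb") as gz_file:
    with open(self.raw_data_file.replace(".gz", ""), "wb") as out_file:
        shutil.copyfileobj(gz_file, out_file)
# out_file is closed (and flushed) here, before h5py opens it for reading.
with h5py.File(self.raw_data_file.replace(".gz", ""), "r") as hf: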

This provides nearly identical performance to using the uncompressed file.
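
For reference, the decompress-then-read step can be sketched as a small standalone helper. This is only an illustration under stated assumptions: the function name decompress_gz is hypothetical and not part of this PR, and it uses pathlib to strip only the trailing .gz suffix instead of the str.replace call shown above.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz(gz_path):
    """Write an uncompressed copy of gz_path next to it and return its path.

    Hypothetical helper for illustration; not the code in dataset.py.
    """
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path}")
    # with_suffix("") drops only the trailing ".gz"; str.replace(".gz", "")
    # would also alter a ".gz" that occurs elsewhere in the path.
    out_path = gz_path.with_suffix("")
    with gzip.open(gz_path, "rb") as gz_file, open(out_path, "wb") as out_file:
        # Stream in chunks so the whole file never has to fit in memory.
        shutil.copyfileobj(gz_file, out_file)
    # Both handles are closed here, so the output is fully flushed to disk.
    return out_path
```

The returned path could then be handed to h5py.File(out_path, "r"), so that h5py reads an ordinary uncompressed file on disk and does not have to seek through a gzip stream.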

Todos

[x] Add unit tests.
[x] Add the ability to output the hdf5 file in a specific unit base. Decided on a per-quantity basis (i.e., adding "output units" to the property specification, alongside the input units already in the code).
[x] Add data files to g_drive for testing.

Note: ANI curation will be done in a separate PR. After this PR is merged, a new PR will fix dataset to handle slight changes to the QM9 file structure.

Status

[x] Ready to go

choderalab / modelforge

Curation #6
