choderalab / modelforge

Infrastructure to implement and train NNPs
https://modelforge.readthedocs.io/en/latest/
MIT License

File hashes #84

Closed by chrisiacovella 2 months ago

chrisiacovella commented 3 months ago

When we go to train, we will load up a curated dataset, creating local files that we cache for future use. We need to store and compare the hash of these files to ensure that we are not working with an incorrect or partially generated file. This should be trivial for the gzipped hdf5 files, but for the npz files we generate locally, we will probably need to generate a metadata file after creation that stores the hash, since this hash is not known beforehand.
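
For reference, a minimal sketch of what that hash bookkeeping could look like (not modelforge's actual implementation; the helper names are just for illustration): compute a SHA-256 digest of each cached file, and for locally generated files write the digest into a small JSON sidecar that can be checked on the next run.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _meta_path(data_file: Path) -> Path:
    # Sidecar metadata file, e.g. dataset.npz -> dataset.npz.meta.json
    return data_file.parent / (data_file.name + ".meta.json")


def write_hash_metadata(data_file: Path) -> None:
    """Record the hash of a freshly generated file (e.g. a local .npz cache)."""
    meta = {"filename": data_file.name, "sha256": file_sha256(data_file)}
    _meta_path(data_file).write_text(json.dumps(meta))


def verify_hash_metadata(data_file: Path) -> bool:
    """True only if the file and its sidecar exist and the hashes match."""
    meta_file = _meta_path(data_file)
    if not data_file.exists() or not meta_file.exists():
        return False
    recorded = json.loads(meta_file.read_text()).get("sha256")
    return recorded == file_sha256(data_file)
```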

chrisiacovella commented 2 months ago

The general sequence of the data loader is:

1. Ideally, use the .npz file if it exists and skip all the rest.
2. If the .npz doesn't exist, check whether the .hdf5 file exists.
3. If not, check whether the .hdf5.gz exists.
4. If not, download it.

However, we can't just rely on seeing that a file exists; we need to make sure it is the correct file (see the sketch below).
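
A rough sketch of that cascade, reusing the hashing helpers from the earlier comment. `download_dataset` and `build_npz_from_hdf5` are hypothetical placeholders; the real logic lives in the modelforge dataset code and PR #91. The key point is that every branch verifies a hash, not just file existence.

```python
import gzip
import shutil
from pathlib import Path


def get_cached_dataset(cache_dir: Path, name: str, expected_gz_sha256: str) -> Path:
    """Return the path to a verified .npz cache, regenerating it if needed."""
    npz = cache_dir / f"{name}.npz"
    hdf5 = cache_dir / f"{name}.hdf5"
    hdf5_gz = cache_dir / f"{name}.hdf5.gz"

    # 1. Use the .npz if it exists and matches its recorded hash.
    if verify_hash_metadata(npz):
        return npz

    # 2. Otherwise fall back to the .hdf5, also verified via its sidecar.
    if not verify_hash_metadata(hdf5):
        # 3. Otherwise use the .hdf5.gz; its hash is known ahead of time,
        #    so re-download (hypothetical helper) if missing or incorrect.
        if not hdf5_gz.exists() or file_sha256(hdf5_gz) != expected_gz_sha256:
            download_dataset(hdf5_gz)
        with gzip.open(hdf5_gz, "rb") as src, open(hdf5, "wb") as dst:
            shutil.copyfileobj(src, dst)
        write_hash_metadata(hdf5)

    # 4. Rebuild the .npz from the .hdf5 (hypothetical helper) and record
    #    its hash so the next run can take the fast path.
    build_npz_from_hdf5(hdf5, npz)
    write_hash_metadata(npz)
    return npz
```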

A few changes along these lines are in PR #91.

wiederm commented 2 months ago

This has been addressed in the linked PRs. Closing for now.