choderalab / modelforge

Infrastructure to implement and train NNPs
https://modelforge.readthedocs.io/en/latest/
MIT License

File hashes #84

Closed by chrisiacovella 2 months ago

chrisiacovella commented 3 months ago

When we go to train, we will load up a curated dataset, creating local files that we cache for future use. We need to store and compare the hash of these files to ensure that we are not working with an incorrect or partially generated file. This should be trivial for the gzipped hdf5 files, but for the npz files we generate locally, we will probably need to generate a metadata file after creation that stores the hash, since this hash is not known beforehand.
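
For reference, a minimal sketch of what that hash bookkeeping could look like (not modelforge's actual implementation; the helper names are just for illustration): compute a SHA-256 digest of each cached file, and for locally generated files write the digest into a small JSON sidecar that can be checked on the next run.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _meta_path(data_file: Path) -> Path:
    # Sidecar metadata file, e.g. dataset.npz -> dataset.npz.meta.json
    return data_file.parent / (data_file.name + ".meta.json")


def write_hash_metadata(data_file: Path) -> None:
    """Record the hash of a freshly generated file (e.g. a local .npz cache)."""
    meta = {"filename": data_file.name, "sha256": file_sha256(data_file)}
    _meta_path(data_file).write_text(json.dumps(meta))


def verify_hash_metadata(data_file: Path) -> bool:
    """True only if the file and its sidecar exist and the hashes match."""
    meta_file = _meta_path(data_file)
    if not data_file.exists() or not meta_file.exists():
        return False
    recorded = json.loads(meta_file.read_text()).get("sha256")
    return recorded == file_sha256(data_file)
```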

chrisiacovella commented 2 months ago

The general sequence of the data loader is:

1. Ideally, use the .npz file if it exists and skip all the rest.
2. If the .npz doesn't exist, check whether the .hdf5 file exists.
3. If not, check whether the .hdf5.gz exists.
4. If not, download it.

However, we can't just rely on seeing that a file exists; we need to make sure it is the correct file (see the sketch below).
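
A rough sketch of that cascade, reusing the hashing helpers from the earlier comment. `download_dataset` and `build_npz_from_hdf5` are hypothetical placeholders; the real logic lives in the modelforge dataset code and PR #91. The key point is that every branch verifies a hash, not just file existence.

```python
import gzip
import shutil
from pathlib import Path


def get_cached_dataset(cache_dir: Path, name: str, expected_gz_sha256: str) -> Path:
    """Return the path to a verified .npz cache, regenerating it if needed."""
    npz = cache_dir / f"{name}.npz"
    hdf5 = cache_dir / f"{name}.hdf5"
    hdf5_gz = cache_dir / f"{name}.hdf5.gz"

    # 1. Use the .npz if it exists and matches its recorded hash.
    if verify_hash_metadata(npz):
        return npz

    # 2. Otherwise fall back to the .hdf5, also verified via its sidecar.
    if not verify_hash_metadata(hdf5):
        # 3. Otherwise use the .hdf5.gz; its hash is known ahead of time,
        #    so re-download (hypothetical helper) if missing or incorrect.
        if not hdf5_gz.exists() or file_sha256(hdf5_gz) != expected_gz_sha256:
            download_dataset(hdf5_gz)
        with gzip.open(hdf5_gz, "rb") as src, open(hdf5, "wb") as dst:
            shutil.copyfileobj(src, dst)
        write_hash_metadata(hdf5)

    # 4. Rebuild the .npz from the .hdf5 (hypothetical helper) and record
    #    its hash so the next run can take the fast path.
    build_npz_from_hdf5(hdf5, npz)
    write_hash_metadata(npz)
    return npz
```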

A few changes along these lines are in PR #91.

wiederm commented 2 months ago

This has been addressed in the linked PRs. Closing for now.