choderalab / modelforge

Infrastructure to implement and train NNPs
https://modelforge.readthedocs.io/en/latest/
MIT License
11 stars 4 forks source link

local caching file locking #174

Closed chrisiacovella closed 2 months ago

chrisiacovella commented 3 months ago

Description

As mentioned in issue #173, this PR focuses on implementing file locking for the cached files generated when loading up a dataset (i.e., the gzipped hdf5 file, the unzipped hdf5 file, the npz file). The file locking will prevent two processes from trying to, say, download the same .gz file. I'll note this also solves the problem where the h5py library will just cause the code to exit if the hdf5 file is in use (by using temporary lock files that will wait for the file to be unlocked before proceeding rather than quitting).

This uses the fcntl module in python which provides an interface to the ioctl and fcntl functions on Unix/Linux. This will not work on Windows (which we are not targeting anyway), but I've trying to make these rather modular (i.e., writing functions and a context manager that call this library, rather than using the library directly in the source code) so there is an option to create a drop-in replacement if required.

Additional features in this PR: A 4th .toml file for runtime specific parameters. This will be the main file someone would modify for a new experiment (assuming default parameters for dataset, potential, and training options). This would include things like, directory to save the output to, local caching directory, max number of epochs, cpu or gpu, etc. I think we also need to add in a few other options, such as regenerating the cache and forcing download of the data files. Additionally, we need to also be able to set the seed used in the splitting routine.

Other notes:

The plan is to merge this after #169 , as it will likely be more straight forward to resolve changes in this PR, than merging these changes into #169.

Edit July 5: working on merging with the changes made in #169. This will also look at some of the refactoring in #177

Todos

Notable points that this PR has either accomplished or will accomplish.

Status

chrisiacovella commented 3 months ago

MAC CI is failing due to a dependency. I'll work on revising those tests; several tests are also being skipped on linux with the reason "fails on Mac OS", so I think this should be looked into to make sure the environment skipping is correct.