choderalab / modelforge

Infrastructure to implement and train NNPs
https://modelforge.readthedocs.io/en/latest/
MIT License

Initial structure for datasets #4

Closed wiederm closed 11 months ago

wiederm commented 11 months ago

Description

Provides a dataset implementation that can be used for most of the QM datasets generated via QCArchive and distributed as HDF5 files.

For a specific dataset (e.g. QM9), generating a dataset is as easy as:

```python
factory = DatasetFactory()
prop = QM9Dataset()
dataset = factory.create_dataset(prop)
```

The dataset then provides PyTorch DataLoaders for a default data split (random split, 80:10:10). The split can be stored and reproduced using `dataset.store_split(filename)` and `dataset.restore_split(filename)`.
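The split-persistence idea can be sketched as follows. This is a minimal illustration using `numpy` only; the function names `store_split`/`restore_split` mirror the methods described above, but the actual modelforge implementation may differ.

```python
import numpy as np

def store_split(filename, train_idx, val_idx, test_idx):
    # Persist the index arrays so the exact split can be reloaded later.
    np.savez(filename, train=train_idx, val=val_idx, test=test_idx)

def restore_split(filename):
    # Reload the saved index arrays as (train, val, test).
    data = np.load(filename)
    return data["train"], data["val"], data["test"]
```

Storing indices (rather than the data itself) keeps the file small and makes the split reproducible across runs.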

The `DatasetFactory()` will first check whether the dataset is already cached on disk; if not, it will download the dataset specified in `prop` and generate a local cache. The local cache is an npy file. Padding is performed by the `PadTensors` class.
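Padding is needed because molecules have different atom counts, so per-molecule arrays cannot be stacked directly. A minimal sketch of the idea, in plain `numpy` (the function name `pad_arrays` is hypothetical; the actual `PadTensors` implementation may differ):

```python
import numpy as np

def pad_arrays(arrays, pad_value=0.0):
    # Pad a list of (n_atoms_i, d) arrays to a common first dimension
    # so they can be stacked into a single (batch, max_atoms, d) array.
    max_len = max(a.shape[0] for a in arrays)
    out = np.full((len(arrays), max_len) + arrays[0].shape[1:], pad_value)
    for i, a in enumerate(arrays):
        out[i, : a.shape[0]] = a
    return out
```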

Currently only the `RandomSplittingStrategy` is implemented; it is used by default in the `factory.create_dataset` call, but other splitting strategies (e.g. `TimeSplit`) can be added.
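The default random 80:10:10 split can be sketched as a single shuffle followed by slicing. This is an illustrative standalone version, not the modelforge `RandomSplittingStrategy` itself:

```python
import numpy as np

def random_split(n, fractions=(0.8, 0.1, 0.1), seed=42):
    # Shuffle all indices once, then slice into train/val/test partitions.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

Fixing the seed makes the split deterministic, which complements the store/restore mechanism above.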

Todos

Status

chrisiacovella commented 11 months ago

I think for the `HDF5Dataset` we need an abstract class for `download`, to ensure any derived class defines where to retrieve the data from. I know this is in the downloader class, but I'm not quite sure it makes sense to have it there, or to have a separate download class at all. The code would be simpler if we just had some general functions defined in `utils`, with the dataset-specific behavior implemented in the child classes of `HDF5Dataset`. Basically, I'm not convinced there needs to be a separate downloader class.
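The abstract-class approach suggested here can be sketched with Python's `abc` module. The class bodies below are hypothetical placeholders, not modelforge's actual classes; they only illustrate how an abstract `download` method forces each derived dataset to declare its data source:

```python
from abc import ABC, abstractmethod

class HDF5Dataset(ABC):
    # Base class: every concrete dataset must say where its raw data lives.
    @abstractmethod
    def download(self) -> None:
        """Fetch the raw HDF5 file from the dataset-specific source."""

class QM9Dataset(HDF5Dataset):
    def download(self) -> None:
        # A real implementation would fetch from e.g. a Zenodo record
        # or Google Drive, as discussed below.
        print("downloading QM9 ...")
```

With this design, attempting to instantiate a subclass that forgets to implement `download` fails immediately with a `TypeError`, rather than at first use.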

I think in a future PR we need to work out exactly how we want to control a user's ability to define the source of the dataset (e.g., from Zenodo, Google Drive, or a local source).

I don't think it is necessary to resolve all of this right now, given the volume of other changes in the PR.

chrisiacovella commented 11 months ago

The good news is tests are passing, but they seem to be taking far too long considering we aren't doing much.

I timed the tests locally; these are the top 8 (I'm only reporting things that took more than a minute):

- test_different_scenarios_of_file_availability — 159.45s
- test_file_cache_methods — 159.23s
- test_numpy_dataset_assignment — 83.81s
- test_dataset_dataloaders — 82.80s
- test_data_item_format — 82.56s
- test_dataset_splitting — 82.52s
- test_dataset_generation — 81.30s
- test_file_existence_after_initialization — 80.22s

Not surprisingly, these all include dealing with the dataset in some fashion.

wiederm commented 11 months ago

Let's merge this PR and address all open suggestions in separate PRs

chrisiacovella commented 11 months ago

Did you forget to add transform.py?

```
ModuleNotFoundError: No module named 'modelforge.dataset.transformation'
```