Closed wiederm closed 11 months ago
I think for the HDF5dataset we need an abstract class for download, to ensure any derived class defines where to retrieve its data. I know this is in the downloader class, but I'm not quite sure it makes sense to have it there, or to have a separate download class at all. The code would be simpler if we just had some general functions defined in utils, and the dataset-specific implementation lived in the child class of HDF5dataset. Basically, I'm not convinced there needs to be a separate downloader class.
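A minimal sketch of what I mean (class and method names here are illustrative, not the current code): an abstract base forces every concrete dataset to say where its data comes from, without a separate downloader class.

```python
from abc import ABC, abstractmethod


class HDF5Dataset(ABC):
    """Hypothetical base class: every concrete dataset must define
    where to retrieve its raw HDF5 file."""

    @abstractmethod
    def download(self) -> str:
        """Retrieve the raw HDF5 file and return its local path."""


class QM9Dataset(HDF5Dataset):
    """Illustrative child class; the actual source (zenodo, google
    drive, local path) would be hard-coded or configured here."""

    def download(self) -> str:
        # dataset-specific retrieval logic would live here,
        # possibly calling shared helper functions from utils
        return "qm9_n100.hdf5"


dataset = QM9Dataset()
path = dataset.download()
```

A subclass that forgets to implement `download` cannot even be instantiated, which gives us the guarantee at construction time rather than at first use.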
I think in a future PR we need to work out how exactly we want to control a user's ability to define the source of the dataset (e.g., from zenodo, google drive, a local source).
I don't think it is necessary to resolve all of this right now, given the volume of other changes in the PR.
The good news is the tests are passing, but they seem to be taking way too long, considering we aren't doing all that much.
I timed (locally) the tests and these are the top 8 (I'm only reporting things that took more than a minute):
- test_different_scenarios_of_file_availability -- 159.45s
- test_file_cache_methods -- 159.23s
- test_numpy_dataset_assignment -- 83.81s
- test_dataset_dataloaders -- 82.80s
- test_data_item_format -- 82.56s
- test_dataset_splitting -- 82.52s
- test_dataset_generation -- 81.30s
- test_file_existence_after_initialization -- 80.22s
Not surprisingly, these all include dealing with the dataset in some fashion.
The most significant savings came from using a smaller dataset file. Switching to the 100-record file instead of the 1000-record one cut test_numpy_dataset_assignment down to a few seconds (more than a factor of 20 reduction in time). I uploaded this dataset to the google drive and there was no difference between the local and remote copies. I couldn't figure out how to easily profile download speed in pytest, but I wouldn't be surprised if download speed is higher for smaller files than larger files on the CI nodes. The n100 dataset for testing is below. Obviously, subbing this in would require scaling the test/train/val split numbers down by a factor of 10 (as those were based on 1000 records).
self.test_id = "17oZ07UOxv2fkEmu-d5mLk6aGIuhV0mJ7"
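For concreteness, the default 80:10:10 random split applied to the 100-record file works out to the following partition sizes (just the arithmetic; this is not the project's splitting code):

```python
# 80:10:10 split applied to the n100 test file
# (the existing split numbers were tuned for n=1000)
n_records = 100
ratios = (0.8, 0.1, 0.1)
n_train, n_val, n_test = (int(n_records * r) for r in ratios)
print(n_train, n_val, n_test)  # 80 10 10
```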
Let's merge this PR and address all open suggestions in separate PRs
Did you forget to add transform.py?
ModuleNotFoundError: No module named 'modelforge.dataset.transformation'
Description
Provides a dataset implementation that can be used for most of the QM datasets generated from QCArchive and provided as HDF5 files.
For a specific dataset (e.g. QM9) generating a dataset is as easy as:
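(a hypothetical sketch; the import paths and exact signatures below are assumptions pieced together from the class names in this description)

```python
# hypothetical usage; module paths and call signatures are assumptions
from modelforge.dataset import QM9Dataset, DatasetFactory

factory = DatasetFactory()
dataset = factory.create_dataset(QM9Dataset())  # downloads/caches as needed
```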
and the dataset provides PyTorch `DataLoader`s for a default data split (random split, 80:10:10). The split can be stored and reproduced using `dataset.store_split(filename)` and `dataset.restore_split(filename)`.

The `DatasetFactory()` will first check whether the dataset is already cached on disk; if not, it will download the dataset specified in `prop` and generate a local cache. The local cache is a npy file. Padding is performed by the `PadTensors` class. Currently only the `RandomSplittingStrategy` is implemented, and it will be used by default in the `factory.create_dataset` call, but other splitting strategies (e.g. `TimeSplit`) can be implemented.

Todos

- `DatasetFactory` that generates a dataset with a specified splitting and dataloaders that are ready for use
- `Caching` class
- `Download` class
- `PadTensors`, `SplittingStrategy`, `decompress_gziped_file`
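The kind of splitting strategy described above can be sketched independently of the rest of the code (function name and signature here are illustrative, not the `RandomSplittingStrategy` implementation): shuffle the indices with a fixed seed so the split is reproducible, then cut them by ratio.

```python
import random


def random_split_indices(n, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle dataset indices and cut them into train/val/test partitions.

    A fixed seed makes the split reproducible, which is what
    store_split/restore_split ultimately need to guarantee.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train, val, test = random_split_indices(100)
print(len(train), len(val), len(test))  # 80 10 10
```

A `TimeSplit` would differ only in how the indices are ordered before cutting (by timestamp rather than by shuffling).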
Status