Closed wiederm closed 11 months ago
I think for the HDF5dataset we need an abstract class for download, to ensure any derived class defines where to retrieve its data. I know this is in the downloader class, but I'm not quite sure it makes sense to have it there, or to have a separate download class at all. The code would be simpler if we just had some general functions defined in utils, and the dataset-specific implementation lived in the child class of HDF5dataset. Basically, I'm not convinced there needs to be a separate downloader class.
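A minimal sketch of what I mean (class and method names here are illustrative, not the current code): an abstract base forces every concrete dataset to say where its data comes from, without a separate downloader class.

```python
from abc import ABC, abstractmethod


class HDF5Dataset(ABC):
    """Hypothetical base class: every concrete dataset must define
    where to retrieve its raw HDF5 file."""

    @abstractmethod
    def download(self) -> str:
        """Retrieve the raw HDF5 file and return its local path."""


class QM9Dataset(HDF5Dataset):
    """Illustrative child class; the actual source (zenodo, google
    drive, local path) would be hard-coded or configured here."""

    def download(self) -> str:
        # dataset-specific retrieval logic would live here,
        # possibly calling shared helper functions from utils
        return "qm9_n100.hdf5"


dataset = QM9Dataset()
path = dataset.download()
```

A subclass that forgets to implement `download` cannot even be instantiated, which gives us the guarantee at construction time rather than at first use.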
I think in a future PR we need to work out how exactly we want to control a user's ability to define the source of the dataset (e.g., from zenodo, google drive, a local source).
I don't think it is necessary to resolve all of this right now, given the volume of other changes in the PR.
The good news is the tests are passing, but they seem to be taking way too long, considering we aren't doing all that much.
I timed (locally) the tests and these are the top 8 (I'm only reporting things that took more than a minute):
- test_different_scenarios_of_file_availability -- 159.45s
- test_file_cache_methods -- 159.23s
- test_numpy_dataset_assignment -- 83.81s
- test_dataset_dataloaders -- 82.80s
- test_data_item_format -- 82.56s
- test_dataset_splitting -- 82.52s
- test_dataset_generation -- 81.30s
- test_file_existence_after_initialization -- 80.22s
Not surprisingly, these all include dealing with the dataset in some fashion.
The most significant savings came from using a smaller dataset file. Switching to the 100-record file instead of the 1000-record one cut test_numpy_dataset_assignment down to a few seconds (more than a factor of 20 reduction in time). I uploaded this dataset to the google drive and there was no difference between the local and remote copies. I couldn't figure out how to easily profile download speed in pytest, but I wouldn't be surprised if download speed is higher for smaller files than larger files on the CI nodes. The n100 dataset for testing is below. Obviously, subbing this in would require scaling the test/train/val split numbers down by a factor of 10 (as those were based on 1000 records).
self.test_id = "17oZ07UOxv2fkEmu-d5mLk6aGIuhV0mJ7"
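For concreteness, the default 80:10:10 random split applied to the 100-record file works out to the following partition sizes (just the arithmetic; this is not the project's splitting code):

```python
# 80:10:10 split applied to the n100 test file
# (the existing split numbers were tuned for n=1000)
n_records = 100
ratios = (0.8, 0.1, 0.1)
n_train, n_val, n_test = (int(n_records * r) for r in ratios)
print(n_train, n_val, n_test)  # 80 10 10
```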
Let's merge this PR and address all open suggestions in separate PRs
Did you forget to add transform.py?
ModuleNotFoundError: No module named 'modelforge.dataset.transformation'
Description
Provides a dataset implementation that can be used for most of the QM datasets generated from QCArchive and provided as HDF5 files.
For a specific dataset (e.g. QM9) generating a dataset is as easy as:
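(a hypothetical sketch; the import paths and exact signatures below are assumptions pieced together from the class names in this description)

```python
# hypothetical usage; module paths and call signatures are assumptions
from modelforge.dataset import QM9Dataset, DatasetFactory

factory = DatasetFactory()
dataset = factory.create_dataset(QM9Dataset())  # downloads/caches as needed
```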
and the dataset provides PyTorch `DataLoader`s for a default data split (random split, 80:10:10). The split can be stored and reproduced using `dataset.store_split(filename)` and `dataset.restore_split(filename)`.

The `DatasetFactory()` will first check whether the dataset is already cached on disk; if not, it will download the dataset specified in `prop` and generate a local cache. The local cache is a npy file. Padding is performed by the `PadTensors` class. Currently only the `RandomSplittingStrategy` is implemented, and it will be used by default in the `factory.create_dataset` call, but other splitting strategies (e.g. `TimeSplit`) can be implemented.

Todos

- `DatasetFactory` that generates a dataset with a specified splitting and dataloaders that are ready for use
- `Caching` class
- `Download` class
- `PadTensors`, `SplittingStrategy`, `decompress_gziped_file`
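The kind of splitting strategy described above can be sketched independently of the rest of the code (function name and signature here are illustrative, not the `RandomSplittingStrategy` implementation): shuffle the indices with a fixed seed so the split is reproducible, then cut them by ratio.

```python
import random


def random_split_indices(n, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle dataset indices and cut them into train/val/test partitions.

    A fixed seed makes the split reproducible, which is what
    store_split/restore_split ultimately need to guarantee.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train, val, test = random_split_indices(100)
print(len(train), len(val), len(test))  # 80 10 10
```

A `TimeSplit` would differ only in how the indices are ordered before cutting (by timestamp rather than by shuffling).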
Status