yaml files to define dataset download links/checksums.

chrisiacovella commented 1 month ago

As the title suggests, I modified the datasets so that the url, checksum, filename, etc. are no longer hardcoded; they are now stored in yaml files.

The basic structure of the yaml file (e.g., ani1x):

dataset: ani1x
full_dataset:
  gz_data_file:
    length: 4510287721
    md5: 408cdcf9768ac96a8ae8ade9f078c51b
    name: ani1x_dataset.hdf5.gz
  hdf5_data_file:
    md5: 361b7c4b9a4dfeece70f0fe6a893e76a
    name: ani1x_dataset.hdf5
  processed_data_file:
    md5: null
    name: ani1x_dataset_processed.npz
  url: https://www.dropbox.com/scl/fi/d98h9kt4pl40qeapqzu00/ani1x_dataset.hdf5.gz?rlkey=7q1o8hh9qzbxehsobjurcksit&dl=1
unit_testing_nc_1000:
  gz_data_file:
    length: 1761417
    md5: f47a92bf4791607d9fc92a4cf16cd096
    name: ani1x_dataset_nc_1000.hdf5.gz
  hdf5_data_file:
    md5: 776d38c18f3aa37b00360556cf8d78cc
    name: ani1x_dataset_nc_1000.hdf5
  processed_data_file:
    md5: null
    name: ani1x_dataset_nc_1000_processed.npz
  url: https://www.dropbox.com/scl/fi/26expl20116cqacdk9l1t/ani1x_dataset_ntc_1000.hdf5.gz?rlkey=swciz9dfr7suia6nrsznwbk6i&st=ryqysch3&dl=1

This will be useful as it will make it easier to define a set of different datasets (e.g., limiting other datasets to only the elements in ani2x).

This is basically the same info I had in the datasets before, but presumably we can define any number of files, rather than just the full dataset or the unit testing dataset.
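To illustrate how an entry like the ones above could be used, here is a minimal sketch of an md5 check against one file entry. It is not the code in this PR: `md5_of_file` and `file_matches_entry` are hypothetical helper names, and the entry is assumed to be the mapping you would get after parsing the yaml (e.g., with `yaml.safe_load`). A `null` md5, as in `processed_data_file`, is treated here as "no checksum to compare against".

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_matches_entry(path: Path, entry: dict) -> bool:
    """Check a file on disk against one file entry from the yaml.

    `entry` is assumed to look like {"name": ..., "md5": ...}.
    A missing file never matches; a null md5 (None after parsing)
    means there is nothing to compare, so any existing file matches.
    """
    if not path.is_file():
        return False
    expected_md5 = entry.get("md5")
    if expected_md5 is None:
        return True
    return md5_of_file(path) == expected_md5
```

With something like this, a loader could skip the download whenever the local file already matches the stored checksum.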

For now I've not changed the constructors (e.g., they still accept "for_unit_testing"). I'm going to hold off on changing this until the other PRs are merged, since those change other aspects of the datasets and tests. I think this can become a variable "dataset_type" that takes a string, for example "full_dataset" (the default), "unit_testing", or something like "elements_HCNO" (i.e., a restricted dataset).
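A rough sketch of what the proposed "dataset_type" lookup could look like, assuming the yaml has already been parsed into a dict. The function name `select_dataset_entry` is hypothetical, and `ANI1X_SPEC` is a trimmed copy of the ani1x example (urls and the gz length omitted), not the real file.

```python
from typing import Any, Dict

# Trimmed, hand-written stand-in for the parsed ani1x yaml file.
ANI1X_SPEC: Dict[str, Any] = {
    "dataset": "ani1x",
    "full_dataset": {
        "gz_data_file": {
            "md5": "408cdcf9768ac96a8ae8ade9f078c51b",
            "name": "ani1x_dataset.hdf5.gz",
        },
        "hdf5_data_file": {
            "md5": "361b7c4b9a4dfeece70f0fe6a893e76a",
            "name": "ani1x_dataset.hdf5",
        },
    },
    "unit_testing_nc_1000": {
        "gz_data_file": {
            "md5": "f47a92bf4791607d9fc92a4cf16cd096",
            "name": "ani1x_dataset_nc_1000.hdf5.gz",
        },
        "hdf5_data_file": {
            "md5": "776d38c18f3aa37b00360556cf8d78cc",
            "name": "ani1x_dataset_nc_1000.hdf5",
        },
    },
}


def select_dataset_entry(
    spec: Dict[str, Any], dataset_type: str = "full_dataset"
) -> Dict[str, Any]:
    """Return the block of file definitions for the requested dataset type.

    Every top-level key except "dataset" is treated as a selectable type,
    so adding a new block to the yaml needs no change here.
    """
    available = sorted(key for key in spec if key != "dataset")
    if dataset_type not in available:
        raise ValueError(
            f"Unknown dataset_type {dataset_type!r}; available: {available}"
        )
    return spec[dataset_type]
```

For example, `select_dataset_entry(ANI1X_SPEC, "unit_testing_nc_1000")` returns the unit-testing block, while a type that is not defined in the file (say "elements_HCNO" for ani1x) raises a `ValueError` listing the available options.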

Update: I've decided to extend the same approach to the dataset curation. These yaml files will define the download link and checksum of the source data. These files could/should be stored along with the resulting curated files to better define the source (especially useful in cases where the original dataset source is updated).

Status

[x] Ready to go

chrisiacovella commented 1 month ago

The curation scripts have been updated for the various spice datasets to allow us to filter by element, to create datasets that we will be able to train with ani.

With the tests refactored in a recently merged PR, I will update this PR to get rid of "for_unit_testing" and instead have this be a string that will allow us to toggle between different datasets ("full", "test", and for the spice datasets "full_7element", "test_7element", or some names of that nature).

chrisiacovella commented 1 month ago

I'm going to merge this now. I resolved a qcarchive issue so I was able to fold in the PhAlkEthOH dataset as well.

choderalab / modelforge

yaml files to define dataset download links/checksums. #117
