Closed chrisiacovella closed 1 month ago
The curation scripts for the various SPICE datasets have been updated to allow filtering by element, so that we can create datasets we will be able to train with ANI.
With the tests refactored in a recently merged PR, I will update this PR to get rid of "for_unit_testing" and instead use a string that lets us toggle between different datasets ("full", "test", ... and, for the SPICE datasets, "full_7element", "test_7element", or names of that nature).
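For illustration only, element filtering of this kind could look roughly like the sketch below. The record layout (`atomic_numbers` per molecule) and the function name `filter_by_elements` are assumptions made for the example and are not taken from the actual curation scripts; the allowed set is the seven elements covered by ANI-2x (H, C, N, O, F, S, Cl).

```python
# Hypothetical sketch: keep only records whose atoms all belong to an allowed set.
# The record structure and function name are illustrative assumptions.

# Atomic numbers for H, C, N, O, F, S, Cl (the elements covered by ANI-2x).
ANI2X_ATOMIC_NUMBERS = frozenset({1, 6, 7, 8, 9, 16, 17})


def filter_by_elements(records, allowed=ANI2X_ATOMIC_NUMBERS):
    """Return the records whose atomic numbers are a subset of `allowed`."""
    return [r for r in records if set(r["atomic_numbers"]) <= allowed]


records = [
    {"name": "water", "atomic_numbers": [8, 1, 1]},
    {"name": "bromomethane", "atomic_numbers": [6, 1, 1, 1, 35]},  # Br is excluded
    {"name": "chloromethane", "atomic_numbers": [6, 1, 1, 1, 17]},
]

kept = filter_by_elements(records)
```

A molecule is dropped as soon as it contains a single element outside the allowed set, which is what a restricted "7element" dataset would need.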
I'm going to merge this now. I resolved a qcarchive issue so I was able to fold in the PhAlkEthOH dataset as well.
As the title suggests, I modified the datasets such that, instead of hardcoding in the url, checksum, filename, etc. these are now stored in yaml files.
The basic structure of the yaml file (e.g., ani1x):
This will be useful as it will make it easier to define a set of different datasets (e.g., limiting other datasets to only the elements in ani2x).
This is basically the same info I had in the datasets before, but presumably we can define any number of files, rather than just the full dataset or the unit testing dataset.
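As a rough sketch of how several file variants might sit side by side in one definition, the example below uses a plain dict of the kind a yaml parser would return. Every key name, URL, and checksum here is a placeholder assumption, not the actual layout of the yaml files in this PR.

```python
# Hypothetical sketch of parsed dataset metadata. Key names, URLs, and
# checksums are placeholders, not the real contents of the yaml files.
DATASET_METADATA = {
    "dataset": "ani1x",
    "files": {
        "full": {
            "url": "https://example.org/ani1x_full.hdf5.gz",
            "checksum": "<checksum-of-full-file>",
            "filename": "ani1x_full.hdf5.gz",
        },
        "test": {
            "url": "https://example.org/ani1x_test.hdf5.gz",
            "checksum": "<checksum-of-test-file>",
            "filename": "ani1x_test.hdf5.gz",
        },
    },
}


def get_file_info(metadata, variant):
    """Look up the url/checksum/filename entry for one named variant."""
    try:
        return metadata["files"][variant]
    except KeyError:
        available = ", ".join(sorted(metadata["files"]))
        raise ValueError(
            f"Unknown variant {variant!r}; available: {available}"
        ) from None


info = get_file_info(DATASET_METADATA, "test")
```

Adding a restricted dataset would then only mean adding another entry under `files`, with no change to the dataset class itself.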
For now I've not changed the constructors (e.g., they still accept "for_unit_testing"). I was going to hold off on changing this until the other PRs are merged, as those are changing other aspects of the datasets and tests. I think this can be changed to a variable "dataset_type" that takes a string. For example, this could be "full_dataset" (default), "unit_testing", or something like "elements_HCNO" (i.e., a restricted dataset).
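One possible way to make that transition, sketched under assumptions: a small helper that maps the old boolean onto the proposed string and validates it. The helper name `resolve_dataset_type` and the choice to let the legacy flag take precedence are illustrative, not part of this PR; the three string values are the ones suggested above.

```python
# Hypothetical sketch: accept the proposed "dataset_type" string while still
# honoring the legacy "for_unit_testing" boolean during a transition period.
VALID_DATASET_TYPES = ("full_dataset", "unit_testing", "elements_HCNO")


def resolve_dataset_type(dataset_type="full_dataset", for_unit_testing=None):
    """Return a validated dataset type string.

    If the legacy boolean is passed explicitly, it takes precedence
    (an assumption made for this sketch).
    """
    if for_unit_testing is not None:
        dataset_type = "unit_testing" if for_unit_testing else "full_dataset"
    if dataset_type not in VALID_DATASET_TYPES:
        raise ValueError(
            f"dataset_type must be one of {VALID_DATASET_TYPES}, got {dataset_type!r}"
        )
    return dataset_type


default_type = resolve_dataset_type()
legacy_type = resolve_dataset_type(for_unit_testing=True)
restricted_type = resolve_dataset_type("elements_HCNO")
```

A constructor could call such a helper once and use the result as the key into the yaml-defined file list.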
Update: I've decided to extend the same approach to the dataset curation. These yaml files will define the download link and checksum. They could/should be stored along with the resulting curated files to better document the source (especially useful in cases where the original dataset source is later updated).
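A minimal sketch of checking a downloaded file against the checksum recorded in such a yaml file is shown below. The hash algorithm is an assumption (MD5 is used as the default here, since the PR does not say which one is recorded), and the function names are illustrative.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Compute the hex digest of a file, reading it in chunks.

    The default algorithm is an assumption for this sketch.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_checksum, algorithm="md5"):
    """Return True if the file at `path` matches the recorded checksum."""
    return file_checksum(path, algorithm) == expected_checksum.lower()
```

If the upstream source changes, the recorded checksum no longer matches, which makes it clear that a curated file was built from an older version of the data.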