Reintroduce dataset collections

CloudyOverhead commented 3 years ago

Dataset collections can be used to mix datasets of differing complexity. I see potential in having a more general class for overseeing dataset management: not just combining datasets, but also handling variations over a dataset. For instance, the --noise argument reintroduced in PR #56 is not ideal in my opinion: I constantly fear forgetting to add --noise when launching a training run. We could make this a dataset variant (and always have the dataset train with --noise on), but we would then need to turn it off for testing, which somewhat defeats the purpose of having the same dataset handle training, validation and testing data.

I'm unsure how this could be implemented, but as food for thought, we could use the fact that training, validation and testing datasets are basically collections of variations over a dataset. I think we could have a DatasetCatalog class examine Dataset classes and their children, and have child classes carry tags that tell them apart. DatasetCatalog's __getitem__ method could take tuples as inputs and return collections.
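To make the idea above a bit more concrete, here is a minimal sketch of a catalog that discovers child classes and returns collections from tuple keys. Everything in it is an assumption for illustration: the `Dataset` stand-in, the `Dataset1D` and `Dataset1DNoisy` children, the `tags` class attribute and the `with_tags` helper are hypothetical and do not reflect GeoFlow's actual classes.

```python
class Dataset:
    """Hypothetical stand-in for the dataset base class."""
    tags = frozenset()


class Dataset1D(Dataset):
    """Hypothetical child dataset."""
    tags = frozenset({"1d"})


class Dataset1DNoisy(Dataset1D):
    """Hypothetical variant of Dataset1D that always trains with noise."""
    tags = frozenset({"1d", "noise"})


def iter_subclasses(cls):
    """Yield every direct and indirect subclass of `cls`."""
    for child in cls.__subclasses__():
        yield child
        yield from iter_subclasses(child)


class DatasetCatalog:
    """Examine a base class and index its children by class name."""

    def __init__(self, base):
        self._classes = {c.__name__: c for c in iter_subclasses(base)}

    def __getitem__(self, key):
        # A tuple key returns a collection (here, a plain list of instances).
        if isinstance(key, tuple):
            return [self[name] for name in key]
        return self._classes[key]()

    def with_tags(self, *tags):
        """Return the classes that carry all of the requested tags."""
        wanted = set(tags)
        return [c for c in self._classes.values() if wanted <= c.tags]
```

With this sketch, `DatasetCatalog(Dataset)["Dataset1D", "Dataset1DNoisy"]` would return a two-element collection, and `with_tags("noise")` would select only the noisy variant. A real collection would probably be its own class rather than a list.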

I would appreciate any opinion on the matter, @gfabieno, @jadesc and @jefbutar! Thank you in advance! 🚀

CloudyOverhead commented 3 years ago

Having a dataset catalog would allow generalizing parse_args (introduced in #56, 199287d) to facilitate usage in private projects. DatasetCatalog could have a register method that registers datasets, parents and tags. I'm taking my inspiration mostly from Detectron2.
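As a rough sketch of what registration could look like, and how it might feed argument parsing, consider the following. The `register` signature, the `children` and `names` helpers and the `build_parser` function are all hypothetical; this does not reproduce GeoFlow's `parse_args` or Detectron2's catalog API, it only borrows the general idea of registering factories by name.

```python
import argparse


class DatasetCatalog:
    """Hypothetical registry mapping names to dataset factories."""

    def __init__(self):
        self._entries = {}

    def register(self, name, factory, parent=None, tags=()):
        """Register a dataset factory along with its parent and tags."""
        if name in self._entries:
            raise KeyError(f"{name!r} is already registered")
        if parent is not None and parent not in self._entries:
            raise KeyError(f"unknown parent {parent!r}")
        self._entries[name] = {
            "factory": factory,
            "parent": parent,
            "tags": frozenset(tags),
        }

    def names(self):
        """Return all registered names in sorted order."""
        return sorted(self._entries)

    def children(self, parent):
        """Return the names registered directly under `parent`."""
        return sorted(
            name for name, entry in self._entries.items()
            if entry["parent"] == parent
        )

    def get(self, name):
        """Build the dataset registered under `name`."""
        return self._entries[name]["factory"]()


def build_parser(catalog):
    """Build a parser whose --dataset choices come from the catalog."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", choices=catalog.names(), required=True)
    return parser
```

A private project could then register its own datasets on the same catalog before calling the parser, without mirroring the repository structure.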

CloudyOverhead commented 3 years ago

Catalogs could also be implemented for architectures and hyperparameters. Both catalogs could share a common class for registration and tagging, although I see no point in having collections of networks (except for ensemble learning, but drawing a parallel with dataset collections would be a stretch, I think). One of the rationales behind having NN catalogs is that fetching a specific network in a private project requires having the same repository structure as GeoFlow (as is the case for datasets). Also, this could facilitate managing multiple variations on pairs of NNs and hyperparameters: I have 2 differing networks and 3 sets of hyperparameters for my own project. This is also food for thought, as I think this is less necessary and less straightforward to implement than dataset catalogs. Model zoos (TensorFlow Model Garden, Detectron2 Model Zoo) do not really rely on catalogs and use hyperparameter classes by themselves.
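A shared registration class could be sketched along these lines. The `Catalog` class and the names `NetA`, `NetB`, `HP1`, `HP2` and `HP3` are hypothetical placeholders, chosen only to mirror the 2 networks and 3 hyperparameter sets mentioned above.

```python
from itertools import product


class Catalog:
    """Hypothetical shared base for registration and tagging."""

    def __init__(self):
        self._entries = {}
        self._tags = {}

    def register(self, name, obj, tags=()):
        """Store `obj` under `name` with an optional set of tags."""
        self._entries[name] = obj
        self._tags[name] = frozenset(tags)

    def __getitem__(self, name):
        return self._entries[name]

    def names(self, tag=None):
        """Return names in registration order, optionally filtered by tag."""
        return [
            name for name in self._entries
            if tag is None or tag in self._tags[name]
        ]


# Placeholder entries: real code would register network classes
# and hyperparameter objects instead of `object` and empty dicts.
networks = Catalog()
networks.register("NetA", object)
networks.register("NetB", object)

hyperparameters = Catalog()
for hp_name in ("HP1", "HP2", "HP3"):
    hyperparameters.register(hp_name, {})

# Every (network, hyperparameters) pairing: 2 x 3 = 6 combinations.
pairs = list(product(networks.names(), hyperparameters.names()))
```

Keeping the pairing outside the catalogs, as done here with `itertools.product`, avoids needing a notion of "collections of networks" while still enumerating all variations.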

gfabieno / GeoFlow

Reintroduce dataset collections #57