Streamlining WaveformDataset

stephengreen commented 3 years ago

`WaveformDataset`

I am still not entirely happy with the way the WaveformDataset works, basically because it is used for several unrelated tasks depending on context. This is reflected in the fact that the __init__ function has several optional arguments (a dataset file, a prior, a waveform generator, and a transform). So sometimes one uses a waveform generator to build a dataset, and other times loads a waveform dataset from file and applies transforms to the waveforms. The class methods also seem to fall into several groups depending on the use case of the class, and they do not really interact with each other across use case. This suggests that WaveformDataset should either (1) be split into two classes (one for generating a dataset, one for loading), or (2) slimmed down into one smaller class (loading) and a set of functions for generating the dataset.

Based on how the "tutorial" for generating waveforms is progressing, option (2) seems most natural. Many of the tutorial functions merely wrap WaveformDataset methods, which in turn are only called once by these functions. So it would make sense to fully move this method functionality into the tutorial functions. One would then be left with a much smaller WaveformDataset class, which would only be used for loading data and feeding it to the dataloader. I think this would really streamline these aspects of the code.

Example

The method generate_dataset is called twice in the code: in tutorials/02_gwpe/generate_waveforms.py: generate_waveforms and in gw.transforms.get_parameter_list. The latter is only used to obtain a list of parameters, not to actually produce a dataset of waveforms. So this functionality could easily be moved directly into generate_waveforms in the tutorial (which already works directly with the waveform generator and prior).

Minor comments:

We might want to provide a way for a user to interactively generate a waveform dataset, i.e., consolidate all of the various scripts in the tutorial via a single function to be run on one machine. (This would be a third way of running it.)

mpuerrer commented 3 years ago

I will have to think a bit more about all the points we mention, but a priori I do not understand why certain methods need to be stripped out of the class. As I see it, a waveform dataset class takes care of operations one wants to do one such a data set, and natural operations are to create it, to save it or to load it, and to provide access to its elements.

The particular inputs I gets are needed for certain tasks. If you prefer those inputs could be moved from init to the particular method that needs it and to only pass shared state variables to the constructor.

stephengreen commented 3 years ago

It's just a general preference for smaller classes, each of which is simpler. In the case of WaveformDataset, the class attributes and methods fall into two distinct categories, depending on the use case of the class:

Dataset construction:

_prior
_waveform_generator
domain (also contained in waveform_generator)
sample_prior() (wrapper method)
_generate_polarizations_task_fun() (wrapper method)
generate_dataset()
_write_dataframe_to_hdf5()
dataframe_to_structured_array()
get_polarizations()
get_compressed_polarizations()
save()

Training:

transform
__getitem__()
read_parameter_samples()
load()

Common:

_parameter_samples
_waveform_polarizations

There are no common methods among the two use cases. Among attributes, _prior and _waveform_generator won't be loaded at train time, and transform won't be loaded at construction time. This indicates to me these attributes (and associated methods) really belong in separate classes: one for constructing, one for training.

Regarding dataset construction, I think that this is better suited to the procedural programming style that's in the tutorial (and is also used to construct transforms) rather using a class. This is because at this stage the dataset is not being treated so much as an object that we want to repeatedly access and manipulate (whereas this is the case for the prior and waveform generator, and the dataset when used in training). Since the construction methods are basically wrapped by functions in the tutorial (e.g., generate_waveforms() in generate_waveforms.py) and only used in these places, a refactoring would be to simply move the methods into these functions. The tutorial also saves directly a number of intermediate data files, as well as the final HDF5 file, essentially duplicating the save() method of WaveformDataset.

With these changes the tutorial code would work directly with the waveform generator and the prior instances.

mpuerrer commented 3 years ago

Ok, I'll take a look.

mpuerrer commented 3 years ago

This is addressed in https://github.com/dingo-gw/dingo-devel/pull/17.

dingo-gw / dingo