Closed stephengreen closed 3 years ago
I will have to think a bit more about all the points we mention, but a priori I do not understand why certain methods need to be stripped out of the class. As I see it, a waveform dataset class takes care of operations one wants to do one such a data set, and natural operations are to create it, to save it or to load it, and to provide access to its elements.
The particular inputs I gets are needed for certain tasks. If you prefer those inputs could be moved from init to the particular method that needs it and to only pass shared state variables to the constructor.
It's just a general preference for smaller classes, each of which is simpler. In the case of WaveformDataset
, the class attributes and methods fall into two distinct categories, depending on the use case of the class:
_prior
_waveform_generator
domain
(also contained in waveform_generator
)sample_prior()
(wrapper method)_generate_polarizations_task_fun()
(wrapper method)generate_dataset()
_write_dataframe_to_hdf5()
dataframe_to_structured_array()
get_polarizations()
get_compressed_polarizations()
save()
transform
__getitem__()
read_parameter_samples()
load()
_parameter_samples
_waveform_polarizations
There are no common methods among the two use cases. Among attributes, _prior
and _waveform_generator
won't be loaded at train time, and transform
won't be loaded at construction time. This indicates to me these attributes (and associated methods) really belong in separate classes: one for constructing, one for training.
Regarding dataset construction, I think that this is better suited to the procedural programming style that's in the tutorial (and is also used to construct transforms) rather using a class. This is because at this stage the dataset is not being treated so much as an object that we want to repeatedly access and manipulate (whereas this is the case for the prior and waveform generator, and the dataset when used in training). Since the construction methods are basically wrapped by functions in the tutorial (e.g., generate_waveforms()
in generate_waveforms.py
) and only used in these places, a refactoring would be to simply move the methods into these functions. The tutorial also saves directly a number of intermediate data files, as well as the final HDF5 file, essentially duplicating the save()
method of WaveformDataset
.
With these changes the tutorial code would work directly with the waveform generator and the prior instances.
Ok, I'll take a look.
This is addressed in https://github.com/dingo-gw/dingo-devel/pull/17.
WaveformDataset
I am still not entirely happy with the way the
WaveformDataset
works, basically because it is used for several unrelated tasks depending on context. This is reflected in the fact that the__init__
function has several optional arguments (a dataset file, a prior, a waveform generator, and a transform). So sometimes one uses a waveform generator to build a dataset, and other times loads a waveform dataset from file and applies transforms to the waveforms. The class methods also seem to fall into several groups depending on the use case of the class, and they do not really interact with each other across use case. This suggests thatWaveformDataset
should either (1) be split into two classes (one for generating a dataset, one for loading), or (2) slimmed down into one smaller class (loading) and a set of functions for generating the dataset.Based on how the "tutorial" for generating waveforms is progressing, option (2) seems most natural. Many of the tutorial functions merely wrap
WaveformDataset
methods, which in turn are only called once by these functions. So it would make sense to fully move this method functionality into the tutorial functions. One would then be left with a much smallerWaveformDataset
class, which would only be used for loading data and feeding it to the dataloader. I think this would really streamline these aspects of the code.Example
The method
generate_dataset
is called twice in the code: intutorials/02_gwpe/generate_waveforms.py: generate_waveforms
and ingw.transforms.get_parameter_list
. The latter is only used to obtain a list of parameters, not to actually produce a dataset of waveforms. So this functionality could easily be moved directly intogenerate_waveforms
in the tutorial (which already works directly with the waveform generator and prior).Minor comments: