Closed stephengreen closed 2 years ago
I implemented most requirements for the `UniformFrequencyDomain`.

- `time_translate_data` method: it currently only supports complex numpy arrays with a single time shift across all channels, since this is what we need for detector projections. We will later also need an efficient method, `time_translate_batch`, that applies individual time shifts to the strains in different detectors, across an entire batch, on data where the real and imaginary parts are split into different channels. This needs to be implemented very efficiently, since we call it many times during the GNPE iterations. I know how to do it, but I am leaving it for later since it involves a few subtleties.
- When calling `time_translate_data`, it will automatically figure out whether the translation is to be performed on truncated or original data.

The other issues are mostly style related. I will have a look at them later, but for now I am focusing on what's required to get the prototype running.
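For reference, here is a minimal sketch of the two time-translation variants discussed above. It is not the code in this PR: the function signatures, the array layout `(batch, detectors, channels, n_freq)`, and the sign convention `exp(-2πi f dt)` are all assumptions made for illustration.

```python
import numpy as np


def time_translate_data(data, sample_frequencies, dt):
    """Shift complex frequency-domain data by dt seconds (one shift for all channels)."""
    # Assumed sign convention: h(f) -> h(f) * exp(-2*pi*i*f*dt).
    return data * np.exp(-2j * np.pi * sample_frequencies * dt)


def time_translate_batch(data, sample_frequencies, dt):
    """Hypothetical batched variant for real/imaginary-split data.

    data: real array of shape (batch, detectors, channels, n_freq); channel 0
          holds the real part, channel 1 the imaginary part, and any further
          channels are passed through unchanged.
    dt:   array of shape (batch, detectors), one time shift per strain.
    """
    # Phase per (batch, detector, frequency bin).
    phase = -2 * np.pi * dt[..., None] * sample_frequencies
    cos, sin = np.cos(phase), np.sin(phase)
    out = data.copy()
    re, im = data[:, :, 0], data[:, :, 1]
    # (re + i*im) * (cos + i*sin), written out in real arithmetic.
    out[:, :, 0] = re * cos - im * sin
    out[:, :, 1] = re * sin + im * cos
    return out
```

Writing the complex multiplication out in real arithmetic avoids converting to complex arrays and back inside the GNPE loop.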
I also made a few changes, such as moving `build_domains` to `domains.py` and adding the `domain_dict`, which allows for recovery of the domain via the `build_domain` function.
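To illustrate the round trip, here is a rough sketch of how a `domain_dict` could be used to rebuild a domain. The dictionary keys, the constructor arguments, and the dataclass layout are assumptions for this example; only the names `domain_dict`, `build_domain`, and the type string "uFD" come from this thread.

```python
from dataclasses import dataclass, asdict


@dataclass
class UniformFrequencyDomain:
    # Assumed constructor arguments, for illustration only.
    f_min: float
    f_max: float
    delta_f: float

    @property
    def domain_dict(self):
        # Everything needed to rebuild this domain later (keys are assumed).
        return {"type": "uFD", **asdict(self)}


def build_domain(domain_dict):
    """Rebuild a domain object from the dictionary it exported."""
    kwargs = dict(domain_dict)  # copy, so the caller's dict is not modified
    domain_type = kwargs.pop("type")
    if domain_type == "uFD":
        return UniformFrequencyDomain(**kwargs)
    raise ValueError(f"Unknown domain type: {domain_type!r}")
```

Storing the dictionary alongside a dataset would let the loader reconstruct the exact domain the data was generated on.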
For the data truncation, the only use case I can think of is when we have to generate EOB waveforms starting from much lower frequency than is ultimately desired. Is there another use case you have in mind?
We use it all the time. For IMRPhenom, we save the waveforms with frequencies in the range [0, 1024], but only use [20, 1024] for training. The truncation method takes care of that. In fact, we can even apply it to the compression Vh matrix, such that the decompressed output is already truncated.
Generally we want to be able to change the frequency range in `train_config.yaml` without having to regenerate an expensive dataset.
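As a small illustration of the point about the Vh matrix, the sketch below checks that truncating the basis before decompression gives the same result as truncating the decompressed waveforms afterwards. The grid spacing, the basis shape `(n_basis, n_freq)`, and decompression as `coeffs @ Vh` are assumptions for this example, not necessarily how the code base does it.

```python
import numpy as np

f_min, f_max, delta_f = 20.0, 1024.0, 0.125  # delta_f is an illustrative value
n_freq = int(round(f_max / delta_f)) + 1     # stored grid covers [0, f_max]


def truncate(data, f_min, delta_f):
    """Drop all bins below f_min along the last (frequency) axis."""
    lo = int(round(f_min / delta_f))
    return data[..., lo:]


rng = np.random.default_rng(0)
n_basis = 10
# Stand-in for the compression basis; random numbers, not a real SVD.
Vh = rng.normal(size=(n_basis, n_freq)) + 1j * rng.normal(size=(n_basis, n_freq))
coeffs = rng.normal(size=(5, n_basis))

# Decompress on the full grid, then truncate ...
full_then_truncated = truncate(coeffs @ Vh, f_min, delta_f)
# ... or truncate the basis once, so the decompressed output is already truncated.
truncated_directly = coeffs @ truncate(Vh, f_min, delta_f)
```

Because truncation only selects columns, it commutes with the matrix product, so the basis can be truncated once at load time.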
Looking at the code, I believe that the domain truncation could be simplified if the frequencies between 0 and `f_min` are chopped off at the very end of the transform sequence, rather than the beginning. This is how it was done in the old code:
The current approach requires significant duplication:

- `UniformFrequencyDomain` has `len`/`len_truncated`, `sample_frequencies`/`sample_frequencies_truncated`, as well as two cases in `time_translate_data`.
- [...] `ASDDataset` and the `WaveformDataset`. This typically takes the form of a `truncate_dataset_domain` method.
- [...] `time_translate_data`.

Moving truncation to the final preprocessing step will add about 2% to preprocessing costs (since we have to carry around frequencies below `f_min`), but the code would be simpler and more maintainable (since we won't have to think about whether we are working with truncated or non-truncated data), and it would be consistent with standard assumptions in LIGO software (e.g., waveform generation routines).
The above discussion only applies to frequencies between 0 and `f_min`. We can still maintain the capability to reduce the frequency range relative to the dataset frequency range by specifying a new `f_min` and/or `f_max`. This would be implemented via an argument to `WaveformDataset.__init__` that modifies the dataset while loading it:

- For a new `f_min`, the dataset (or SVD matrix) is zeroed below the new `f_min`.
- For a new `f_max`, the dataset (or SVD matrix) is chopped off above the new `f_max`.

This way, though, we can always assume frequencies start at 0.
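A minimal sketch of the two rules above, assuming the last array axis is a uniform frequency grid starting at 0 Hz. The helper name `adjust_frequency_range` and its signature are hypothetical; the real change would live behind an argument to `WaveformDataset.__init__`.

```python
import numpy as np


def adjust_frequency_range(data, delta_f, f_min=None, f_max=None):
    """Return a copy with bins below f_min zeroed and bins above f_max dropped.

    Assumes the last axis of `data` is a uniform grid 0, delta_f, 2*delta_f, ...
    The same function could be applied to the waveforms or to the SVD matrix.
    """
    data = np.array(data, copy=True)  # never modify the stored dataset in place
    if f_max is not None:
        hi = int(round(f_max / delta_f)) + 1  # keep the bin at f_max itself
        data = data[..., :hi]
    if f_min is not None:
        lo = int(round(f_min / delta_f))
        data[..., :lo] = 0.0  # zero instead of chop, so the grid still starts at 0
    return data
```

Zeroing below `f_min` keeps the index-to-frequency mapping unchanged, which is what lets the rest of the pipeline assume the grid starts at 0.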
This sounds like a good solution to me. It's important that the code is easy to understand, and the current code seems more complicated than it needs to be. I'd choose a 2% increase in computational cost for significantly easier code in a heartbeat. That's just my opinion and Max should comment as well.
Just a few suggestions for how we implement domains, the first one being the most significant:
- Implement a `time_translate` method within each domain. That way the calling function does not have to know how to implement this for each domain. (It could get complicated for `TimeDomain`, for instance.) This method should take an array of strain data and an amount to time translate by. Time shifting gets called at several points throughout the code (waveform generation, detector projection, GNPE inference), so it would be good to implement it just once.
- Rename `UniformFrequencyDomain` to `FrequencyDomain` and change the string "uFD" to "FD". A uniform frequency domain is standard, so no sense confusing people. We can later introduce `NonuniformFrequencyDomain`.
- `noise_std` does belong with the domains, because it depends only on the domain. Maybe it should be renamed `white_noise_std`, since it is the standard deviation for white noise in each bin. For `NonuniformFrequencyDomain` this could get more complicated, as it would be frequency-dependent, so we need to think carefully about how to implement that.
- `window_factor` should maybe be moved out of `FrequencyDomain`, since this depends on how we take our FFTs when we estimate the noise PSD, and this is not necessarily known when we use the domain to build a dataset of waveforms. However, the best place for this is not obvious; maybe it belongs with the noise, which we haven't really dealt with yet. Any thoughts?

In general the only domain we care about at this point is `FrequencyDomain`, and later we almost certainly also want `NonuniformFrequencyDomain`. It's okay to leave `TimeDomain` not fully implemented throughout for now.
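To make the first suggestion concrete, here is a rough sketch of a domain interface in which each domain owns its `time_translate` implementation. The base class `Domain`, the constructor arguments, the helper `shift_strains`, and the phase sign convention are assumptions for illustration, not the existing code.

```python
from abc import ABC, abstractmethod

import numpy as np


class Domain(ABC):
    """Hypothetical base class: every domain knows how to time-shift its own data."""

    @abstractmethod
    def time_translate(self, data, dt):
        """Return `data` shifted by `dt` seconds."""


class FrequencyDomain(Domain):
    def __init__(self, f_max, delta_f):
        self.f_max = f_max
        self.delta_f = delta_f
        # Uniform grid from 0 to f_max inclusive.
        self.sample_frequencies = np.arange(int(round(f_max / delta_f)) + 1) * delta_f

    def time_translate(self, data, dt):
        # Assumed sign convention: h(f) -> h(f) * exp(-2*pi*i*f*dt).
        return data * np.exp(-2j * np.pi * self.sample_frequencies * dt)


def shift_strains(domain, strains, dt_by_detector):
    """Caller code: works for any Domain without knowing how the shift is done."""
    return {
        ifo: domain.time_translate(h, dt_by_detector[ifo])
        for ifo, h in strains.items()
    }
```

With this layout, waveform generation, detector projection, and GNPE inference could all call `domain.time_translate`, and a later `NonuniformFrequencyDomain` or `TimeDomain` would only need to provide its own method.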