Domains - Githubissues

stephengreen commented 2 years ago

Just a few suggestions for how we implement domains, the first one being the most significant:

Implement a time_translate method within each domain. That way the calling function does not have to know how to implement this for each domain. (It could get complicated for TimeDomain, for instance.) This method should take an array of strain data and an amount to time translate by. Time shifting gets called at several points throughout the code (waveform generation, detector projection, GNPE inference) so it would be good to implement it just once.
Rename UniformFrequencyDomain to FrequencyDomain and change the string "uFD" to "FD". A uniform frequency domain is standard, so no sense confusing people. We can later introduce NonuniformFrequencyDomain.
I think noise_std does belong with the domains, because it depends only on the domain. Maybe it should be renamed white_noise_std, since it is the standard deviation for white noise in each bin. For NonuniformFrequencyDomain this could get more complicated, as it would be frequency-dependent, so we need to think carefully about how to implement that.
window_factor should maybe be moved out of FrequencyDomain, since it depends on how we take our FFTs when we estimate the noise PSD, and this is not necessarily known when we use the domain to build a dataset of waveforms. However, the best place for it is not obvious; maybe it belongs with the noise, which we haven't really dealt with yet. Any thoughts?

In general the only domain we care about at this point is FrequencyDomain, and later we almost certainly also want NonuniformFrequencyDomain. It's okay to leave TimeDomain not fully implemented throughout for now.
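As a rough illustration of the first suggestion, here is a minimal sketch of a time_translate method on a uniform frequency domain. It assumes the usual Fourier shift convention, in which translating by dt multiplies each bin by exp(-2πi f dt). The constructor arguments, the sign convention, and the restriction to complex numpy arrays are assumptions made for the example, not a description of the actual implementation.

```python
import numpy as np


class FrequencyDomain:
    """Hypothetical uniform frequency domain on [0, f_max] with spacing delta_f."""

    def __init__(self, f_max, delta_f):
        self.f_max = f_max
        self.delta_f = delta_f
        # Bins at 0, delta_f, ..., f_max.
        self.sample_frequencies = np.arange(0.0, f_max + delta_f / 2, delta_f)

    def __len__(self):
        return len(self.sample_frequencies)

    def time_translate(self, data, dt):
        """Shift complex frequency-domain strain by dt seconds.

        data: complex array whose last axis has length len(self).
        dt: scalar time shift (assumed convention: positive dt delays the signal).
        """
        data = np.asarray(data)
        if data.shape[-1] != len(self):
            raise ValueError("last axis of data must match the domain length")
        # Fourier shift theorem: a time shift is a frequency-dependent phase.
        phase = np.exp(-2j * np.pi * self.sample_frequencies * dt)
        return data * phase


# Usage: callers never need to know how the shift is implemented.
domain = FrequencyDomain(f_max=1024.0, delta_f=0.125)
strain = np.ones(len(domain), dtype=complex)
shifted = domain.time_translate(strain, dt=0.01)
```

With this layout, waveform generation, detector projection, and GNPE inference could all call domain.time_translate without branching on the domain type.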

max-dax commented 2 years ago

I implemented most of the requirements for the UniformFrequencyDomain.

I added the time_translate_data method. It currently only supports complex numpy arrays with a single time shift across all channels, since this is what we need for detector projections. We will later also need an efficient method time_translate_batch that applies individual time shifts to the strains in different detectors, across an entire batch, on data where the real and imaginary parts are split into different channels. This needs to be implemented very efficiently, since we call it many times during the GNPE iterations. I know how to do it, but I am leaving it for later since it involves a few subtleties.
I also built infrastructure to truncate the data. This is required if we want to restrict the frequency range relative to the saved waveform dataset. In the research code, this has always been done by hand; I have now implemented a method for it. It is fully compatible with time_translate_data, which automatically figures out whether the translation is to be performed on truncated or original data.
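For reference, the core of a batched version could look roughly like the sketch below. It is written with numpy for brevity and ignores the subtleties mentioned above (for example truncated data or extra channels). The array layout (batch, detectors, 2, n_freq), the function signature, and the sign convention are assumptions for illustration only.

```python
import numpy as np


def time_translate_batch(data, dt, sample_frequencies):
    """Apply one time shift per (batch element, detector).

    data: real array of shape (batch, detectors, 2, n_freq), where index 0/1
        of the third axis holds the real/imaginary part (assumed layout).
    dt: array of shape (batch, detectors) with shifts in seconds.
    sample_frequencies: array of shape (n_freq,).
    """
    # Phase angle 2*pi*f*dt, broadcast to shape (batch, detectors, n_freq).
    angle = 2 * np.pi * dt[..., None] * sample_frequencies
    cos, sin = np.cos(angle), np.sin(angle)
    re, im = data[:, :, 0], data[:, :, 1]
    out = np.empty_like(data)
    # (re + i*im) * (cos - i*sin), written out without complex numbers.
    out[:, :, 0] = re * cos + im * sin
    out[:, :, 1] = im * cos - re * sin
    return out
```

Working directly on the split real and imaginary channels avoids converting to a complex array and back on every call, which matters when the function runs once per GNPE iteration.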

The other issues are mostly style-related. I will have a look at them later, but for now I am focusing on what's required to get the prototype running.

max-dax commented 2 years ago

I also made a few changes such as moving build_domains to domains.py, and adding the domain_dict which allows for recovery of the domain via the build_domain function.

stephengreen commented 2 years ago

For the data truncation, the only use case I can think of is when we have to generate EOB waveforms starting from much lower frequency than is ultimately desired. Is there another use case you have in mind?

max-dax commented 2 years ago

We use it all the time. For IMRPhenom, we save the waveforms with frequencies in the range [0, 1024], but only use [20, 1024] for training. The truncation method takes care of that. In fact, we can even apply it to the compression Vh matrix, such that the decompressed output is already truncated. Generally we want to be able to change the frequency range in train_config.yaml without having to regenerate an expensive dataset.
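A toy sketch of the idea, assuming a uniform grid starting at 0 and a made-up delta_f; the helper name truncate and the random stand-ins for Vh and the compressed coefficients are purely illustrative:

```python
import numpy as np

f_max, delta_f = 1024.0, 0.25  # stored range [0, 1024]; delta_f is an assumed value
f_min_train = 20.0             # lower bound used for training

sample_frequencies = np.arange(0.0, f_max + delta_f / 2, delta_f)
idx_min = int(round(f_min_train / delta_f))  # index of the first bin that is kept


def truncate(array):
    """Drop bins below f_min_train along the last (frequency) axis."""
    return array[..., idx_min:]


# Toy stand-ins: a basis Vh of shape (n_basis, n_freq) and compressed coefficients.
rng = np.random.default_rng(0)
Vh = rng.normal(size=(10, len(sample_frequencies)))
coefficients = rng.normal(size=(5, 10))

# Truncating Vh once means decompression already yields truncated waveforms.
waveforms_truncated = coefficients @ truncate(Vh)
```

Because truncation only slices the frequency axis, it commutes with the linear decompression step, so it can be applied to Vh once instead of to every decompressed waveform.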

stephengreen commented 2 years ago

Looking at the code, I believe that the domain truncation could be simplified if the frequencies between 0 and f_min are chopped off at the very end of the transform sequence, rather than the beginning. This is how it was done in the old code:

https://github.com/stephengreen/lfi-gw/blob/f4a8aceb80965eb2ad8bf59b1499b93a3c7b9194/lfigw/waveform_generator.py#L1598-L1603

The current approach requires significant duplication:

UniformFrequencyDomain has len/len_truncated, sample_frequencies/sample_frequencies_truncated, as well as two cases in time_translate_data.
Often when new features are added, they require special treatment for truncated cases, e.g., in the ASDDataset and the WaveformDataset. This typically takes the form of a truncate_dataset_domain method.
When new features are added that deal directly with the data (e.g., factoring out the chirp), this will require two cases, as in time_translate_data.

Moving truncation to the final preprocessing step will add about 2% to preprocessing costs (since we have to carry around frequencies below f_min), but it would mean the code is simpler and more maintainable (since we won't have to think about whether we are working with truncated or non-truncated data), and also it is consistent with standard assumptions in LIGO software (e.g., waveform generation routines).

The above discussion only applies to frequencies between 0 and f_min. We can still maintain the capability to reduce the frequency range from the dataset frequency range by specifying new f_min and/or f_max. This would be implemented via an argument to WaveformDataset.__init__ that modifies the dataset while loading it:

To increase f_min, the dataset (or svd matrix) is zeroed below the new f_min.
To decrease f_max, the dataset (or svd matrix) is chopped off above the new f_max.

This way, though, we can always assume frequencies start at 0.
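A minimal sketch of that loading-time adjustment, assuming the frequency axis is the last one and starts at 0 with uniform spacing delta_f. The function name and arguments are hypothetical and only meant to show the two cases:

```python
import numpy as np


def adjust_frequency_range(data, delta_f, f_min_new=None, f_max_new=None):
    """Return a copy of data restricted to a narrower frequency range.

    The last axis of data holds frequency bins 0, delta_f, 2*delta_f, ...
    Increasing f_min zeroes the bins below it; decreasing f_max drops the
    bins above it, so the result still starts at 0.
    """
    data = np.array(data, copy=True)
    if f_max_new is not None:
        n_keep = int(round(f_max_new / delta_f)) + 1
        data = data[..., :n_keep]
    if f_min_new is not None:
        idx_min = int(round(f_min_new / delta_f))
        data[..., :idx_min] = 0.0
    return data


# Usage: dataset stored on [0, 1024], restricted to [20, 512] when loading.
delta_f = 0.5
freqs = np.arange(0.0, 1024.0 + delta_f / 2, delta_f)
waveform = np.ones(len(freqs), dtype=complex)
adjusted = adjust_frequency_range(waveform, delta_f, f_min_new=20.0, f_max_new=512.0)
```

The same function could be applied to the rows of the SVD matrix, since it only acts on the frequency axis.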

mpuerrer commented 2 years ago

This sounds like a good solution to me. It's important that the code is easy to understand, and the current code seems more complicated than it needs to be. I'd choose a 2% increase in computational cost for significantly simpler code in a heartbeat. That's just my opinion, and Max should comment as well.

dingo-gw / dingo

Domains #14