Hmm, I don't see an explicit size attribute in the `WaveformDataset` class. `__len__(self)` returns `len(self.parameters)`, which would have been updated automatically. Maybe you are thinking about a different place downstream where this could cause an issue?
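For concreteness, a minimal sketch of the size logic being described; `__len__` and `self.parameters` are from the discussion, the surrounding structure is assumed:

```python
import pandas as pd

class WaveformDataset:
    """Sketch of the size bookkeeping discussed above (structure assumed)."""

    def __init__(self, parameters: pd.DataFrame, polarizations: dict):
        self.parameters = parameters        # one row per waveform
        self.polarizations = polarizations  # arrays of h_plus / h_cross

    def __len__(self):
        # There is no explicit size attribute: the length follows the
        # parameter table, so dropping failed waveforms from
        # self.parameters updates the reported size automatically.
        return len(self.parameters)
```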
The dataset size as specified in the YAML file is `num_samples` (or `num_training_samples` for the SVD-related dataset). That number would indeed be out of sync if one or more waveforms had failed to generate, but I'm not sure whether we'd want to overwrite the user-input YAML file by default.
Regarding the suggested alternative, I agree that it would be a bit cumbersome to replace the failed waveforms.
Ah, yes, `num_samples`. The entire YAML file contents get saved along with the dataset, including this number. I think we would have to override it.
I also found a bug in the main dataset construction that I had missed before: the transform is not being applied upon failure. I don't like that the transform is applied at the end of `generate_hplus_hcross()`. This should be decoupled and done in a separate step: I should not need to repeat the transform step for each of the multiple exit points of `generate_hplus_hcross()`, and given the name `generate_hplus_hcross()` I'd expect this method to do just that and no more. The application of the transform can happen in a separate function call.
> `num_samples`

Ok, so just updating `dataset_dict[settings['num_samples']]` should fix that, I think.
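A sketch of what that bookkeeping could look like (the `dataset_dict` layout here is assumed for illustration; it may not match the actual structure):

```python
def sync_num_samples(dataset_dict):
    """Overwrite the saved num_samples setting with the actual dataset size,
    in case some waveforms failed to generate and were dropped.
    (The dataset_dict layout here is assumed for illustration.)"""
    dataset_dict['settings']['num_samples'] = len(dataset_dict['parameters'])
    return dataset_dict
```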
Well, it's still generating hplus and hcross, just in a different (compressed) representation. We do it this way because it's totally analogous to how such transforms are applied in the `WaveformDataset`; in particular, the call is the same whether it compresses or not. More importantly, though, it means that we don't carry around a bunch of uncompressed waveforms when we don't need to.
I can see a reason why one might want to apply the transform to a waveform that fails to generate: if it changes the dimensionality of the data (e.g., compression), then not applying it could cause a problem when stacking the collection of waveforms. I see why you are sending the array of NaN rather than raising an error (so that the `map` in `generate_waveforms_parallel` functions properly). But what if, instead of returning at that point, one sets `wf_dict = {'h_plus': pol_nan, 'h_cross': pol_nan}`? Then the transform is applied, and I think it should still return NaN, but with the correct dimension.
> > `num_samples`
>
> Ok, so just updating `dataset_dict[settings['num_samples']]` should fix that, I think.

Right, I think that should do it.
> I can see a reason why one might want to apply the transform to a waveform that fails to generate: if it changes the dimensionality of the data (e.g., compression), then not applying it could cause a problem when stacking the collection of waveforms. I see why you are sending the array of NaN rather than raising an error (so that the `map` in `generate_waveforms_parallel` functions properly).

Indeed.
> But what if, instead of returning at that point, one sets `wf_dict = {'h_plus': pol_nan, 'h_cross': pol_nan}`? Then the transform is applied, and I think it should still return NaN, but with the correct dimension.

Yes, that will work. (I thought of putting it into a `finally` clause, but that should be reserved for cleanup actions.)
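For concreteness, a minimal sketch of that control flow (everything here is illustrative: `lal_generate` stands in for the actual waveform call, and the real code differs in detail):

```python
import numpy as np

def lal_generate(parameters, frequencies):
    """Placeholder for the actual LAL waveform call; may raise on failure."""
    raise RuntimeError('waveform generation failed')

def generate_hplus_hcross(parameters, frequencies, transform=None):
    try:
        wf_dict = lal_generate(parameters, frequencies)
    except Exception:
        # On failure, substitute NaN polarizations instead of raising, so the
        # parallel map in generate_waveforms_parallel keeps running.
        pol_nan = np.full(len(frequencies), np.nan, dtype=complex)
        wf_dict = {'h_plus': pol_nan, 'h_cross': pol_nan}
    # The transform is applied on both paths, so a dimension-changing
    # transform (e.g., SVD compression) yields consistently shaped output;
    # failed waveforms still come out as NaN, but with the right dimension.
    if transform is not None:
        wf_dict = transform(wf_dict)
    return wf_dict
```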
I understand why what is in the code is convenient. Still, I think the fact that the transformation is being applied in `generate_hplus_hcross()` is misleading and confusing. A function or method should do only the one thing that its name implies. It is also not immediately obvious what it will output, given that it all depends on which transformation has been added; in fact, I was baffled for about half an hour by the strange bug and then realized that a transform had been snuck in. By having another wrapper `transform()` around `generate_hplus_hcross()`, this would be much clearer and would also have avoided carrying unused data along.
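Building on the sketch above, the proposed decoupling could look like this (the wrapper name `transform()` follows the comment; `generate_hplus_hcross` is assumed to return raw polarizations only, with no transform applied inside):

```python
def transform(parameters, frequencies, compression):
    # Wrapper around generate_hplus_hcross(): generation and transformation
    # are two explicit steps, so a compression cannot be "snuck in", and no
    # uncompressed waveforms are carried along afterwards.
    wf_dict = generate_hplus_hcross(parameters, frequencies)  # raw h_plus/h_cross
    return compression(wf_dict)
```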
The new version makes the two changes we just discussed and runs properly for me in a small test case.
I also added `SEOBNRv4PHM_maximum_starting_frequency()` for estimating the largest starting frequency for EOB, given a total mass value. This should be used when setting up a data set. I think this is ready to be merged.
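The implementation isn't shown in this thread, but as a hedged illustration of the idea: gravitational-wave frequencies scale as 1/M with total mass, so a fixed dimensionless cutoff translates into a mass-dependent maximum frequency in Hz (the cutoff `Mf_max` below is a placeholder, not the value used in the code):

```python
MTSUN_SI = 4.92549e-06  # GM_sun / c**3: one solar mass expressed in seconds

def maximum_starting_frequency(total_mass, Mf_max=0.003):
    """Estimate the largest starting frequency (Hz) for an EOB waveform
    with the given total mass (in solar masses), assuming a fixed maximum
    dimensionless geometric frequency M*f = Mf_max (illustrative value)."""
    return Mf_max / (total_mass * MTSUN_SI)
```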
Thanks @mpuerrer, looks good. To clarify: for EOB waveforms, if the user sets `f_ref != f_min` in the settings file, will LAL raise an error because it cannot generate such a waveform?
A couple other points came to mind:

- Is it possible that the generated waveform is nonzero below `f_min`? If that's the case, we should zero it below `f_min`. In fact, the `WaveformDataset` does this automatically, but this only gets invoked when using the DAG waveform dataset generation; hence, we could get different results depending on whether we use the DAG or not. The problem is averted if the waveform generator zeroes it. Alternatively, I could update the non-DAG script to use the `WaveformDataset` to zero it.
- Since EOB waveform generation may force an `f_min` smaller than we would ultimately want, it might make sense to have a separate `f_min` for waveform generation and for the dataset itself. Then, after generating the waveforms, we zero them below the latter `f_min`. It would be useful for this operation to take place at dataset generation time rather than just prior to inference, because we train the SVD compression at dataset generation time, and there is no point training an SVD that can represent the very low frequencies that we don't care about. What do you think? Alternatively, we could just whiten the waveforms, and the detector noise level at low frequencies would mean the SVD effectively ignores these frequencies.

Actually, I think the easiest way to implement these things is to

- have the waveform generator automatically reduce the starting frequency to `f_ref` for EOB models, and
- update the `WaveformDataset` to zero the waveforms below `f_min` after generation (sketched after this comment).

That way `f_min` really represents the lower bound on the likelihood integral, as we intended, and we can still generate EOB waveforms. It should be pretty clean to implement; I can take care of it.
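A minimal sketch of the zeroing step (the dict-of-arrays layout is assumed for illustration):

```python
import numpy as np

def zero_below_f_min(polarizations, frequencies, f_min):
    """Zero frequency-domain polarizations below f_min, so that f_min acts
    as the lower bound of the likelihood integral."""
    mask = frequencies < f_min
    return {name: np.where(mask, 0.0, h) for name, h in polarizations.items()}
```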
> if the user sets `f_ref != f_min` in the settings file, will LAL raise an error because it cannot generate such a waveform?

It looks like LAL just silently sets `f_ref = f_min` for precessing EOB waveforms; there is no warning or error in a standard environment.

I think we should display a warning message if the user specified `f_ref != f_min` when a precessing EOB model is being used.
> - Is it possible that the generated waveform is nonzero below `f_min`?

I don't think that it can be nonzero, since no EOB waveform has been generated below `f_min`; your suggestion of zeroing out the returned waveform below `f_min` is certainly safe.
> - Have the waveform generator automatically reduce the starting frequency to `f_ref` for EOB models, and

Hmm, I'm not sure that would be "expected behavior". I think that `f_min` should dictate the starting frequency. `f_ref != f_min` should be flagged with a warning, and then `f_ref` should be reset to `f_min`, as sketched below.
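A sketch of that behavior (the string test for a precessing EOB approximant is illustrative):

```python
import warnings

def check_f_ref(approximant, f_ref, f_min):
    # LAL silently uses f_ref = f_min for precessing EOB, so warn the user
    # and reset f_ref explicitly instead of diverging from the settings file.
    if approximant.startswith('SEOBNRv4P') and f_ref != f_min:
        warnings.warn(
            f'{approximant} requires f_ref == f_min; '
            f'resetting f_ref from {f_ref} to {f_min}.'
        )
        f_ref = f_min
    return f_ref
```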
This leaves open your idea of specifying another frequency below which to zero out the waveform, which I think is a useful idea in practice. It should have a distinct and ideally self-explanatory name.
I added a new option `f_start`, which specifies the starting frequency for waveform generation. `f_min` then refers to the equivalent of the lower bound on the likelihood integral, i.e., waveforms are zeroed below `f_min`. `f_start` is optional; if not specified, it is taken to be `f_min`.
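Reading the new option might look like this (a sketch; the key names follow the discussion, the settings dict is assumed):

```python
settings = {'f_min': 20.0}  # example; 'f_start' omitted here

# f_min: lower bound of the likelihood integral; waveforms are zeroed below it.
# f_start: starting frequency for waveform generation; defaults to f_min.
f_min = settings['f_min']
f_start = settings.get('f_start', f_min)
```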
This discussion also now connects to that in #71. Changes there mean that waveforms are zeroed below `f_min` prior to constructing the compression SVD.
Great, I renamed the starting frequency, and also the reference frequency (which had two names as well).
Sounds good. I think that's just a tiny bit clearer that way!
Example settings file which gave me a ~50% failure rate.