bilby-dev / bilby

A unified framework for stochastic sampling packages and gravitational-wave inference in Python. Note that we are currently transitioning from git.ligo.org/lscsoft/bilby, please bear with us!
https://bilby-dev.github.io/bilby/
MIT License

Define how sources should be defined #61

Closed bilby-bot closed 2 weeks ago

bilby-bot commented 6 years ago

In GitLab by @git.ligo:gregory.ashton on Mar 29, 2018, 00:52

There are several questions to be answered with respect to how a source is defined. Let's resolve this here.

I can see three ways the source can be implemented. I'm going to list these first, along with how they would then be used by the sampler (in any of the three cases, the user will probably just input a string saying they want a 'cbc' waveform etc).

A) Class with parameters defined at initialisation

Definition:

class Source:
    def __init__(self, name, parameters):
        self.name = name
        self.parameters = parameters

    def model(self, time):
        return self.parameters['m'] * time + self.parameters['c']

    def update_parameters(self, parameters):
        self.parameters = parameters

Use in the sampler:


# Somewhere in the initialisation of the sampler
my_source = Source('my_source', parameters)

...

# Somewhere in the likelihood
my_source.update_parameters(parameters)
waveform = my_source.model(time)

B) Class with parameters passed to model

Definition:

class Source:
    def __init__(self, name):
        self.name = name

    def model(self, time, parameters):
        return parameters['m'] * time + parameters['c']

Use in the sampler:


# Somewhere in the initialisation of the sampler
my_source = Source('my_source')

...

# Somewhere in the likelihood
waveform = my_source.model(time, parameters)

C) Model is a function

Definition:

def model(time, parameters):
    return parameters['m'] * time + parameters['c']

Use in the sampler:


# Somewhere in the initialisation of the sampler
my_source = model # This is a function, not a class instance

...

# Somewhere in the likelihood
waveform = my_source(time, parameters) # Or equivalently, model(time, parameters)

Summary

There is first an argument to be had about whether it should be a class or a function (i.e. A or B vs. C). I'm personally pro-class, because it allows for future flexibility - for example, if you need to initialise an approximant or load some data.

Second is the argument between A and B. I believe most of the sources were initially defined as in A (but without an update_parameters method). Either will work in practice.

Personally, I think B makes more sense because there is no permanence to the set of parameters - they will get updated on every evaluation of the likelihood. One may then wonder, "if you go with B, why not just make it a function, i.e. C?" To illustrate why not, here is an example.

Definition:

class Source:
    def __init__(self, name):
        self.name = name
        self.load_some_data()

    def load_some_data(self):
        self.data = load('my_big_data_file')

    def model(self, time, parameters):
        # here you do something with the data, e.g.
        m = parameters['m'] * self.data
        return m * time + parameters['c']

In this case, making it a function would require you to either load the data every time you evaluate model, or load the data once and pass it to model in the likelihood. This would be cumbersome.
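For concreteness, here is a rough sketch of what the function-based alternative would look like (purely illustrative; load() is the same placeholder helper as in the class example above):

# The data now has to be loaded up front and threaded through every call.
data = load('my_big_data_file')

def model(time, parameters, data):
    m = parameters['m'] * data
    return m * time + parameters['c']

# Somewhere in the likelihood
waveform = model(time, parameters, data)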

TLDR: The argument for classes is that they allow flexibility. The argument for B is that the parameters are not data that needs to be saved or known by the source model. However, there could well be data that should be saved by the class, as in the example above.

I'll leave this open to debate and someone else (perhaps the BDFL @git.ligo:paul-lasky ) to decide which. Once decided, I'm happy to implement any of the three for our current models.

@git.ligo:moritz.huebner @git.ligo:paul-lasky

bilby-bot commented 6 years ago

In GitLab by @git.ligo:colm.talbot on Mar 30, 2018, 00:41

I left this comment (slightly differently) on https://git.ligo.org/Monash/peyote/issues/6, although maybe it would have been better here. (Are these duplicate issues, or am I missing something?)

I'd also prefer a class, and I'd vote for standardisation of the output (personally I don't like having a model method).

class GravitationalWaveSource:
    ...
    def frequency_domain_strain(self, f_array, parameters):
        ...
        return {'plus': foo, 'cross': bar, ...}

    def time_domain_strain(self, t_array, parameters):
        ...
        return {'plus': foo, 'cross': bar, ...}

These could generically be related by an FFT if that's natural.

We can still have the simpler usage given by making your source a single function by adding a __call__ method.

I'm going to implicitly assume we're defaulting to FD.

    def __call__(self, f_array, parameters):
        return self.frequency_domain_strain(f_array, parameters)
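
Putting those pieces together, a minimal sketch of how such a class could hang together (the placeholder waveform, the 'amplitude' parameter and the FFT conventions here are illustrative only, not a proposal for the final API):

import numpy as np

class GravitationalWaveSource:
    def __init__(self, name):
        self.name = name

    def frequency_domain_strain(self, f_array, parameters):
        # Placeholder model: in practice this would call a waveform
        # approximant and return the plus/cross polarisations.
        h_plus = parameters['amplitude'] * np.ones(len(f_array), dtype=complex)
        h_cross = 1j * h_plus
        return {'plus': h_plus, 'cross': h_cross}

    def time_domain_strain(self, t_array, parameters):
        # Related to the frequency-domain strain by an inverse FFT;
        # normalisation conventions are glossed over in this sketch.
        dt = t_array[1] - t_array[0]
        f_array = np.fft.rfftfreq(len(t_array), d=dt)
        fd_strain = self.frequency_domain_strain(f_array, parameters)
        return {key: np.fft.irfft(value, n=len(t_array))
                for key, value in fd_strain.items()}

    def __call__(self, f_array, parameters):
        # Defaulting to the frequency domain, as above.
        return self.frequency_domain_strain(f_array, parameters)

Calling the instance then defaults to the frequency domain, e.g. strain = source(f_array, parameters).
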
bilby-bot commented 6 years ago

In GitLab by @git.ligo:gregory.ashton on Mar 30, 2018, 01:55

+1 for @git.ligo:colm.talbot's suggestion - this would resolve the time/frequency domain problems. Should we also rename the logL likelihood to logL_gravitational_wave or something, since it's not arbitrary but specific?

Also yes - these may be duplicate issues - my bad.

bilby-bot commented 6 years ago

In GitLab by @git.ligo:gregory.ashton on Apr 2, 2018, 00:55

mentioned in commit a1c8dab6363136b543a2fbf63aa56c396149fd13

bilby-bot commented 6 years ago

In GitLab by @git.ligo:gregory.ashton on Apr 4, 2018, 06:31

Okay, I'm going to close this. We can reopen in the future if people want, but for now the code is written using option B.

bilby-bot commented 6 years ago

In GitLab by @git.ligo:gregory.ashton on Apr 4, 2018, 06:31

closed

bilby-bot commented 6 years ago

In GitLab by @git.ligo:moritz.huebner on Apr 10, 2018, 10:19

reopened

bilby-bot commented 6 years ago

In GitLab by @git.ligo:moritz.huebner on Apr 10, 2018, 11:22

Sorry for reopening this one week after the fact, but I was gone last week so I was not able to contribute. I strongly object to the way that the sources are currently being implemented.

The reason for this is that currently dictionaries (defined by a list of strings, which is a static variable within the source class) are being used as data containers. There is absolutely no point in doing this.

@git.ligo:gregory.ashton wrote that "Personally, I think B makes more sense because there is no permanence to the set of parameters - they will get updated on every evaluation of the likelihood."

Going from your original proposals, version A would definitely be the typical/conventional way to implement classes. There is no advantage in keeping the data in a Python dictionary instead. Classes in object-oriented programming are supposed to encapsulate the implementation as far as possible. By having the parameters implemented separately, the user now has to carry around both the Python dictionary and the class instance, instead of just having to deal with the class instance.

With respect to performance, there should not really be an issue in holding the data within a class instance rather than in a dictionary. If we are estimating parameters for a single event, you will only need one class instance.

The main issue with not having it the 'conventional' way is readability for outsiders at this point. If I'm reading code, I expect that the actual parameters are being saved within the class instance, because that is the convention. And that is also where this can cause us trouble later on. You won't find good tutorials/best practice guides/sample code that do it this way. And while I have no doubt that you can solve any arising problems eventually, it will make it way harder to search for solutions. For example, you could not use the standard Python patterns for unit testing. You cannot use normal design patterns to check if the values used for the instantiation are allowed.

At this point, I have a branch where I changed everything so that it no longer relies on the dictionaries but instead implements the parameters as attributes on the respective class. It is fully functional - the tutorials are running fine - and is on the git as the 'source_reimplementation' branch, ready to be merged. The change overhauls quite a lot, which is why I have not pushed it. If you are using a Python script, the main change is that you now instantiate classes the following way:

source = peyote.source.BinaryBlackHole('BBH', sampling_frequency, time_duration, mass_1=36., mass_2=29.,
                                       spin_1=[0, 0, 0], spin_2=[0, 0, 0], luminosity_distance=100., iota=0.4,
                                       phase=1.3, waveform_approximant='IMRPhenomPv2', reference_frequency=50.,
                                       ra=1.375, dec=-1.2108, geocent_time=1126259642.413, psi=2.659)

Instead of defining a dictionary. And in order to change attributes, instead of

prior['mass_1'] = peyote.parameter.Parameter('mass_1', prior=peyote.prior.Uniform(lower=35, upper=37), latex_label='$m_1$')

You will now see

prior.mass_1 = peyote.parameter.Parameter('mass_1', prior=peyote.prior.Uniform(lower=35, upper=37), latex_label='$m_1$')

bilby-bot commented 6 years ago

In GitLab by @git.ligo:moritz.huebner on Apr 10, 2018, 13:33

I may also add that it is rather easy to create a key-value dictionary from any class instance with instance.__dict__ anyway, which is also how I resolved most issues.
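
For illustration (a generic Python example, not code from the branch):

class BinaryBlackHole:
    def __init__(self, name, mass_1, mass_2):
        self.name = name
        self.mass_1 = mass_1
        self.mass_2 = mass_2

source = BinaryBlackHole('BBH', 36., 29.)
source.__dict__  # {'name': 'BBH', 'mass_1': 36.0, 'mass_2': 29.0}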

bilby-bot commented 6 years ago

In GitLab by @git.ligo:gregory.ashton on Apr 10, 2018, 23:32

Here are a few thoughts on this - I've tried to organise them to avoid this being one big block of text. I'm personally for the current implementation and these are the reasons why.

1) User interface

By having the parameters implemented separately, it will require the user to now carry around both the python dictionary and the class instance, instead of just having to deal with the class instance.

The users don't have to carry around the source and the parameters though. The user specifies only a likelihood and a prior, which are passed to run (or the sampler). All the complexity is hidden inside tupak.

Moreover, we still (on your current branch) have an active_parameter_values dictionary. But rather than passing this to loglikelihood(active_parameter_values) we set it and then call loglikelihood(). So we haven't reduced the amount of data that is being handled by making it part of the source class. We just obscure how that data is passed around.

Finally, I see the source.py class as part of the user interface. People will be writing their own sources and we must make this as easy as possible for them (more on this in a moment).

2) Instantiating sources

I think my fundamental issue can be summed up by looking at the BasicTutorial.py file on your branch, where the source is initialised as follows:

source = peyote.source.BinaryBlackHole('BBH', sampling_frequency, time_duration, mass_1=36., mass_2=29.,
                                       spin_1=[0, 0, 0], spin_2=[0, 0, 0], luminosity_distance=100., iota=0.4,
                                       phase=1.3, waveform_approximant='IMRPhenomPv2', reference_frequency=50.,
                                       ra=1.375, dec=-1.2108, geocent_time=1126259642.413, psi=2.659)

This makes perfect sense if you are injecting a source...but what about if you are searching some data where you don't know any of the parameters yet? How should the user define the source?

Well, okay, you could say let's give them all default values of None, then run through and check whether they are set, and if not they will be overwritten by the sampler. But this opens the door to unexpected behaviour. For example, if someone writes their own source function and sets the default to 0, they will be very confused as to why their results always come out wrong without any error being raised.

There are ways around this of course - adding checks etc.
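
For instance, a check along these lines could catch unset parameters before the model is evaluated (a rough sketch; check_parameters_are_set is a hypothetical name, not existing code):

class Source:
    def __init__(self, name, **parameters):
        self.name = name
        self.parameters = parameters

    def check_parameters_are_set(self):
        # Refuse to evaluate the model while any parameter is still None,
        # i.e. has not been filled in by the user or the sampler.
        unset = [key for key, value in self.parameters.items() if value is None]
        if unset:
            raise ValueError('Parameters not set: {}'.format(unset))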

3) Misc.

For example, you could not use the standard Python patterns for unit testing. You cannot use normal design patterns to check if the values used for the instantiation are allowed.

But the instantiation is only ever going to be valid for injections. And you can just use a normal routine to check the parameters are valid.

This may be unrelated, but the branch has this in the source.py class; it appears to call itself.

    def copy(self):
        copy = BinaryBlackHole(self.name, self.sampling_frequency, self.time_duration, self.mass_1, self.mass_2,
                               self.luminosity_distance, self.spin_1, self.spin_2, self.iota, self.phase,
                               self.waveform_approximant, self.reference_frequency, self.ra, self.dec,
                               self.geocent_time, self.psi)
        return copy 

Maybe this is a clever recursive trick, but if this is required for every source class I think it is a bad idea - we hope that non-experts will be able to write source classes and requiring things like this will not be appealing.
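
(For what it's worth, a generic copy can be obtained from the standard library without any per-class boilerplate - a sketch, not a proposal for the API:)

import copy

class Source:
    def __init__(self, name, **parameters):
        self.name = name
        self.parameters = parameters

source = Source('my_source', m=2.0, c=1.0)
# copy.deepcopy gives an independent copy of any source instance without
# a hand-written per-class copy method.
source_copy = copy.deepcopy(source)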

Summary

I don't think this current implementation breaks with convention, nor will it cause any difficulties. Philosophically, the parameters are not data of the source object. This becomes evident when instantiating a new source - we should not require a set of parameters - we don't mean a BBH source with some specific masses etc., but a general BBH source with any masses etc.

bilby-bot commented 6 years ago

In GitLab by @git.ligo:gregory.ashton on Apr 10, 2018, 23:34

But some common ground - I do actually think we are handling too many parameters in the parameters dictionary. Certain things that will never be searched over should be removed (for example the waveform_approximant). The parameters dictionary should only ever contain floats (since this is what our nested samplers/MCMC can handle).
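
To make the distinction concrete, something along these lines (an illustrative split, not existing code):

# Sampled quantities: float-valued, may be searched over.
parameters = {'mass_1': 36., 'mass_2': 29., 'luminosity_distance': 100.}

# Fixed settings: never searched over, so kept out of the parameters dictionary.
waveform_settings = {'waveform_approximant': 'IMRPhenomPv2', 'reference_frequency': 50.}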

bilby-bot commented 6 years ago

In GitLab by @git.ligo:gregory.ashton on Apr 10, 2018, 23:35

I'm also happy to be overruled on this. Ultimately, if it works I don't care. I'm resisting because I think the current method makes more sense and will reduce the time to get a usable code out.

bilby-bot commented 6 years ago

In GitLab by @git.ligo:moritz.huebner on Apr 11, 2018, 00:07

Just to add to this, I haven't implemented a perfect solution within my branch yet, and I also fully concede that it works the way it is implemented in origin master right now. With respect to the sampler, I have only made it work with my changes for now. I can improve the look of that later.

I'm just really objecting to this, because it goes against the experience I have from working with object orientation both at uni and in industry. Implementing the instance attributes outside the class itself definitely breaks with convention, and you won't find it done this way if you read tutorials or books about object orientation. If you can find any sort of legitimate guide or book that recommends it, I am happy to read it, but I do not think you will.

On 2)

This makes perfect sense if you are injecting a source...but what about if you are searching some data where you don't know any of the parameters yet? How should the user define the source?

I agree on that issue in the sense that the current way we implement parameters is not that good. Parameters can be either primitive types, such as floats, or instances of the Parameter() class (i.e. with a prior distribution attached and so on). This is definitely not good practice in any case, because you can't easily know which is which later on (you solved this within the sampler by checking what type the parameters are). Python does not enforce type safety (unlike Java or C), which means that types aren't set in stone. That does not mean, however, that you should not try to keep them consistent.

Without going into implementation details, a better practice would be to implement all parameters as Parameter() and use the class constructor to instantiate each parameter, so that the user can still use the same code for instantiation. You can use this to assert that all parameters are of the same type, and it does not require you to change anything about the way you instantiate the source.
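
A rough sketch of that idea (the Parameter class below is a hypothetical stand-in, not the existing peyote.parameter.Parameter signature):

class Parameter:
    # Hypothetical stand-in for peyote.parameter.Parameter.
    def __init__(self, name, value=None, prior=None, latex_label=None):
        self.name = name
        self.value = value
        self.prior = prior
        self.latex_label = latex_label

class Source:
    def __init__(self, name, **parameters):
        self.name = name
        for key, value in parameters.items():
            # Coerce primitives (floats, lists, ...) into Parameter
            # instances so every attribute has a consistent type.
            if not isinstance(value, Parameter):
                value = Parameter(key, value=value)
            setattr(self, key, value)

source = Source('BBH', mass_1=36., mass_2=29.)
source.mass_1.value  # 36.0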

On 3)

This is just a simple copy function that allows you to create a copy of the current instance. It just overrides a standard method from the Object() class.

Regarding your summary

Philosophically, the parameters are not data of the source object.

This is just not true. The parameters are data of a source object. There is no single source object, because you may want to create many instances of the one source class. Again, this is just basic practice in object-oriented programming. Any instance of the class should have a set of parameter values.

we don't mean a a BBH source with some specific masses etc, but a general BBH source with any masses etc.

That's what you define within a class. More on this later, and I am happy to meet up. Sorry for having to be so adamant about it, but the way I see it right now is that it will just cause us more trouble down the road if we don't fix it right away.

bilby-bot commented 6 years ago

In GitLab by @git.ligo:gregory.ashton on Apr 11, 2018, 10:11

We can close this now right?

bilby-bot commented 6 years ago

In GitLab by @git.ligo:moritz.huebner on Apr 11, 2018, 13:25

closed

bilby-bot commented 3 weeks ago

In GitLab by @git.ligo:michael.williams on Oct 3, 2024, 17:57

unassigned @git.ligo:gregory.ashton