exoplanet-dev / exoplanet

Fast & scalable MCMC for all your exoplanet needs!
https://docs.exoplanet.codes
MIT License

Wavelet Log Likelihood (Carter & Winn 2009) #76

Closed: exowanderer closed this issue 4 years ago

exowanderer commented 4 years ago

Is your feature request related to a problem? Please describe. I'm using exoplanet to integrate PyMC3 with efficient transit modeling. I have the base system running smoothly, thanks to the great tutorials. Thank you.

I was requested by a collaborator to include the Carter & Winn (2009) Daubechies wavelet likelihood.

(1) I am able to generate the log_likelihood as a float value. So the first feature request I would make is: how do I add that into the xo.optimize or pm.sample calls?

(2) If we can bypass wavelets altogether by using a GP to estimate sigma_r and sigma_w efficiently, then I would be more than happy to do that. We've previously discussed using a GP to model the Daubechies wavelet likelihood, but I would need help selecting the GP + kernel combination to do it.

As a secondary feature request, I would ask for a tutorial on estimating the residual red noise in a light curve in the form that our community is used to putting in tables.

Describe the solution you'd like I see from #70 that adding in a likelihood function is feasible, but I get the impression that there is more to it than a simple xo_logp + wavelet_logp.

Is there a way to take a function (with 2 inputs + 3 PyMC3 parameters) and "add" it to the likelihood computed inside exoplanet?

Describe alternatives you've considered I grabbed the necessary bits of code from @pcubillos's package mc3 https://github.com/pcubillos/mc3/blob/master/mc3/stats/stats.py [line 209]

I can compute the wavelet likelihood (outside of PyMC3+XO) as a float value; but how do I wrap this into something that xo can interpret as an addition to the built-in log-likelihood?

Additional context If there is a much simpler way to use the GP module to estimate sigma_r and sigma_w, then I could bypass this whole issue.

At the same time, others may still want a wavelet likelihood added onto their light curve posteriors. As such, may I suggest this as a feature regardless.

exowanderer commented 4 years ago

In a directly related problem, I am able to compute the wavelet likelihood given the model_prediction and data in np.array format (the two are subtracted internally to form the residuals).

The function call is

dwt_chisq(model_prediction, data, [gamma, sigma_r, sigma_w])

To run this with xo via PyMC3, model_prediction is a tensor (i.e. theano.tensor.var.TensorVariable); gamma, sigma_r, and sigma_w are also each of type theano.tensor.var.TensorVariable.

In order to run the dwt_chisq as is, I would need to output a numpy array from each of those tensors. What is the appropriate method to do so?

I tried .eval(), but that produced very little useful output. Please let me know if I'm off base, e.g. "it's not possible to output an np.array type" or "that's not how you use .eval()".

Thank you!
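(A minimal sketch of the tensor-to-numpy part of this question, assuming that model_prediction is defined inside a pm.Model named model and using exoplanet's eval_in_model helper: a bare .eval() only works on tensors with no free PyMC3 inputs, but a tensor can be evaluated numerically at one specific parameter point, e.g. the model's test point.)

import exoplanet as xo

with model:
    # Evaluate the symbolic tensor at a fixed parameter point; the result is an
    # ordinary numpy array that dwt_chisq can accept. This is only useful for
    # one-off checks: during sampling the computation has to stay symbolic.
    model_prediction_np = xo.eval_in_model(model_prediction, model.test_point)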

dfm commented 4 years ago

Hi Jonathan,

The catch here is that you need to provide the wavelet likelihood as a Theano op and you need to be able to compute the gradient of this operation with respect to its inputs. Here's how I worked out the gradients for celerite and I bet something similar would be possible (and maybe not even too onerous!) for the C&W method.
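For concreteness, here is a minimal, untested sketch of wrapping dwt_chisq (assuming the mc3.stats import path referenced above, and a hypothetical wrapper name wavelet_loglike). Note that the as_op route provides no gradient, so it would only work with gradient-free samplers, not NUTS or xo.optimize; that is exactly the catch described above.

import numpy as np
import theano.tensor as tt
from theano.compile.ops import as_op
from mc3.stats import dwt_chisq  # assumed import path for the mc3 function linked above

@as_op(itypes=[tt.dvector, tt.dvector, tt.dscalar, tt.dscalar, tt.dscalar],
       otypes=[tt.dscalar])
def wavelet_loglike(model_prediction, data, gamma, sigma_r, sigma_w):
    # Inside the op the inputs arrive as plain numpy arrays / floats again
    chisq = dwt_chisq(model_prediction, data, [gamma, sigma_r, sigma_w])
    return np.array(-0.5 * chisq, dtype="float64")

# and then, inside the pm.Model block, something like:
# pm.Potential("wavelet", wavelet_loglike(light_curve, tt.as_tensor_variable(data),
#                                         gamma, sigma_r, sigma_w))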

But, that being said, I wouldn't bother! You can use celerite and it'll be just as fast and have fewer restrictions (it can model more complicated power spectra and it doesn't require evenly spaced data). If your model is something like:

log_s2 = pm.Normal(...)
log_Sw4 = pm.Normal(...)
log_w0 = pm.Normal(...)

kernel = xo.gp.terms.SHOTerm(log_Sw4=log_Sw4, log_w0=log_w0, log_Q=np.log(1 / np.sqrt(2)))  # Q = 1/sqrt(2)
gp = xo.gp.GP(kernel, x, yerr ** 2 + pm.math.exp(log_s2))

then log_s2 sets the variance of the white noise component and log_Sw4 will be related to the "red noise" variance that you want. Honestly it's not obvious that the sigma_r parameter is a very meaningful number since its value will depend sensitively on the specific choice of power spectrum model, but it should be possible to calibrate a relationship between that number and the parameters of a celerite model if you want. Take a look at Section 4 of the celerite paper to see more about the model specification there.
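For reference, the SHOTerm power spectrum (from Section 4 of the celerite paper, and the same form used in the Julia snippets below) is

S(\omega) = \sqrt{2/\pi} \, S_0 \, \omega_0^4 / [ (\omega^2 - \omega_0^2)^2 + \omega_0^2 \omega^2 / Q^2 ],

so log_Sw4 = log(S_0 \omega_0^4) sets the amplitude of the correlated ("red") component, while exp(log_s2) is an independent white-noise variance added along the diagonal.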

Hope this helps!

ericagol commented 4 years ago

@exowanderer

To add to what @dfm said, the wavelet model from the Carter & Winn paper solves for the correlated noise component assuming a specific power-spectrum: f^{-1}, aka pink noise. (Their paper in principle allows for other power-laws, f^{-\nu}, but the fast scaling of the algorithm only works for \nu=1, if I remember correctly).

This power-spectrum is not a great description of stellar variability. It has a shallower scaling at high frequency, which causes sharper variations in the correlated noise component than does a stochastically-driven simple damped harmonic oscillator. It is also even stronger at high frequency than a damped random walk (aka an Ornstein-Uhlenbeck process). In practice, however, what I expect this means is that the short-timescale variability will be conflated with the white noise component of variability.

Also, at low frequencies f^{-1} is too large - in fact, it diverges, which means that there is power on long timescales. In practice this may not be too much of an issue since any dataset has a finite duration. So, I think f^{-1} worked well for Carter & Winn thanks to these saving factors.

That said, there may be ranges of frequency where an f^{-1} slope is valid, and that can be approximated well with, e.g., three Q=1/2 celerite terms, as in this plot:

[Figure: CarterWinn_vs_celerite, comparing the f^{-1} (pink noise) spectrum with the sum of three Q=1/2 celerite SHO terms]

Comparing these, you can see that the pink noise continues to rise at low frequencies, and dominates at high frequencies. But, as I mentioned above, a white-noise component will always dominate at the highest frequencies. So, as Dan mentioned, you can approximate this wavelet spectrum over some dynamic range of the power spectrum with the celerite GP.

dfm commented 4 years ago

@ericagol: Do you mean Q=1/sqrt(2) rather than 1/2?

ericagol commented 4 years ago

I used Q=1/2 for this plot. Here's the Julia code:

using PyPlot
clf()
nu = 10.0.^(collect(-1:0.01:1.5))
# PSD of a single SHO term with Q = 1/2 (up to normalization): S(nu) = s0 / ((nu/nu0)^2 + 1)^2
s(s0,nu0,nu) = s0/((nu/nu0)^2+1)^2

s1 = s.(1.0,1.0,nu)
s2 = s.(0.25,4.0,nu)
s3 = s.(0.0625,16.0,nu)
loglog(nu, s1 .+ s2 .+ s3, label=L"3 Q=$1/2$ terms")
loglog(nu, 0.52 ./nu, label = L"Pink: $f^{-1}$")
loglog(nu, s1, alpha=0.3,color="C0")
loglog(nu, s2, alpha=0.3,color="C0")
loglog(nu, s3, alpha=0.3,color="C0")
legend()
axis([0.1,30.0,1e-2,10.0])
xlabel(L"$\omega$")
ylabel(L"$S(\omega)$")

dfm commented 4 years ago

@ericagol: there will be numerical issues for Q=1/2, so I generally recommend Q=1/sqrt(2) or something else that's not exactly 1/2.

ericagol commented 4 years ago

Right, that's the Matern kernel. Here's a case with Q=1/3 which looks even better.

[Figure: CarterWinn_vs_celerite_Qthird, comparing the f^{-1} spectrum with the sum of three Q=1/3 celerite SHO terms]

using PyPlot
clf()
nu = 10.0.^(collect(-1:0.01:1.5))
# Full SHO power spectrum with quality factor Q:
# S(nu) = sqrt(2/pi) * s0 * nu0^4 / ((nu^2 - nu0^2)^2 + (nu*nu0)^2 / Q^2)
s(s0,nu0,nu,Q) = sqrt(2/pi)*s0*nu0^4/((nu^2-nu0^2)^2+(nu*nu0)^2/Q^2)

#loglog(nu,1.0./(nu.^2 .+1).^2 .+ 0.25 ./ ((nu./4).^2 .+1).^2 .+ 0.0625 ./ ((nu ./16).^2 .+ 1).^2)
s1 = s.(1.4,0.8,nu,0.33)
s2 = s.(0.25,4.0,nu,0.33)
s3 = s.(0.0725,20.0,nu,0.33)
loglog(nu, s1 .+ s2 .+ s3, label=L"3 Q=$1/3$ terms")
loglog(nu, 0.27 ./nu, label = L"Pink: $f^{-1}$")
loglog(nu, s1, alpha=0.3,color="C0")
loglog(nu, s2, alpha=0.3,color="C0")
loglog(nu, s3, alpha=0.3,color="C0")
legend()
axis([0.1,30.0,1e-2,10.0])
xlabel(L"$\omega$")
ylabel(L"$S(\omega)$")

exowanderer commented 4 years ago

log_s2 = pm.Normal(...)
log_Sw4 = pm.Normal(...)
log_w0 = pm.Normal(...)

kernel = xo.gp.terms.SHOTerm(log_Sw4=log_Sw4, log_w0=log_w0, log_Q=np.log(1 / np.sqrt(2)))  # Q = 1/sqrt(2)
gp = xo.gp.GP(kernel, x, yerr ** 2 + pm.math.exp(log_s2))

then log_s2 sets the variance of the white noise component and log_Sw4 will be related to the "red noise" variance that you want.

You are both great! Thank you for the discussion and, especially, the code segments. I'm a big fan of GPs; I just have a lot of trouble picking out the 'best' kernel + hyperparameter combination for a given problem set.

For instance, I read many sections of the celerite paper, and we had talked in person about using the SHO kernel as a replacement for the CW-wavelets. My issue is that I find it difficult to know which SHO hyperparameters to set. I would easily have let Q float instead of setting it to 1/3 or 1/sqrt(2). I still don't fully know what prior bounds to place on the log_s2, log_Sw4, log_w0 terms, but I'll probably model them after these examples: example1, example2.

Side question: what did you (both) mean when you said "Q=1/2 is the Matern kernel"? Related to the above question, I was already trying out the Matern kernel because I thought it was the appropriate way to characterize "1/f" noise.

This discussion has been a big help. I already cited the exoplanet+celerite paper; I think that I'll also add a footnote to this discussion.

exowanderer commented 4 years ago

As a secondary conversation: when collaborating on an observation paper, the conversation often starts with 'we need to measure the red noise', but then it also includes 'GPs are too complicated/versatile'. I have personally said, and heard from others, phrases such as "GPs may eat the transit/eclipse"; i.e. when fitting a GP freely, it can 'model' the transit/eclipse as a correlated noise source.

Given that academic inertia, may I suggest that we make small examples (i.e. notebooks) showing how to use exoplanet+celerite to reproduce the previously established norms in our field.

In the celerite paper [page 34], you discuss CW-wavelets, their limitations, and how to approximate them using celerite with S(w) ~ w^(-1). I didn't know what that would look like in code format until today.

I can also make a notebook comparing CW-wavelets to celerite+SHO. I'll try to make that soon -- i.e., this semester -- then PR it for you to include if you wish. I think our community would find it helpful to know where celerite can be used as an improvement on (or just a replacement for) the existing methods.

Thank you both, and the team, for this great work!

exowanderer commented 4 years ago

Combining example1 and example2, here is what I came up with:

def build_gp_pink_noise(time, data, dataerr, Q=1.0 / np.sqrt(2)):

    log_S0 = pm.Normal("log_S0", mu=0.0, sigma=15.0,
                       testval=np.log(np.var(data)))
    log_w0 = pm.Normal("log_w0", mu=0.0, sigma=15.0,
                       testval=np.log(3.0))
    log_Sw4 = pm.Deterministic("log_variance_r", log_S0 + 4 * log_w0)

    log_s2 = pm.Normal("log_variance_w", mu=0.0, sigma=15.0,
                       testval=np.log(np.var(data)))

    kernel = xo.gp.terms.SHOTerm(log_Sw4=log_Sw4, log_w0=log_w0, log_Q=np.log(Q))
    gp = xo.gp.GP(kernel, time, dataerr ** 2 + pm.math.exp(log_s2))

    return gp

My script then calls this inside a model block that also creates PyMC3 priors for edepth, mean, etc.:

with pm.Model() as model:
    mean = pm.Normal("mean", mu=1.0, sd=1.0)

    edepth = pm.Uniform("edepth", lower=0, upper=0.01)
    sqrt_edepth = pm.math.sqrt(edepth)

    # Compute the model light curve using starry
    # Set up a Keplerian orbit for the planets
    light_curve = star.get_light_curve(orbit=orbit, r=sqrt_edepth, t=times)

    # The likelihood function assuming known Gaussian uncertainty
    pm.Normal("obs", mu=light_curve, sd=dataerr, observed=data)

    # Build GP model
    gp = build_gp_pink_noise(times, data, dataerr, Q=1.0 / np.sqrt(2))

    map_soln = xo.optimize(start=model.test_point)
    trace = pm.sample(
        tune=3000,
        draws=3000,
        start=map_soln,
        chains=mp.cpu_count(),
        step=xo.get_dense_nuts_step(target_accept=0.9),
        cores=mp.cpu_count()
    )

[New Question]: In my example above, are log_s2 and log_S0 the exact same thing, or do I need both variables? Should I (effectively) set log_s2 = log_S0 and log_Sw4 = log_S0 + 4 * log_w0?

dfm commented 4 years ago

No - log_s2 is the white noise variance. There's no reason why this should be related to the GP amplitude!
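As a minimal sketch of that distinction (inside the pm.Model block, with times and dataerr as in the script above; purely illustrative):

# The GP ("red") amplitude and the white-noise variance are independent parameters
log_S0 = pm.Normal("log_S0", mu=0.0, sigma=15.0)            # GP power-spectrum amplitude
log_w0 = pm.Normal("log_w0", mu=0.0, sigma=15.0)            # SHO characteristic frequency
log_Sw4 = pm.Deterministic("log_Sw4", log_S0 + 4 * log_w0)  # log(S0 * w0**4), the amplitude the kernel takes
log_s2 = pm.Normal("log_s2", mu=0.0, sigma=15.0)            # separate white-noise (jitter) variance

kernel = xo.gp.terms.SHOTerm(log_Sw4=log_Sw4, log_w0=log_w0, log_Q=np.log(1 / np.sqrt(2)))
gp = xo.gp.GP(kernel, times, dataerr ** 2 + pm.math.exp(log_s2))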

exowanderer commented 4 years ago

I created a synthetic data set with pink noise, and then modeled that data with exoplanet + celerite. The full example, which runs end-to-end for me, can be found here

The important part for this discussion is included here


def build_gp_pink_noise(times, data, dataerr,
                        log_Q=np.log(1.0 / np.sqrt(2))):
    ''' Build a pink-noise GP, sampling over log_w0, log_Sw4 (red-noise amplitude), and log_s2 (white-noise variance)

        times (ndarray): time stamps (x-axis values)
        data (ndarray): normalized flux for the transit light curve
        dataerr (ndarray): uncertainty per data point
        log_Q (float): hyperparameter for the SHO kernel [Q=1/sqrt(2) used here to approximate pink noise]
    '''
    log_w0 = pm.Normal("log_w0", mu=0.0, sigma=15.0,
                       testval=np.log(3.0))
    log_Sw4 = pm.Normal("log_variance_r", mu=0.0, sigma=15.0,
                        testval=np.log(np.var(data)))
    log_s2 = pm.Normal("log_variance_w", mu=0.0, sigma=15.0,
                       testval=np.log(np.var(data)))

    kernel = xo.gp.terms.SHOTerm(
        log_Sw4=log_Sw4, log_w0=log_w0, log_Q=log_Q)

    return xo.gp.GP(kernel, times, dataerr**2 + pm.math.exp(log_s2))

I was able to run this model and converge to a meaningful solution; but I have 3 questions about this.

  1. Is this model what you were suggesting?

  2. The results that I am getting are not within my expectations: log_s2 is ~ -30 (very, very small) and log_Sw4 is ~ -5, which is small but always much larger than log_s2.

    I was expecting the log_s2 ~ -10 (i.e. 50 ppm) and the log_Sw4 ~ -12 (i.e. 5 ppm). I think that I set the pink scatter to be 10% of the white scatter (see example, lines 186 + 207, with sigma_r=0.1).

    Would you happen to know why I am finding a much larger log_Sw4 than log_s2? I assume that I screwed something up.

  3. How does the GP method interface with an AICc / BIC calculation?

    I fit 32 models (other than GPs) with a range of AICc / BIC values from 150-350. When I add the GPs, should I assume 3 extra parameters? I am finding improved logp (within 10%). The AICc and BIC both increase by ~4% (more than enough to reject the GP models).

    But I don't know if that is as direct a calculation as I think it is (the standard definitions are sketched just below). And of course, I could always have simply coded something wrong to begin with, GP-wise.
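For reference, a minimal sketch of the standard AICc / BIC bookkeeping implied by the question (counting the three GP hyperparameters log_w0, log_Sw4, and log_s2 as extra free parameters; the function name is just illustrative):

import numpy as np

def aicc_and_bic(max_loglike, n_params, n_data):
    # Standard definitions: AIC = 2k - 2 ln L, AICc adds the small-sample correction,
    # and BIC = k ln n - 2 ln L
    aic = 2 * n_params - 2 * max_loglike
    aicc = aic + 2 * n_params * (n_params + 1) / (n_data - n_params - 1)
    bic = n_params * np.log(n_data) - 2 * max_loglike
    return aicc, bic

# e.g. n_params = (number of transit/eclipse parameters) + 3 for the GP model,
# with max_loglike taken from the GP marginal likelihood term (gp.marginal) at the best fit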


Here is a colab link to see the example in action

exowanderer commented 4 years ago

There's a completely unrelated problem that I've been encountering. I can enter either the inclination, impact parameter, or transit duration, but only one at a time. [this is built in; no problem]

The problem is that I've tried this exact same code on 3 machines, ~10 times per machine. Depending on the machine and the time of day (i.e. randomness), the KeplerianOrbit object does not like one or two of the values in the set [inclination, impact parameter, duration] that I give it. Note that I give it the exact same values every time.

Every 3rd - 5th attempt results in an error that says something like "inclination is a bad value" (I'll edit that when I find it next). So I switch to using duration or impact parameter. That works for 3 - 5 more attempts and then fails, at which point I rotate through the list and it works again.

On the colab link, if you find an error related to [inc, b, or dur], then please go to Cell 4 and swap inc for dur or b as needed.

I should probably make this its own issue.

dfm commented 4 years ago

That second comment should be a separate issue - feel free to open one!

There are a few issues with the first one:

  1. First, you're including the likelihood twice:
        # The likelihood function assuming known Gaussian uncertainty
        pm.Normal("obs", mu=light_curve, sd=dataerr, observed=data)
        gp = build_gp_pink_noise(times, data, dataerr, log_Q=log_Q)
        gp.marginal("gp", observed=data)

The first one is the likelihood assuming white noise (with the quoted error bars) and the second one is assuming both white noise and a GP. The first one shouldn't be there at all!

  2. The GP model currently isn't taking the transit light curve model into account, so the values you're getting are meaningless. The easiest change would be to condition on the residuals (observed=data - light_curve) or you could provide a mean function to the GP:
def lc_model(t):
    light_curves = xo.LimbDarkLightCurve(u).get_light_curve(
        orbit=orbit, r=r, t=t
    )
    return pm.math.sum(light_curves, axis=-1) + mean

gp = xo.gp.GP(..., mean=lc_model)
  3. Finally, I think that the number you actually want will be something like the integrated autocorrelation, which would give you the total power in the GP (although I'm not very familiar with how "pink noise" is defined in the literature you're familiar with, so it might take some more digging; perhaps @ericagol has more thoughts). If that is what you want then you'll want to compute something like:
2 * integral(kernel(dt), {dt, 0, infty})

Which, I think, for an exoplanet celerite kernel will be:

    ar, cr, ac, bc, cc, dc = kernel.coefficients
    variance = 2 * tt.sum(ar / cr)
    variance += 2 * tt.sum(ac * cc / (cc ** 2 + dc ** 2))
    variance += 2 * tt.sum(bc * dc / (cc ** 2 + dc ** 2))

but take this with a grain of salt for now and double-check the math!
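As a rough way to do that double check, here is a minimal sketch (with arbitrary fixed kernel parameters, so everything can be evaluated numerically) that compares the closed-form expression above against a brute-force integral of the celerite kernel definition k(tau) = sum_j ar_j e^{-cr_j tau} + sum_j e^{-cc_j tau} [ac_j cos(dc_j tau) + bc_j sin(dc_j tau)]:

import numpy as np
import theano.tensor as tt
import exoplanet as xo

# A fixed-parameter kernel just for the check (the values are arbitrary)
kernel = xo.gp.terms.SHOTerm(log_S0=0.0, log_w0=np.log(3.0), log_Q=np.log(1.0 / np.sqrt(2)))
ar, cr, ac, bc, cc, dc = kernel.coefficients

# Closed-form variance from the coefficients, as in the snippet above
variance = 2 * tt.sum(ar / cr)
variance += 2 * tt.sum(ac * cc / (cc ** 2 + dc ** 2))
variance += 2 * tt.sum(bc * dc / (cc ** 2 + dc ** 2))

# Brute-force version: integrate k(tau) on a fine grid out to an effectively infinite lag
tau = np.linspace(0.0, 1e3, 200001)[:, None]
k_tau = (tt.sum(ar[None, :] * tt.exp(-cr[None, :] * tau), axis=-1)
         + tt.sum(tt.exp(-cc[None, :] * tau)
                  * (ac[None, :] * tt.cos(dc[None, :] * tau)
                     + bc[None, :] * tt.sin(dc[None, :] * tau)), axis=-1))
variance_numeric = 2.0 * np.trapz(k_tau.eval(), tau[:, 0])

print(variance.eval(), variance_numeric)  # the two numbers should agree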

ericagol commented 4 years ago

My understanding from Carter & Winn is that pink noise has a 1/f power spectrum. However, this power spectrum isn't normalizable, so the ACF doesn't exist! In practice I think it is defined over some range of frequencies, which might correspond to the inverse exposure time and total duration of the data. Then, the ACF is defined in terms of the difference of inverse cosine integrals (which I gathered from some googling).