I've been looking into this as we discussed. I'm still working on the IS part, but one thing I've noticed is that delta_f is 0.00390625 for the pBilby samples, whereas we usually use 0.125. This seems quite different; perhaps it refers to some other delta_f. I found this delta_f in the official pBilby samples (https://zenodo.org/record/6513631/files/IGWN-GWTC2p1-v2-GW150914_095045_PEDataRelease_mixed_cosmo.h5?download=1) by running the following code:
import pandas as pd
import bilby
import h5py
import numpy as np
import os
import re

# Loading IGWN bilby samples
def h5_to_dict(h5_object, dictionary):
    """Recursive function to convert an h5py object to a dictionary"""
    for key in h5_object.keys():
        item = h5_object[key]
        if isinstance(item, h5py._hl.dataset.Dataset):
            dictionary[key] = item[()]
        elif isinstance(item, h5py._hl.group.Group):
            dictionary[key] = {}
            h5_to_dict(item, dictionary[key])
    return dictionary

# Setting a base_dir for this notebook
base_dir = "/data/nihargupte/projects/hotfixes/dingo-devel/devel_testing/bilby_offset"

# Open the .h5 file
file = os.path.join(base_dir, "IGWN-GWTC2p1-v2-GW150914_095045_PEDataRelease_mixed_cosmo.h5")
f = h5py.File(file, 'r')

# Convert the .h5 file to a dictionary
bilby_result = h5_to_dict(f, {})
# calibration_envelope =
imr_bilby_result = bilby_result["C01:IMRPhenomXPHM"]

# Close the file
f.close()

# Convert the posterior samples (a structured array) to a DataFrame
keys = imr_bilby_result["posterior_samples"].dtype.names
arr = np.stack([
    np.array([imr_bilby_result["posterior_samples"][i][j] for j in range(len(keys))])
    for i in range(imr_bilby_result["posterior_samples"].shape[0])
])
imr_bilby_result["posterior_samples"] = pd.DataFrame(columns=keys, data=arr)
bilby_result['C01:IMRPhenomXPHM']["posterior_samples"] = pd.DataFrame(columns=keys, data=arr)

print(float(bilby_result['C01:IMRPhenomXPHM']["meta_data"]["meta_data"]["delta_f"]))
I will report more after doing the IS with calibration marginalization; for now, I am just making sure all the settings in the pBilby samples are properly converted into a DingoDataset.
Curious. It's not an arbitrary number: 1/0.00390625 = 256 s, but a segment length of 256 s would be completely overkill for a short (<1 s) BBH like GW150914, so chances are it is referring to something else.
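Spelling out the bookkeeping (the frequency resolution is just the reciprocal of the segment length, delta_f = 1/T):

# frequency resolution <-> segment length: delta_f = 1 / T
print(1 / 0.00390625)  # 256.0  -> would correspond to a 256 s segment
print(1 / 8.0)         # 0.125  -> the value we normally use, i.e. an 8 s segment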
It looks like this file was created by pesummary. I don't know precisely how the meta_data are generated, but I agree that delta_f = 0.00390625 does not make sense. In fact, elsewhere in the file it specifies duration = 4.0. Vivien is one of the main authors of pesummary, so maybe it is worth asking him where these meta-data come from.
Note also that the file can be loaded using
from pesummary.io import read
data = read("IGWN-GWTC2p1-v2-GW150914_095045_PEDataRelease_mixed_cosmo.h5")
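If I remember the pesummary interface correctly (worth double-checking against their docs), the per-analysis samples are then accessible directly from the returned object, e.g.:

# Continues from the `data` object returned by read() above.
print(data.labels)                                 # analysis labels, e.g. 'C01:IMRPhenomXPHM'
samples = data.samples_dict["C01:IMRPhenomXPHM"]   # posterior samples, keyed by parameter name
print(list(samples.keys())[:5])                    # a few of the parameter names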
Ah, thank you! I've been putting everything in a dict till now. Should've figured it was PESummary.
Been working on this a bit more now that I am in the office...
Interesting notes:
- The GPS time and duration are 1126259460.3909912 and 4 s, as opposed to the GWOSC values of 1126259462.4 and 8 s that we use for inference. I don't see any buffer time in the data.
- bilby_result["C01:IMRPhenomXPHM"]["meta_data"]["meta_data"]["sampling_frequency"] = 2048.0
I am trying to importance sample with the given GPS time; does anyone know if GWOSC is down at the moment? The following call
from gwpy.timeseries import TimeSeries

TimeSeries.fetch_open_data(
    "H1",
    1126259460,
    1126259464,
)
hangs for hours. Once this is done I can run with calibration marginalization.
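In case it helps narrow down whether this is a GWOSC outage or a local issue, the fetch can be made more talkative (verbose and cache are standard gwpy fetch_open_data keywords, as far as I know):

from gwpy.timeseries import TimeSeries

# verbose=True prints which GWOSC URLs are being fetched; cache=True reuses a
# previously downloaded file instead of re-downloading it.
strain = TimeSeries.fetch_open_data(
    "H1", 1126259460, 1126259464, verbose=True, cache=True
)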
@max-dax Which geocent_time did you use in the Dingo analysis of this event for the review?
A quick update on this: I have run importance sampling with calibration marginalization and am now in the process of comparing the likelihoods against the official samples. For Dingo I am getting log likelihoods in the range (-16947, -16020), whereas Bilby has log likelihoods in the range (305, 328). My initial thought was that this is because the Bilby likelihoods are unnormalized, but after normalizing (in code):
import h5py
import pandas as pd
from scipy.special import logsumexp

# Loading data; ignore this if you are using PESummary instead
# Loading IGWN bilby samples
def h5_to_dict(h5_object, dictionary):
    """Recursive function to convert an h5py object to a dictionary"""
    for key in h5_object.keys():
        item = h5_object[key]
        if isinstance(item, h5py._hl.dataset.Dataset):
            dictionary[key] = item[()]
        elif isinstance(item, h5py._hl.group.Group):
            dictionary[key] = {}
            h5_to_dict(item, dictionary[key])
    return dictionary

# Open the .h5 file
fname = "IGWN-GWTC2p1-v2-GW150914_095045_PEDataRelease_mixed_cosmo.h5"
f = h5py.File(fname, 'r')

# Convert the .h5 file to a dictionary
bilby_result = h5_to_dict(f, {})

# Close the file
f.close()

# Normalize the stored per-sample log likelihoods
bilby_result['C01:IMRPhenomXPHM']["posterior_samples"] = pd.DataFrame.from_records(data=bilby_result['C01:IMRPhenomXPHM']["posterior_samples"])
bilby_log_likelihood = bilby_result['C01:IMRPhenomXPHM']["posterior_samples"]["log_likelihood"]
normalized_bilby_log_likelihoods = bilby_log_likelihood - logsumexp(bilby_log_likelihood)
I am getting log likelihoods in the range (-30, -7), so still very different. Probably I am missing the convention for how the Bilby samples are stored. @max-dax, when you did the comparison without calibration marginalization, did you see this normalization difference as well?
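For a like-for-like comparison I am applying the same normalization to the Dingo side; a small helper for that (dingo_log_likelihood below is just a placeholder for however the Dingo per-sample log likelihoods end up being stored):

import numpy as np
from scipy.special import logsumexp

def normalize_log_likelihoods(log_l):
    """Shift log likelihoods so that the corresponding weights sum to one."""
    log_l = np.asarray(log_l, dtype=float)
    return log_l - logsumexp(log_l)

# e.g. compare normalize_log_likelihoods(dingo_log_likelihood) against
# normalized_bilby_log_likelihoods from the snippet above.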
Those Bilby likelihoods sound like they are not including the log noise evidence (see, e.g., https://github.com/dingo-gw/dingo-devel/blob/main/dingo/gw/likelihood.py#L254). For the saved Bilby result, this is
In [17]: f['C01:IMRPhenomXPHM']['meta_data']['sampler']['ln_noise_evidence'][0]
Out[17]: -7288.12
Adding this in gets us halfway to the ~ -16k log likelihoods you reported. This is probably because the Bilby analysis used duration = 4.0 and your analysis used 8.0.
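Concretely, a sketch of what "adding this in" would look like, reusing the bilby_result dictionary and bilby_log_likelihood from your snippet above:

# Add the stored noise log evidence back onto the per-sample log likelihoods.
ln_noise_evidence = float(
    bilby_result["C01:IMRPhenomXPHM"]["meta_data"]["sampler"]["ln_noise_evidence"][0]
)
total_bilby_log_likelihood = bilby_log_likelihood + ln_noise_evidence
# shifts the range (305, 328) down to roughly (-6983, -6960)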
Looking into this more after our previous discussions: even after correcting the duration and using the correct f_max, the log-likelihoods are still off by ~300. For the following examples I am just taking a single injection point.
(dingo likelihood, bilby likelihood) were (-7326.554948013703, -6965.81572904715).
As a check, we decided to look at the log noise evidence computed by Bilby, but feeding it the same data we use for Dingo. It looks like the Dingo and Bilby noise log-likelihoods are indeed different.
(dingo noise log-likelihood, bilby noise log-likelihood) is (-7027.95, -6083.2828).
Stepping into this with a debugger, I found that the difference arises because the noise-weighted inner products differ by a window factor. If I then pass the window factor to Bilby, I get the same noise log-likelihood!
(dingo noise log-likelihood, bilby noise log-likelihood) is (-7027.95, -7027.95).
OK, so if the window factors are the same, we get the same noise log likelihood.
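For completeness, here is roughly what I mean by the window factor entering the noise-weighted inner product. This is only a sketch of the convention as I understand it (the Tukey roll-off and the mean(w**2) definition are assumptions to be checked against the Dingo code), not the actual Dingo or Bilby implementation:

import numpy as np
from scipy.signal.windows import tukey

def noise_log_likelihood(d, psd, delta_f, window_factor=1.0):
    """log Z_n = -1/2 <d, d>, with <d, d> = 4 * delta_f * sum(|d|^2 / psd).

    The window factor corrects for the power lost when the time-domain data
    are Tukey-windowed before the FFT.
    """
    inner_dd = 4.0 * delta_f * np.sum(np.abs(d) ** 2 / psd) / window_factor
    return -0.5 * inner_dd

# Window power-loss factor for a 4 s segment at 2048 Hz with an assumed 0.4 s
# roll-off (alpha = 2 * roll_off / duration for a Tukey window).
duration, f_s, roll_off = 4.0, 2048.0, 0.4
w = tukey(int(duration * f_s), alpha=2 * roll_off / duration)
window_factor = np.mean(w ** 2)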
$$ \log L = \log Z_n + \kappa^2 - \frac{1}{2} \rho_{\mathrm{opt}}^2 $$
From Dingo, for a particular set of injection parameters, we find $\kappa^2 = 12.4$ and $-\frac{1}{2} \rho_{\mathrm{opt}}^2 = -316.9$. I am now cross-checking the other terms in the log likelihood. That being said, when I now compute the log likelihood with Bilby I get -7338.56, which is different from what's in the file (-6965) and also different from Dingo (-7327). But at least we can compare to Dingo likelihoods.
To recap, we are finding a difference in log likelihood of ~12.
Going through with a debugger more, I find that Dingo also agrees with the value of $\rho_{\mathrm{opt}}^2$. This means both Dingo and Bilby are reading the same data, since we have confirmed that $\langle d, d \rangle$ and $\langle \mu, \mu \rangle$ are the same. However, they disagree on $\kappa^2$.
(dingo $\kappa^2$, bilby $\kappa^2$) were (12.001, 0.326)
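Note that this by itself accounts for the ~12 offset above: since $\log Z_n$ and $\rho_{\mathrm{opt}}^2$ agree between the codes, the remaining difference in log likelihood should just be
$$ \Delta \log L = \kappa^2_{\mathrm{Dingo}} - \kappa^2_{\mathrm{Bilby}} = 12.001 - 0.326 \approx 11.7 . $$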
Looking into this more, the discrepancy is caused by the following difference between the codes. In line 395 of https://github.com/lscsoft/bilby/blob/master/bilby/gw/likelihood.py, Bilby sums $\langle d, h \rangle$ across the interferometers. In the code we have something like $(-7.6+2.0j) + (7.9+4.8j)$, so when later taking the real part Bilby gets 0.32. Dingo, however, gets $-28.0 + 40.02j$ when executing this sum.
This will take a bit to untangle, but I'm working on it...
Good job. Note that $\langle d, d \rangle$ and $\langle \mu, \mu \rangle$ are insensitive to phase/time shifts. This will only show up in $\kappa^2 = \langle d, \mu \rangle$. It is thus possible that we have an incorrect time/phase shift in either the data or the computed waveform.
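A toy illustration of that point (not Dingo or Bilby code, just the generic inner product with white-noise placeholders): applying a time shift $e^{-2\pi i f \Delta t}$ to the waveform leaves $\langle \mu, \mu \rangle$ untouched but changes $\langle d, \mu \rangle$.

import numpy as np

rng = np.random.default_rng(0)
n, delta_f = 512, 0.25
f = np.arange(n) * delta_f
psd = np.ones(n)                                      # flat toy PSD
d = rng.normal(size=n) + 1j * rng.normal(size=n)      # toy frequency-domain data
mu = rng.normal(size=n) + 1j * rng.normal(size=n)     # toy frequency-domain waveform

def inner(a, b):
    # noise-weighted inner product <a, b> = 4 * delta_f * Re sum(conj(a) * b / psd)
    return 4.0 * delta_f * np.real(np.sum(np.conj(a) * b / psd))

dt = 0.01                                             # hypothetical time-shift error
mu_shifted = mu * np.exp(-2j * np.pi * f * dt)

print(inner(mu, mu), inner(mu_shifted, mu_shifted))   # identical: |mu| is unchanged
print(inner(d, mu), inner(d, mu_shifted))             # different: kappa^2 is affected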
Fixed with https://github.com/dingo-gw/dingo/pull/180. Finally!
We observe small deviations between Dingo and published bilby results. This happens even when the Dingo-IS sample efficiency is large. One possible explanation is that we use a different likelihood (e.g., missing a window factor somewhere). We should therefore check that the Dingo likelihood agrees with bilby likelihoods.
Ways to track this down:
- Run bilby_pipe for an event for which we have Dingo samples. Step through the likelihood calculation with the PyCharm debugger (or a similar tool). Check (a) that the data are identical to what Dingo uses (e.g., make sure the PSD is not off by a window factor; the conventions are not always consistent) and (b) that the computed waveforms are identical to Dingo's. A rough sketch of a stripped-down single-point comparison is below.
- There is also this script, https://github.com/dingo-gw/dingo-devel/blob/main/misc_scripts/run_bilby.py, to run Dingo from a yaml file.
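Not bilby_pipe itself, but a minimal sketch of evaluating the Bilby likelihood at a single parameter point so it can be compared with the Dingo value for that same point. All settings and parameter values below are illustrative placeholders (GW150914-like numbers), not those of the actual analysis:

import bilby

# Illustrative analysis settings; must be matched to the actual run.
duration, sampling_frequency, start_time = 8.0, 2048.0, 1126259460.4

# Detectors with simulated Gaussian noise from the default design PSDs
# (for a real comparison, load the same data and PSDs that Dingo uses).
ifos = bilby.gw.detector.InterferometerList(["H1", "L1"])
ifos.set_strain_data_from_power_spectral_densities(
    sampling_frequency=sampling_frequency, duration=duration, start_time=start_time
)

waveform_generator = bilby.gw.WaveformGenerator(
    duration=duration,
    sampling_frequency=sampling_frequency,
    frequency_domain_source_model=bilby.gw.source.lal_binary_black_hole,
    waveform_arguments=dict(waveform_approximant="IMRPhenomXPHM", reference_frequency=20.0),
)

likelihood = bilby.gw.likelihood.GravitationalWaveTransient(
    interferometers=ifos, waveform_generator=waveform_generator
)

# One parameter point (placeholder values); in practice take a row from the
# Dingo or pBilby posterior samples.
likelihood.parameters = dict(
    mass_1=36.0, mass_2=29.0, a_1=0.3, a_2=0.2, tilt_1=0.5, tilt_2=1.0,
    phi_12=1.7, phi_jl=0.3, theta_jn=0.4, phase=1.3, psi=2.659,
    ra=1.375, dec=-1.2108, luminosity_distance=410.0, geocent_time=1126259462.4,
)
print(likelihood.noise_log_likelihood())
print(likelihood.log_likelihood_ratio())
print(likelihood.log_likelihood())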