Lnaden opened this issue 1 year ago
A script to reproduce the conditions of both bugs is here:

```python
import numpy as np
from pymbar import FES

u_kn = np.zeros((3, 3))
N_k = np.ones(3)
bin_edge = np.linspace(1, 3, 10)
x_n = np.linspace(0, 3, 3) + 0.1  # The top and bottom samples should not be assigned bins; the upper one is.
u_n = np.ones(3)
histogram_parameters = dict()
histogram_parameters["bin_edges"] = bin_edge
fes = FES(u_kn, N_k)
fes.generate_fes(u_n, x_n, fes_type="histogram", histogram_parameters=histogram_parameters)
bin_centers = bin_edge + (bin_edge[1] - bin_edge[0]) / 2
results = fes.get_fes(
    bin_centers,
    reference_point="from-specified",
    fes_reference=70,  # This should throw an error, it doesn't.
    uncertainty_method="analytical",
)
```
This comment is for @mrshirts and @jchodera, with relation to the implementation and theory in `fes.py`. I managed to accidentally recreate the bug in #502 and have attached a minimal script to throw the error. However, in investigating what happened, I found 2 different bugs in the logic, and I want to discuss what the "correct" way to handle each of them is.
Bug 1: Empty Bins
#502 seems to throw an error when a bin does not have a sample. In theory, the logic here is supposed to ensure every bin has a sample, but because of the logic slightly above it that assigns `bin_labels`, it will only ever check bins with samples, so the sanity check can never trigger, nor should it for bootstrap samples. This will not manifest until the user gets the histogram here, because it is iterating over the N-D grid of bins generated here.

I'm not sure what the correct approach is. As suggested in #502, we could just flag any bin without samples at that line and set a `np.nan`, as we do right below when checking the bin label, but I don't know if that's technically or scientifically correct, since we can estimate bins without samples using MBAR (although for an FES histogram bin, it might make sense to `nan` the bin because it doesn't have samples; a sketch of that option follows). Thoughts?
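This is a minimal sketch of the `np.nan` option, assuming made-up per-bin arrays (`sample_counts`, `fes_estimate`) rather than the actual `fes.py` internals:

```python
import numpy as np

# Hypothetical per-bin arrays; the names are illustrative, not fes.py internals.
sample_counts = np.array([4, 0, 7, 2])         # samples that landed in each bin
fes_estimate = np.array([0.3, 0.9, 0.1, 0.5])  # FES value estimated for each bin

# The option suggested in #502: mark bins with no samples as NaN instead of erroring.
fes_estimate[sample_counts == 0] = np.nan      # -> [0.3, nan, 0.1, 0.5]
```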
Bug 2: Samples > Bin upper bound are assigned to the maximum bin
The logic used in calls to `np.digitize` assigns samples greater than the upper bound of a dimension to the maximum bin instead of omitting them. This leads to a bias in the bins closest to the maximum in any dimension. This does NOT throw an error though, even though it is scientifically wrong.

The main function in question is `np.digitize`, which does bin assignment for a sequence of samples and bin edges. The two problematic lines in question are this line and this line, where a value is measured, checked against bins, and then its `digitize` index is shifted down one to say that "sample X is assigned to bin index Y." That is fine for any sample less than the max bin edge for a given dimension `d`. The logic of `digitize` is `X < bins.min() --> 0` and `X >= bins.max() --> len(bins)`. This means out of bounds on the low end is `0`, which downshifted is index `-1`, which is checked against in various places in the code. However, out of bounds on the high end is downshifted to `len(bins) - 1`, which is a valid index for `bins`
and does NOT throw an error, so out-of-bounds samples are silently assigned to a bin.

My proposed fix for this with minimal effort is to run logic along the following lines: after the downshift, flag any sample at or above the top edge as out of bounds, mirroring the `-1` convention already used on the low end (see the sketch below).
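To make the boundary behavior concrete, here is a minimal standalone sketch (not the actual `fes.py` code) using the same edges as the reproduction script above, ending with the kind of high-end flag I have in mind:

```python
import numpy as np

bin_edges = np.linspace(1, 3, 10)  # 10 edges -> 9 bins spanning [1, 3]
x_n = np.array([0.1, 1.6, 3.1])    # below, inside, and above the edge range

raw = np.digitize(x_n, bin_edges)  # [0, 3, 10]: X < bins.min() -> 0, X >= bins.max() -> len(bins)
bin_idx = raw - 1                  # [-1, 2, 9]

# The -1 on the low end is caught elsewhere, but 9 == len(bin_edges) - 1 looks
# like an ordinary bin index, so the sample at 3.1 silently lands in the top bin.
# Proposed flag: treat high-end overflow the same way as the low end.
bin_idx[x_n >= bin_edges[-1]] = -1
print(bin_idx)                     # [-1, 2, -1]: only the in-range sample keeps a bin
```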
The other two `digitize` calls in the code are captured here and also suffer from this problem, because they allow out-of-bounds samples to be assigned to a bin. They can be fixed in similar ways (e.g., by applying the same high-end flag in each dimension).

Questions: Am I correct in identifying this as a bug? Is the proposed solution a good fix for it?
Also cc'ing @mikemhenry since he wanted to be tagged in this.