This comment is for @mrshirts and @jchodera with relation to the implementation and theory in fes.py for samples

I managed to accidentally recreate the bug in #502 and have attached a minimal script to throw the error. However, in investigating what happened, I found 2 different bugs in the logic I wanted to discuss what the "correct" way to handle is.

Bug 1: Empty Bins

502 seems to throw an error when a bin in the bin does not have a sample. In theory, the logic here is supposed to ensure every bin has a sample but, because of the logic slightly above it to assign `bin_labels` it will only ever check bins with samples, so the sanity logic can never trigger, nor should it for bootstrap samples. This will not manifest until the user gets the histogram here because its iterating over the N-D grid of bins generated here

I'm not sure what the correct approach is here, as suggested in #502, we could just flag any bin without samples at the line and set a np.nan as we do right below checking the bin label, but I don't know if thats technically or scientifically correct here since we can estimate bins without samples using MBAR (although in an FES histogram bin, it might make sense to nan the bin because it doesn't have samples.

Thoughts?

Bug 2: Samples > Bin upper bound are assigned to the maximum bin

The logic used in calls to np.digitize are assigning samples greater than the upper bound of a dimension to the maximum bin instead of omitting it. This leads to a bias on the bins closest to the maximum in any dimension. This does NOT throw an error through, even though its scientifically wrong.

The main function in question is np.digitize which does bin assignment for a sequence of samples and bin edges.

The two problematic lines in question are this line

bin_n[:, d] = np.digitize(x_n[:, d], bins[d]) - 1

and this line

fes_ref_grid[d] = np.digitize(fes_reference[d], bins[d]) - 1

Where a value is measured, checked against bins, and then its digitize index is shifted down one to say that "sample X is assigned to bin index Y." Which is fine for any sample less than the max bin edge for given dimension d. The logic of digitize is X < bins.min() --> 0 and X >= bins.max() --> len(bins). This means out of bounds on the low end is 0 which downshifted is index -1 that is checked against in various places in the code. However out of bounds on the high end is downshifted to len(bens) -1 which is a valid index for bins and does NOT throw an error, assigning out of bounds samples to a bin.

My proposed fix for this with minimal effort is to run the following logic:

bin_assignment =  np.digitize(x_n[:, d], bins[d])
bin_assignments[np.where(bin_assignments == len(bins[d]))] = 0  # Shift all out-of-upper-bounds bins to invalid
bin_n[:, d] = bin_assignments - 1  # reassign to bin index

and

fes_ref_bin =  np.digitize(fes_reference[d], bins[d]) 
fes_ref_bin[np.where(fes_ref_bin == len(bins[d]))] = 0  # Shift all out-of-upper-bounds bins to invalid
fes_ref_grid[d] = fes_ref_bin - 1  # reassign to bin index

The other two digitize calls in the code are captured here and also suffer this problem because they are allowing assignment of things out of bounds to the bin. They can be fixed in similar ways.

Questions: Am I correct in identifying this is a bug? Is the proposed solution a good fix for it?

Also cc'ing @mikemhenry since he wanted to be tagged in this.

choderalab / pymbar

FES incorrectly assigns samples larger than bin bounds, also does not handle empty bins correctly #511

Bug 1: Empty Bins

Bug 2: Samples > Bin upper bound are assigned to the maximum bin