jaanisfehling opened 4 weeks ago
It happens on this dataset: https://archive.ics.uci.edu/dataset/713/auction+verification. I changed the code in `emission/categorical.py` at line 224:
```python
def sample(self, class_no, n_samples):
    pis = self.parameters["pis"].T
    n_features = self.get_n_features()
    feature_weights = pis[:, class_no].reshape(
        n_features, self.parameters["max_n_outcomes"]
    )
    if np.isnan(feature_weights).any() or (feature_weights < 0).any() or (feature_weights > 1).any():
        print(feature_weights)
        raise ValueError("Probabilities must be non-negative and not NaN.")
    X = np.array(
        [
            self.random_state.multinomial(1, feature_weights[k], size=n_samples)
            for k in range(n_features)
        ]
    )
    X = np.argmax(X, axis=2)  # Convert to integers
    return X.T
```
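As a workaround sketch (not the library's actual code): one can clip tiny negative entries and renormalize each feature's row before sampling, since numpy's multinomial rejects probability vectors that contain NaNs, negatives, or entries whose sum drifts above 1 through floating-point error. The helper name `safe_multinomial` is hypothetical.

```python
import numpy as np

def safe_multinomial(rng, feature_weights, n_samples):
    """Hypothetical workaround: sample one-hot outcomes per feature after
    renormalizing probabilities.

    Rows produced by EM can contain tiny negative residues or sum to
    slightly more than 1.0 due to floating-point error, which numpy's
    multinomial rejects. Clipping and renormalizing avoids that.
    Works with both np.random.RandomState and np.random.Generator.
    """
    w = np.clip(feature_weights, 0.0, None)   # drop tiny negative residues
    w = w / w.sum(axis=1, keepdims=True)      # force each row to sum to 1
    return np.array(
        [rng.multinomial(1, w[k], size=n_samples) for k in range(w.shape[0])]
    )
```

Note this only papers over near-zero rounding noise; if the parameters contain genuine NaNs from a diverged fit, renormalization will not help.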
I printed the feature weights, and it looks like some feature (a different one each time) has the probability of the first outcome at 1, while the other outcomes have probabilities not equal to 0.
Example `feature_weights`:
```
[[3.54571654e-01 3.22714173e-01 3.22714173e-01 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [5.72262762e-03 2.19909888e-01 3.87183742e-01 3.87183742e-01
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [8.02896055e-02 6.15372367e-02 6.57794359e-02 6.57794359e-02
  6.57794359e-02 6.57794359e-02 6.57794359e-02 6.57794359e-02
  6.57794359e-02 6.57794359e-02 5.01022066e-02 1.31967595e-02
  2.42142443e-02 2.42142443e-02 2.42142443e-02 2.42142443e-02
  2.42142443e-02 2.42142443e-02 2.42142443e-02 2.42142443e-02
  2.42142443e-02 5.76776316e-16 5.63450065e-03 5.63450065e-03
  5.63450065e-03 5.63450065e-03 5.63450065e-03 5.63450065e-03
  5.63450065e-03 5.63450065e-03 5.63450065e-03 5.16712205e-16]
 [3.00957067e-01 2.94573371e-01 3.88412944e-10 2.01455888e-01
  5.38628585e-16 2.03013674e-01 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.00000000e+00 2.77882835e-17 2.44487462e-16 2.51879007e-16
  1.62340135e-16 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]
```
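One way to check this observation numerically (a diagnostic sketch, not part of the library): the last row printed above has its first outcome at exactly 1.0 plus tiny positive residues, so its exact total exceeds 1. Whether that alone trips numpy's validation depends on the numpy version and its tolerance.

```python
import math
import numpy as np

# First few entries of the last row of the feature_weights dump above.
row = np.array([1.00000000e+00, 2.77882835e-17, 2.44487462e-16,
                2.51879007e-16, 1.62340135e-16])

# math.fsum computes the correctly rounded exact sum, which lands
# just above 1.0 because of the residues after the leading 1.0.
print(math.fsum(row) > 1.0)  # True
```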
Thanks for reporting this.
The warnings suggest that the estimation is diverging and likely introducing NaNs in the parameters, which triggers errors in the sampling function. One way to address this is to rely on the simpler `gaussian_unit` model, which tends to be more stable.
I also see that you are using the `blrt_sweep` function. Can you reproduce the error outside of it with a single estimator?
Regarding your second post on the sum of probabilities: are you sure this is triggering the error? This is likely the result of numerical calculations: the probabilities won't always perfectly sum to 1. For example, I tested the numpy multinomial sampling with the distribution `[1., 1e-16, 1e-16]` and it works fine on my end.
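For reference, that check can be reproduced directly. With `[1., 1e-16, 1e-16]` the trailing terms are below half a ULP of 1.0, so the floating-point sum is exactly 1.0 and sampling succeeds (sketch assuming a legacy `RandomState`, matching the `self.random_state.multinomial` call in the snippet above; the seed is arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
# 1e-16 is smaller than half a ULP of 1.0 (~1.1e-16), so 1.0 + 1e-16
# rounds back to exactly 1.0 and numpy's sum check passes.
sample = rng.multinomial(1, [1.0, 1e-16, 1e-16])
print(sample)  # the first outcome is drawn, since its probability is 1.0
```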
Thanks for the help. I will try different models, but in my testing `gaussian_spherical` performed best (maybe because it is the most robust to outliers).
I could not get `blrt_sweep` to work, unfortunately. Outside of that, my code worked most of the time; sometimes it did not converge and gave really bad results, which I fixed by falling back to k-means clustering in those cases.
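The fallback described above can be sketched roughly as follows. Everything here is hypothetical: `fit_mixture` stands in for the mixture-model fit, non-convergence is detected simply by catching the `ValueError`, and a tiny numpy-only k-means keeps the sketch self-contained (in practice one would likely use `sklearn.cluster.KMeans`).

```python
import numpy as np

def simple_kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's-algorithm k-means, for illustration only."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = ((X[:, None, :] - centers[None]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def cluster_with_fallback(fit_mixture, X, n_clusters):
    """Try the mixture model first; fall back to k-means on failure."""
    try:
        labels = fit_mixture(X)
        if labels is not None:
            return labels
    except ValueError:
        pass  # e.g. the "pvals" error raised when the fit diverges
    return simple_kmeans(X, n_clusters)
```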
I am getting

```
ValueError: pvals < 0, pvals > 1 or pvals contains NaNs
```

while running this code. Full Traceback:
So apparently it's related to the numpy random `multinomial` function.
I also get these runtime warnings: