Closed by pvanhuisstede 3 years ago
Hi Peter,
thanks for your report. Are you running the notebooks locally or on Colab?
tfidf_text_vectors is a sparse matrix created by scikit-learn and must not contain any NaN.
What happens if you restart the notebook and run all cells?
Thanks and regards Christian
Hi Christian,
I was working in Spyder, just following along with the examples. I just ran the chapter 8 notebook on my computer and I get the same error in cell 11 of the notebook, line 4 where nmf_text_model.fit_transform(tfidf_text_vectors) is called. It is quite a stacktrace with the last line: ValueError: array must not contain infs or NaNs.
From what I read I suspect it is an 'infs' problem, but I find it difficult to pinpoint the issue: How does one inspect such a sparse matrix?
ValueError                                Traceback (most recent call last)
~/Documents/code/python/blueprints-text/ch08/setup.py in

~/anaconda3/lib/python3.8/site-packages/sklearn/decomposition/_nmf.py in fit_transform(self, X, y, W, H)
   1309
   1310 with config_context(assume_finite=True):
-> 1311 W, H, niter = non_negative_factorization(
   1312 X=X, W=W, H=H, n_components=self.n_components, init=self.init,
   1313 update_H=True, solver=self.solver, beta_loss=self.beta_loss,

~/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61 extra_args = len(args) - len(all_args)
     62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
     64
     65 # extra_args > 0

~/anaconda3/lib/python3.8/site-packages/sklearn/decomposition/_nmf.py in non_negative_factorization(X, W, H, n_components, init, update_H, solver, beta_loss, tol, max_iter, alpha, l1_ratio, regularization, random_state, verbose, shuffle)
   1064 W = np.zeros((n_samples, n_components), dtype=X.dtype)
   1065 else:
-> 1066 W, H = _initialize_nmf(X, n_components, init=init,
   1067 random_state=random_state)
   1068

~/anaconda3/lib/python3.8/site-packages/sklearn/decomposition/_nmf.py in _initialize_nmf(X, n_components, init, eps, random_state)
    344
    345 # NNDSVD initialization
--> 346 U, S, V = randomized_svd(X, n_components, random_state=random_state)
    347 W = np.zeros_like(U)
    348 H = np.zeros_like(V)

~/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61 extra_args = len(args) - len(all_args)
     62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
     64
     65 # extra_args > 0

~/anaconda3/lib/python3.8/site-packages/sklearn/utils/extmath.py in randomized_svd(M, n_components, n_oversamples, n_iter, power_iteration_normalizer, transpose, flip_sign, random_state)
    355
    356 # compute the SVD on the thin matrix: (k + p) wide
--> 357 Uhat, s, Vt = linalg.svd(B, full_matrices=False)
    358
    359 del B

~/anaconda3/lib/python3.8/site-packages/scipy/linalg/decomp_svd.py in svd(a, full_matrices, compute_uv, overwrite_a, check_finite, lapack_driver)
    104
    105 """
--> 106 a1 = _asarray_validated(a, check_finite=check_finite)
    107 if len(a1.shape) != 2:
    108 raise ValueError('expected matrix')

~/anaconda3/lib/python3.8/site-packages/scipy/_lib/_util.py in _asarray_validated(a, check_finite, sparse_ok, objects_ok, mask_ok, as_inexact)
    260 raise ValueError('masked arrays are not supported')
    261 toarray = np.asarray_chkfinite if check_finite else np.asarray
--> 262 a = toarray(a)
    263 if not objects_ok:
    264 if a.dtype is np.dtype('O'):

~/anaconda3/lib/python3.8/site-packages/numpy/lib/function_base.py in asarray_chkfinite(a, dtype, order)
    486 a = asarray(a, dtype=dtype, order=order)
    487 if a.dtype.char in typecodes['AllFloat'] and not np.isfinite(a).all():
--> 488 raise ValueError(
    489 "array must not contain infs or NaNs")
    490 return a

ValueError: array must not contain infs or NaNs
Hi Peter,
thanks for your reply. Impressive stacktrace!
The size of tfidf_text_vectors is given by its shape, which should be (7507, 24611). If the dimensions are very different, something is wrong with the data.
You can check the array values with tfidf_text_vectors.data or directly check for NaN:
import numpy as np
np.isnan(tfidf_text_vectors.data).any()
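Since the error message mentions infs as well as NaNs, a slightly broader check may be useful. The following is only a minimal sketch, assuming the matrix is a SciPy sparse matrix such as the one scikit-learn's TfidfVectorizer returns. The helper name summarize_sparse and the small demo matrix are made up for illustration and are not part of the notebook.

```python
import numpy as np
from scipy import sparse


def summarize_sparse(X):
    """Report basic health facts about the stored values of a sparse matrix."""
    data = X.data  # only the explicitly stored entries, not the implicit zeros
    return {
        "shape": X.shape,
        "dtype": str(X.dtype),
        "stored_values": int(X.nnz),
        "has_nan": bool(np.isnan(data).any()),
        "has_inf": bool(np.isinf(data).any()),
        "has_negative": bool((data < 0).any()),  # NMF expects non-negative input
        "empty_rows": int((X.getnnz(axis=1) == 0).sum()),
    }


# Hypothetical stand-in for tfidf_text_vectors, with one inf and one empty row
demo = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0],
                                   [0.0, 0.0, 0.0],
                                   [0.2, 0.0, np.inf]]))
report = summarize_sparse(demo)
print(report)
```

Calling summarize_sparse(tfidf_text_vectors) in the notebook would cover NaN, inf, and negative values in one pass. If everything comes back clean, the input matrix itself is probably not the source of the error.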
Hope this helps in isolating the problem.
Regards Christian
Hi Christian,
that part seems ok:
print(tfidf_text_vectors.shape) => (7505, 24730)
print(tfidf_text_vectors.data) => [0.02111632 0.04041201 0.02162275 ... 0.0111272 0.01603944 0.03277555]
print(np.isnan(tfidf_text_vectors.data).any()) => False
My guess: Doing the matrix factorization we generate values that are infs?
Best,
Peter
Hi Peter,
thanks for trying that!
That is really strange. I think I have created thousands of topic models with NMF and have never had this problem a single time. I have run the notebook on my local installation and it works flawlessly. Maybe there is something wrong with your Anaconda installation and its dependencies?
Could you try on Google Colab? As another alternative, try a fresh Anaconda install (create a different user on macOS to keep your current installation).
Sorry that my hints are so generic, but it does not look like a problem which is directly related to the data or the code.
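One way to separate an environment problem from a data problem is to run NMF on a tiny synthetic matrix that is known to be finite and non-negative. This is a rough sketch under that assumption. The matrix size, the seed, and the smoke_test_ok name are arbitrary choices for illustration, and init="nndsvd" is picked only because the traceback above fails inside the NNDSVD initialization.

```python
import numpy as np
import scipy
import sklearn
from scipy import sparse
from sklearn.decomposition import NMF

# Versions are worth including in a bug report
print("numpy", np.__version__, "| scipy", scipy.__version__,
      "| scikit-learn", sklearn.__version__)

# Synthetic, finite, non-negative sparse matrix (made up for this check only)
rng = np.random.default_rng(42)
dense = rng.random((200, 50))
dense[dense < 0.9] = 0.0  # keep roughly 10% of the entries
X = sparse.csr_matrix(dense)

# Same code path as in the traceback: NNDSVD init calls randomized_svd
model = NMF(n_components=5, init="nndsvd", random_state=42, max_iter=500)
W = model.fit_transform(X)

smoke_test_ok = bool(np.isfinite(W).all() and np.isfinite(model.components_).all())
print("NMF smoke test passed:", smoke_test_ok)
```

If this already raises the same ValueError, the installed numerical libraries are the more likely suspect. If it passes, that does not prove the environment is fine, but it makes the data worth a closer look.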
Regards Christian
Ah, and maybe you can try the paragraph model. It would be interesting to see whether that works for you or whether you get the same error.
I just got it working with a small sample of the data, so the error must be in the data. I will try to track it down. Sorry to have bothered you with this. Keep you posted.
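If a small sample works and the full matrix does not, bisecting the rows is one possible way to narrow things down. The sketch below is an assumption about how that could be done, not something taken from the notebook: find_failing_rows and the fails predicate are hypothetical names, and the demo uses a simple finiteness check in place of an actual NMF run.

```python
import numpy as np
from scipy import sparse


def find_failing_rows(X, fails, min_rows=1):
    """Bisect the rows of X down to a small block for which fails(block) is True.

    Returns a (start, stop) row range, or None if the full matrix does not fail.
    """
    lo, hi = 0, X.shape[0]
    if not fails(X[lo:hi]):
        return None
    while hi - lo > min_rows:
        mid = (lo + hi) // 2
        if fails(X[lo:mid]):
            hi = mid
        elif fails(X[mid:hi]):
            lo = mid
        else:
            break  # the failure only appears when both halves are combined
    return lo, hi


# Demo with a made-up matrix that has a single inf in row 5
dense = np.full((8, 3), 0.1)
dense[5, 1] = np.inf
X_demo = sparse.csr_matrix(dense)

bad_range = find_failing_rows(X_demo, lambda block: not np.isfinite(block.data).all())
print(bad_range)
```

For the real case, fails could wrap nmf_text_model.fit_transform(block) in a try/except ValueError. If the search stops early with a large range, the failure does not depend on any single row, which again points away from the data.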
Best,
Peter
Hi Peter,
this is actually a very good find and thanks again for bringing that up with all the details!
If it's related to the data, many other people will have similar problems eventually. I would be very glad if we can resolve this together.
Thanks for your help and effort!
Regards Christian
Dear Christian,
I got the example running in the end, but on running the code again, I got the NaNs and infs error again. This morning I re-installed Anaconda and can now run the example repeatedly. So it had nothing to do with the data, I guess, but with my previous Anaconda setup.
When I try to run the following line of code:
W_text_matrix = nmf_text_model.fit_transform(tfidf_text_vectors)
I get the following error: ValueError: array must not contain infs or NaNs
As far as I can see there aren't any.