blueprints-for-text-analytics-python / blueprints-text

Jupyter notebooks for our O'Reilly book "Blueprints for Text Analysis Using Python"
Apache License 2.0

Chapter 8 Part: Creating a topic model using NMF for documents #4

Closed pvanhuisstede closed 3 years ago

pvanhuisstede commented 3 years ago

When I try to run the following line of code:

W_text_matrix = nmf_text_model.fit_transform(tfidf_text_vectors)

I get the following error: ValueError: array must not contain infs or NaNs

As far as I can see there aren't any.

datanizing commented 3 years ago

Hi Peter,

thanks for your report. Are you running the notebooks locally or on Colab?

tfidf_text_vectors is a sparse matrix created by scikit-learn and should not contain any NaN values.

What happens if you restart the notebook and run all cells?

Thanks and regards Christian

pvanhuisstede commented 3 years ago

Hi Christian,

I was working in Spyder, just following along with the examples. I just ran the chapter 8 notebook on my computer and I get the same error in cell 11 of the notebook, line 4 where nmf_text_model.fit_transform(tfidf_text_vectors) is called. It is quite a stacktrace with the last line: ValueError: array must not contain infs or NaNs.

From what I read I suspect it is an 'infs' problem, but I find it difficult to pinpoint the issue: How does one inspect such a sparse matrix?

Below the complete stacktrace from the notebook:

ValueError                                Traceback (most recent call last)
~/Documents/code/python/blueprints-text/ch08/setup.py in <module>
      2 
      3 nmf_text_model = NMF(n_components=10, random_state=42)
----> 4 W_text_matrix = nmf_text_model.fit_transform(tfidf_text_vectors)
      5 H_text_matrix = nmf_text_model.components_

~/anaconda3/lib/python3.8/site-packages/sklearn/decomposition/_nmf.py in fit_transform(self, X, y, W, H)
   1309 
   1310         with config_context(assume_finite=True):
-> 1311             W, H, n_iter_ = non_negative_factorization(
   1312                 X=X, W=W, H=H, n_components=self.n_components, init=self.init,
   1313                 update_H=True, solver=self.solver, beta_loss=self.beta_loss,

~/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

~/anaconda3/lib/python3.8/site-packages/sklearn/decomposition/_nmf.py in non_negative_factorization(X, W, H, n_components, init, update_H, solver, beta_loss, tol, max_iter, alpha, l1_ratio, regularization, random_state, verbose, shuffle)
   1064         W = np.zeros((n_samples, n_components), dtype=X.dtype)
   1065     else:
-> 1066         W, H = _initialize_nmf(X, n_components, init=init,
   1067                                random_state=random_state)
   1068 

~/anaconda3/lib/python3.8/site-packages/sklearn/decomposition/_nmf.py in _initialize_nmf(X, n_components, init, eps, random_state)
    344 
    345     # NNDSVD initialization
--> 346     U, S, V = randomized_svd(X, n_components, random_state=random_state)
    347     W = np.zeros_like(U)
    348     H = np.zeros_like(V)

~/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

~/anaconda3/lib/python3.8/site-packages/sklearn/utils/extmath.py in randomized_svd(M, n_components, n_oversamples, n_iter, power_iteration_normalizer, transpose, flip_sign, random_state)
    355 
    356     # compute the SVD on the thin matrix: (k + p) wide
--> 357     Uhat, s, Vt = linalg.svd(B, full_matrices=False)
    358 
    359     del B

~/anaconda3/lib/python3.8/site-packages/scipy/linalg/decomp_svd.py in svd(a, full_matrices, compute_uv, overwrite_a, check_finite, lapack_driver)
    104 
    105     """
--> 106     a1 = _asarray_validated(a, check_finite=check_finite)
    107     if len(a1.shape) != 2:
    108         raise ValueError('expected matrix')

~/anaconda3/lib/python3.8/site-packages/scipy/_lib/_util.py in _asarray_validated(a, check_finite, sparse_ok, objects_ok, mask_ok, as_inexact)
    260             raise ValueError('masked arrays are not supported')
    261     toarray = np.asarray_chkfinite if check_finite else np.asarray
--> 262     a = toarray(a)
    263     if not objects_ok:
    264         if a.dtype is np.dtype('O'):

~/anaconda3/lib/python3.8/site-packages/numpy/lib/function_base.py in asarray_chkfinite(a, dtype, order)
    486     a = asarray(a, dtype=dtype, order=order)
    487     if a.dtype.char in typecodes['AllFloat'] and not np.isfinite(a).all():
--> 488         raise ValueError(
    489             "array must not contain infs or NaNs")
    490     return a

ValueError: array must not contain infs or NaNs

datanizing commented 3 years ago

Hi Peter,

thanks for your reply. Impressive stacktrace!

The size of tfidf_text_vectors is given by its shape, which should be (7507, 24611). If the dimensions are very different, something is wrong with the data.

You can check the array values with tfidf_text_vectors.data or directly check for NaN:

import numpy as np
np.isnan(tfidf_text_vectors.data).any()
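Since the error message mentions infs as well as NaNs, np.isfinite covers both cases in one pass. The following is only a minimal sketch on a small made-up matrix: the helper summarize_sparse and the matrix demo are hypothetical names, and in the notebook you would pass tfidf_text_vectors instead.

```python
import numpy as np
from scipy import sparse


def summarize_sparse(matrix):
    """Summarize the explicitly stored values of a SciPy sparse matrix."""
    matrix = sparse.csr_matrix(matrix)
    data = matrix.data  # implicit zeros are not part of .data
    return {
        "shape": matrix.shape,
        "stored": int(matrix.nnz),
        "all_finite": bool(np.isfinite(data).all()),
        "nan_count": int(np.isnan(data).sum()),
        "inf_count": int(np.isinf(data).sum()),
    }


# Hypothetical 2x3 example with one inf planted on purpose.
demo = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0], [np.inf, 0.0, 0.25]]))
report = summarize_sparse(demo)
print(report)
```

If all_finite comes back True for the real matrix, the stored input values themselves are not the source of the error.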

Hope this helps in isolating the problem.

Regards Christian

pvanhuisstede commented 3 years ago

Hi Christian,

that part seems ok:

print(tfidf_text_vectors.shape) => (7505, 24730)
print(tfidf_text_vectors.data) => [0.02111632 0.04041201 0.02162275 ... 0.0111272 0.01603944 0.03277555]
print(np.isnan(tfidf_text_vectors.data).any()) => False

My guess: Doing the matrix factorization we generate values that are infs?
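The stacktrace is at least consistent with this guess: the check fires inside scipy.linalg.svd on B, a thin matrix built inside randomized_svd, and not on tfidf_text_vectors itself. As a rough illustration of how that validation behaves, here is a minimal sketch with two made-up arrays (svd_accepts, finite and broken are hypothetical names, not part of the notebook):

```python
import numpy as np
from scipy import linalg


def svd_accepts(array):
    """Return True if scipy.linalg.svd accepts the array, False on ValueError."""
    try:
        # check_finite defaults to True, so non-finite input is rejected
        linalg.svd(array, full_matrices=False)
        return True
    except ValueError:
        return False


finite = np.array([[1.0, 2.0], [3.0, 4.0]])
broken = finite.copy()
broken[0, 0] = np.inf  # plant a single inf

print(svd_accepts(finite))  # True
print(svd_accepts(broken))  # False
```

So a clean input matrix does not rule out the error; it only says the non-finite values, if any, appear in an intermediate result.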

Best,

Peter

datanizing commented 3 years ago

Hi Peter,

thanks for trying that!

That is really strange. I think I have created thousands of topic models with NMF and have never had this problem a single time. I have run the notebook on my local installation and it works flawlessly. Maybe there is something wrong with your Anaconda installation and its dependencies?

Could you try on Google Colab? As another alternative, try a fresh Anaconda install (create a different user on macOS to keep your current installation).

Sorry that my hints are so generic, but it does not look like a problem which is directly related to the data or the code.

Regards Christian

datanizing commented 3 years ago

Ah, and maybe you can try the paragraph model. It would be interesting if that is working for you or if you get the same error.

pvanhuisstede commented 3 years ago

I just got it working with a small sample of the data, so the error must be in the data. I will try to track it down. Sorry to have bothered you with this. Keep you posted.

Best,

Peter

datanizing commented 3 years ago

Hi Peter,

this is actually a very good find and thanks again for bringing that up with all the details!

If it's related to the data, many other people will have similar problems eventually. I would be very glad if we can resolve this together.

Thanks for your help and effort!

Regards Christian

pvanhuisstede commented 3 years ago

Dear Christian,

got the example running in the end, but then on running the code again, got the nans & infs error again. This morning I re-installed Anaconda and got the example running repeatedly. So, nothing to do with the data I guess, but with my previous Anaconda setup.
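
For anyone who wants to compare a failing setup with a working one, recording the installed package versions is a cheap first step. This is just a minimal sketch using the standard library; the package list is an assumption about which dependencies matter here, and package_versions is a hypothetical helper name.

```python
import sys
from importlib import metadata


def package_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list of packages relevant to the NMF example.
PACKAGES = ["numpy", "scipy", "scikit-learn"]

print("python", sys.version.split()[0])
for name, version in package_versions(PACKAGES).items():
    print(name, version or "not installed")
```

Running this in both environments and diffing the output shows whether the two setups differ in any of these libraries.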