Teichlab / bin2cell

Join subcellular Visium HD bins into cells
MIT License

expand_labels ValueError: Input contains infinity or a value too large for dtype('float64') #19

Open t-a-m-i opened 3 days ago

t-a-m-i commented 3 days ago

Hi, I get ValueError: Input contains infinity or a value too large for dtype('float64') when trying to run the following on my data:

b2c.expand_labels(ydata, 
                  max_bin_distance=2, 
                  labels_key='labels_he', 
                  expanded_labels_key="labels_he_expanded"
                 )

Full traceback:

Cell In[8], line 1
----> 1 b2c.expand_labels(ydata,
      2                   max_bin_distance=2,
      3                   labels_key='labels_he',
      4                   expanded_labels_key="labels_he_expanded"
      5                  )

File ~/anaconda3/envs/bin2cell/lib/python3.12/site-packages/bin2cell/bin2cell.py:961, in expand_labels(adata, labels_key, expanded_labels_key, max_bin_distance, volume_ratio, k, subset_pca)
    957 smol = np.unique(np.concatenate([hits[ambiguous_mask,:].flatten(), ambiguous_query_inds]))
    958 #prepare a PCA as a representation of the GEX space for solving ties
    959 #can just run straight on an array to get a PCA matrix back. convenient!
    960 #keep the object's X raw for subsequent cell creation
--> 961 pca_smol = sc.pp.pca(np.log1p(adata.X[smol, :]))
    962 #mock up a "full-scale" PCA matrix to not have to worry about different indices
    963 pca = np.zeros((adata.shape[0], pca_smol.shape[1]))

File ~/anaconda3/envs/bin2cell/lib/python3.12/site-packages/scanpy/preprocessing/_pca.py:286, in pca(failed resolving arguments)
    282 from sklearn.decomposition import PCA
    284 svd_solver = _handle_sklearn_args(svd_solver, "PCA (with sparse input)")
--> 286 output = _pca_with_sparse(
    287     X, n_comps, solver=svd_solver, random_state=random_state
    288 )
    289 # this is just a wrapper for the results
    290 X_pca = output["X_pca"]

File ~/anaconda3/envs/bin2cell/lib/python3.12/site-packages/scanpy/preprocessing/_pca.py:415, in _pca_with_sparse(X, n_pcs, solver, mu, random_state)
    413 np.random.set_state(random_state.get_state())
    414 random_init = np.random.rand(np.min(X.shape))
--> 415 X = check_array(X, accept_sparse=["csr", "csc"])
    417 if mu is None:
    418     mu = np.asarray(X.mean(0)).flatten()[None, :]

File ~/anaconda3/envs/bin2cell/lib/python3.12/site-packages/sklearn/utils/validation.py:971, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    969 if sp.issparse(array):
    970     _ensure_no_complex_data(array)
--> 971     array = _ensure_sparse_format(
    972         array,
    973         accept_sparse=accept_sparse,
    974         dtype=dtype,
    975         copy=copy,
    976         force_all_finite=force_all_finite,
    977         accept_large_sparse=accept_large_sparse,
    978         estimator_name=estimator_name,
    979         input_name=input_name,
    980     )
    981 if ensure_2d and array.ndim < 2:
    982     raise ValueError(
    983         f"Expected 2D input, got input with shape {array.shape}.\n"
    984         "Reshape your data either using array.reshape(-1, 1) if "
    985         "your data has a single feature or array.reshape(1, -1) "
    986         "if it contains a single sample."
    987     )

File ~/anaconda3/envs/bin2cell/lib/python3.12/site-packages/sklearn/utils/validation.py:631, in _ensure_sparse_format(sparse_container, accept_sparse, dtype, copy, force_all_finite, accept_large_sparse, estimator_name, input_name)
    626     warnings.warn(
    627         f"Can't check {sparse_container.format} sparse matrix for nan or inf.",
    628         stacklevel=2,
    629     )
    630 else:
--> 631     _assert_all_finite(
    632         sparse_container.data,
    633         allow_nan=force_all_finite == "allow-nan",
    634         estimator_name=estimator_name,
    635         input_name=input_name,
    636     )
    638 # TODO: Remove when the minimum version of SciPy supported is 1.12
    639 # With SciPy sparse arrays, conversion from DIA format to COO, CSR, or BSR
    640 # triggers the use of np.int64 indices even if the data is such that it could
    (...)
    643 # algorithms support large indices, the following code downcasts to np.int32
    644 # indices when it's safe to do so.
    645 if changed_format:
    646     # accept_sparse is specified to a specific format and a conversion occurred

File ~/anaconda3/envs/bin2cell/lib/python3.12/site-packages/sklearn/utils/validation.py:123, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    120 if first_pass_isfinite:
    121     return
--> 123 _assert_all_finite_element_wise(
    124     X,
    125     xp=xp,
    126     allow_nan=allow_nan,
    127     msg_dtype=msg_dtype,
    128     estimator_name=estimator_name,
    129     input_name=input_name,
    130 )

File ~/anaconda3/envs/bin2cell/lib/python3.12/site-packages/sklearn/utils/validation.py:172, in _assert_all_finite_element_wise(X, xp, allow_nan, msg_dtype, estimator_name, input_name)
    155 if estimator_name and input_name == "X" and has_nan_error:
    156     # Improve the error message on how to handle missing values in
    157     # scikit-learn.
    158     msg_err += (
    159         f"\n{estimator_name} does not accept missing values"
    160         " encoded as NaN natively. For supervised learning, you might want"
    (...)
    170         "#estimators-that-handle-nan-values"
    171     )
--> 172 raise ValueError(msg_err)

ValueError: Input contains infinity or a value too large for dtype('float64').

I do have infinities in ydata.X, but that wasn't a problem in my other datasets. The exact same code worked perfectly fine on them, despite infinities also being present in their ydata.X. Wondering what could be the issue here.

Thanks in advance!

ktpolanski commented 3 days ago

Having infinities in your .X sounds fundamentally incorrect.

My best guess as to why your other samples did not explode is that the bins with the infinities weren't part of the conflict resolution, so they never made it to the PCA.

Not a bin2cell bug. Pursue with the scanpy devs, or better yet, don't have infinities in your .X.
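A quick way to audit .X for non-finite entries before running PCA-based steps - a generic numpy/scipy sketch, not a bin2cell utility (the helper name nonfinite_count is made up here):

```python
import numpy as np
from scipy import sparse

def nonfinite_count(X):
    """Count inf/NaN entries in a dense array or scipy sparse matrix."""
    #sparse matrices only store explicit entries, so checking .data suffices
    data = X.data if sparse.issparse(X) else np.asarray(X)
    return int((~np.isfinite(data)).sum())
```

Usage: nonfinite_count(ydata.X) should be 0 before calling b2c.expand_labels().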

ktpolanski commented 3 days ago

Actually wait. How did you get the infinities into .X in the first place?

t-a-m-i commented 2 days ago

So I just checked, and they appear after b2c.destripe. However, this is only the case if sc.pp.filter_cells(ydata, min_counts=0) is called beforehand. I know one would usually filter with min_counts=1, but I have a special case and a reason why I don't.

ktpolanski commented 2 days ago

I did some experimenting with b2c.destripe() yesterday and was able to get it to produce infinities in certain situations after sc.pp.filter_cells(adata, min_counts=0). This is not a standard use case, as you yourself acknowledge - bins without expression are not conventionally useful. If I were to add acknowledgment of this to the code, I'd have destripe error out upon encountering 0s. I don't think you want that.
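For intuition, here's a minimal numpy sketch of the failure mode - any depth-normalisation step that divides by a per-bin count silently emits inf once zero-count bins are kept. This is an illustrative assumption about the mechanism, not the actual destripe implementation:

```python
import numpy as np

#per-bin total counts; the middle bin only survives filtering with min_counts=0
n_counts = np.array([120.0, 0.0, 80.0])
#rescale every bin toward a common depth (illustrative, not destripe's exact maths)
target = np.quantile(n_counts, 0.99)
with np.errstate(divide="ignore"):
    scale = target / n_counts  #the zero-count bin yields inf, not an error
```

numpy turns x/0 into inf with only a RuntimeWarning (suppressed above), which is how non-finite values can slip into .X without anything raising.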

You could try the following as a workaround, assuming a freshly loaded object after sc.pp.filter_cells(adata, min_counts=0):

#run destripe on a copy restricted to bins that actually have counts
bdata = sc.pp.filter_cells(adata, min_counts=1, copy=True)
b2c.destripe(bdata)
#write the destriped values back into the matching bins of the full object
adata[bdata.obs_names].X = bdata.X
#zero-count bins keep an adjusted count of 0
adata.obs["n_counts_adjusted"] = 0.0
adata.obs.loc[bdata.obs_names, "n_counts_adjusted"] = bdata.obs["n_counts_adjusted"]