Closed ScreachingFire closed 1 year ago
Additional notes: I checked for NaNs and Infinities before calling cello.scanpy_cello() and could find none, so I am thinking that these values are popping up while the function is being called.
Hi, I am sorry you are running into this issue. Thank you for providing the traceback. It looks to me like the error is occurring inside the PCA function that is implemented by scikit-learn. Specifically, it looks to me like it is breaking due to a NaN value. Would it be possible to send me either the code you are using to normalize the data, or, if possible, the dataset that is producing these errors?
Would it be possible to send me either the code you are using to normalize the data, or, if possible, the dataset that is producing these errors?
Dataset: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE143704&format=file&file=GSE143704%5FDeMicheli%5FHumanMuscleAtlas%5Frawdata%2Etxt%2Egz Code:
adata = sc.read_text('/Users/name/Documents/Bio Data/Human/SkeletalMuscle/Dataset2(22kCell)/GSE143704_DeMicheli_HumanMuscleAtlas_rawdata.txt').transpose()
adata.var_names_make_unique()
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
sc.pp.filter_cells(adata, max_counts=5050)
sc.pp.filter_cells(adata, min_genes=510)
adata = adata[adata.obs.pct_counts_mt < 22, :]
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata, base=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5, n_bins=20, n_top_genes=7500)
adata = adata[:, adata.var.highly_variable]
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack', n_comps=60)
More notes: Prior to Transforming with PCA... I do get two more errors(other than the 'variables are not unique' which is not important), they don't stop the program so I assumed they may be insignificant but now I am thinking otherwise:
/p-scRNAseq/lib/python3.9/site-packages/cello/cello.py:459: RuntimeWarning: invalid value encountered in log
x_clust = np.log(x_clust+1)
/p-scRNAseq/lib/python3.9/site-packages/cello/cello.py:477: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass
AnnData(X, dtype=X.dtype, ...)to get the future behavour.
ad_mean_clust = AnnData(
adata.X.dtype prior to calling cello.scanpy_cello() is still float32 so I don't know why the second error is popping up.
Would it be possible to send me either the code you are using to normalize the data, or, if possible, the dataset that is producing these errors?
Dataset: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE143704&format=file&file=GSE143704%5FDeMicheli%5FHumanMuscleAtlas%5Frawdata%2Etxt%2Egz Code:
adata = sc.read_text('/Users/name/Documents/Bio Data/Human/SkeletalMuscle/Dataset2(22kCell)/GSE143704_DeMicheli_HumanMuscleAtlas_rawdata.txt').transpose() adata.var_names_make_unique() adata.var['mt'] = adata.var_names.str.startswith('MT-') sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True) sc.pp.filter_cells(adata, max_counts=5050) sc.pp.filter_cells(adata, min_genes=510) adata = adata[adata.obs.pct_counts_mt < 22, :] sc.pp.normalize_total(adata, target_sum=1e4) sc.pp.log1p(adata, base=10) sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5, n_bins=20, n_top_genes=7500) adata = adata[:, adata.var.highly_variable] sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt']) sc.pp.scale(adata, max_value=10) sc.tl.pca(adata, svd_solver='arpack', n_comps=60)
More notes: Prior to Transforming with PCA... I do get two more errors(other than the 'variables are not unique' which is not important), they don't stop the program so I assumed they may be insignificant but now I am thinking otherwise:
/p-scRNAseq/lib/python3.9/site-packages/cello/cello.py:459: RuntimeWarning: invalid value encountered in log
x_clust = np.log(x_clust+1)
/p-scRNAseq/lib/python3.9/site-packages/cello/cello.py:477: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass
AnnData(X, dtype=X.dtype, ...)to get the future behavour.
ad_mean_clust = AnnData(
adata.X.dtype prior to calling cello.scanpy_cello() is still float32 so I don't know why the second error is popping up.
I encountered a similar issue to this, and realised that the step on regressing out pct_counts_mt (sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
) was the culprit here as it would cause the already log1p-transformed values to be shifted away to more negative values, whereas the _aggregate_expression
helper function assumes that all values are the pre-regression counts (cello.py line 455: x_clust = np.squeeze(np.array(np.sum(X, axis=0)))
).
Skipping the regress_out step resolves this issue.
I'm not entirely sure how this can be addressed though - @mbernste is this issue already known?
Hello DeweyLab,
I am working with a dataset of ~22k cells and ~22k genes, and after generating a model with cello.scanpy_cello(), I keep getting an error when transforming with PCA:
Transforming with PCA... Traceback (most recent call last): File "/Users/name/Documents/Bio Scripts/Python/scRNAseq & RNAseq Analysis/Human/Skeletal Muscle/SkelMuscleAnalysis", line 178, in
cello.scanpy_cello(adata, clust_key='leiden', rsrc_loc=cello_resource_loc, model_file=f'{two_model_prefix}.model.dill')
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/scanpy_cello.py", line 146, in cello
results_df, finalized_binary_results_df, ms_results_df = ce.predict(
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/cello.py", line 232, in predict
results_df, cell_to_clust = _raw_probabilities(
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/cello.py", line 437, in _raw_probabilities
conf_df, score_df = mod.predict(expr, ad_clust.obs.index)
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/models/model.py", line 89, in predict
X = self._preprocess(X)
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/models/model.py", line 85, in _preprocess
X = prep.transform(X)
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/models/pca.py", line 40, in transform
X = self.model.transform(X)
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/sklearn/decomposition/_base.py", line 117, in transform
X = self._validate_data(X, dtype=[np.float64, np.float32], reset=False)
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/sklearn/base.py", line 566, in _validate_data
X = check_array(X, **check_params)
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/sklearn/utils/validation.py", line 800, in check_array
_assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/sklearn/utils/validation.py", line 114, in _assert_all_finite
raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
I can generate a model so I am happy works out for me, but even when I use the preexisting model with the model_file=f'{model_prefix}.model.dill' parameter, the same error occurs.
I've tried changing the number of highly variable genes I keep(ranging from the minimum 3000 to 10000) or even changing my number of clusters by changing n_neighbors, n_pcs, and/or resolution in sc.pp.neighbors() and sc.tl.leiden(), but varying those parameters didn't fix the error either.
The package works perfectly for me otherwise, it is just at this part do I start having errors. I haven't seen anyone else share this problem so I am hoping you know of a fix because I am at a dead end and troubleshooting by myself isn't working out so well.
Hope you're doing well, Todd