deweylab / CellO

CellO: Gene expression-based hierarchical cell type classification using the Cell Ontology
MIT License
61 stars 12 forks source link

Error when Transforming with PCA... #20

Closed ScreachingFire closed 1 year ago

ScreachingFire commented 2 years ago

Hello DeweyLab,

I am working with a dataset of ~22k cells and ~22k genes, and after generating a model with cello.scanpy_cello(), I keep getting an error when transforming with PCA:

Transforming with PCA... Traceback (most recent call last): File "/Users/name/Documents/Bio Scripts/Python/scRNAseq & RNAseq Analysis/Human/Skeletal Muscle/SkelMuscleAnalysis", line 178, in cello.scanpy_cello(adata, clust_key='leiden', rsrc_loc=cello_resource_loc, model_file=f'{two_model_prefix}.model.dill') File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/scanpy_cello.py", line 146, in cello results_df, finalized_binary_results_df, ms_results_df = ce.predict( File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/cello.py", line 232, in predict results_df, cell_to_clust = _raw_probabilities( File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/cello.py", line 437, in _raw_probabilities conf_df, score_df = mod.predict(expr, ad_clust.obs.index) File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/models/model.py", line 89, in predict X = self._preprocess(X) File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/models/model.py", line 85, in _preprocess X = prep.transform(X) File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/models/pca.py", line 40, in transform X = self.model.transform(X) File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/sklearn/decomposition/_base.py", line 117, in transform X = self._validate_data(X, dtype=[np.float64, np.float32], reset=False) File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/sklearn/base.py", line 566, in _validate_data X = check_array(X, **check_params) File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/sklearn/utils/validation.py", line 800, in check_array _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan") File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/sklearn/utils/validation.py", line 114, in _assert_all_finite raise ValueError( ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I can generate a model so I am happy works out for me, but even when I use the preexisting model with the model_file=f'{model_prefix}.model.dill' parameter, the same error occurs.

I've tried changing the number of highly variable genes I keep(ranging from the minimum 3000 to 10000) or even changing my number of clusters by changing n_neighbors, n_pcs, and/or resolution in sc.pp.neighbors() and sc.tl.leiden(), but varying those parameters didn't fix the error either.

The package works perfectly for me otherwise, it is just at this part do I start having errors. I haven't seen anyone else share this problem so I am hoping you know of a fix because I am at a dead end and troubleshooting by myself isn't working out so well.

Hope you're doing well, Todd

ScreachingFire commented 2 years ago

Additional notes: I checked for NaNs and Infinities before calling cello.scanpy_cello() and could find none, so I am thinking that these values are popping up while the function is being called.

mbernste commented 2 years ago

Hi, I am sorry you are running into this issue. Thank you for providing the traceback. It looks to me like the error is occurring inside the PCA function that is implemented by scikit-learn. Specifically, it looks to me like it is breaking due to a NaN value. Would it be possible to send me either the code you are using to normalize the data, or, if possible, the dataset that is producing these errors?

ScreachingFire commented 2 years ago

Would it be possible to send me either the code you are using to normalize the data, or, if possible, the dataset that is producing these errors?

Dataset: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE143704&format=file&file=GSE143704%5FDeMicheli%5FHumanMuscleAtlas%5Frawdata%2Etxt%2Egz Code:

adata = sc.read_text('/Users/name/Documents/Bio Data/Human/SkeletalMuscle/Dataset2(22kCell)/GSE143704_DeMicheli_HumanMuscleAtlas_rawdata.txt').transpose()
adata.var_names_make_unique()

adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
sc.pp.filter_cells(adata, max_counts=5050)
sc.pp.filter_cells(adata, min_genes=510)
adata = adata[adata.obs.pct_counts_mt < 22, :]

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata, base=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5, n_bins=20, n_top_genes=7500)
adata = adata[:, adata.var.highly_variable]

sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
sc.pp.scale(adata, max_value=10)

sc.tl.pca(adata, svd_solver='arpack', n_comps=60)

More notes: Prior to Transforming with PCA... I do get two more errors(other than the 'variables are not unique' which is not important), they don't stop the program so I assumed they may be insignificant but now I am thinking otherwise:

/p-scRNAseq/lib/python3.9/site-packages/cello/cello.py:459: RuntimeWarning: invalid value encountered in log x_clust = np.log(x_clust+1)

/p-scRNAseq/lib/python3.9/site-packages/cello/cello.py:477: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. PassAnnData(X, dtype=X.dtype, ...)to get the future behavour. ad_mean_clust = AnnData(

adata.X.dtype prior to calling cello.scanpy_cello() is still float32 so I don't know why the second error is popping up.

byorntan commented 1 year ago

Would it be possible to send me either the code you are using to normalize the data, or, if possible, the dataset that is producing these errors?

Dataset: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE143704&format=file&file=GSE143704%5FDeMicheli%5FHumanMuscleAtlas%5Frawdata%2Etxt%2Egz Code:

adata = sc.read_text('/Users/name/Documents/Bio Data/Human/SkeletalMuscle/Dataset2(22kCell)/GSE143704_DeMicheli_HumanMuscleAtlas_rawdata.txt').transpose()
adata.var_names_make_unique()

adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
sc.pp.filter_cells(adata, max_counts=5050)
sc.pp.filter_cells(adata, min_genes=510)
adata = adata[adata.obs.pct_counts_mt < 22, :]

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata, base=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5, n_bins=20, n_top_genes=7500)
adata = adata[:, adata.var.highly_variable]

sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
sc.pp.scale(adata, max_value=10)

sc.tl.pca(adata, svd_solver='arpack', n_comps=60)

More notes: Prior to Transforming with PCA... I do get two more errors(other than the 'variables are not unique' which is not important), they don't stop the program so I assumed they may be insignificant but now I am thinking otherwise:

/p-scRNAseq/lib/python3.9/site-packages/cello/cello.py:459: RuntimeWarning: invalid value encountered in log x_clust = np.log(x_clust+1)

/p-scRNAseq/lib/python3.9/site-packages/cello/cello.py:477: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. PassAnnData(X, dtype=X.dtype, ...)to get the future behavour. ad_mean_clust = AnnData(

adata.X.dtype prior to calling cello.scanpy_cello() is still float32 so I don't know why the second error is popping up.

I encountered a similar issue to this, and realised that the step on regressing out pct_counts_mt (sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])) was the culprit here as it would cause the already log1p-transformed values to be shifted away to more negative values, whereas the _aggregate_expression helper function assumes that all values are the pre-regression counts (cello.py line 455: x_clust = np.squeeze(np.array(np.sum(X, axis=0)))).

Skipping the regress_out step resolves this issue.

I'm not entirely sure how this can be addressed though - @mbernste is this issue already known?