Getting error when running predictions

dadarenedo commented 1 year ago

predictions = celltypist.annotate(adata, model = 'Immune_All_Low.pkl', majority_voting = True)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I checked adata.X and it has no NaN values; I also shortened the values. I'm not sure what else to try.

ChuanXu1 commented 1 year ago

@dadarenedo, according to the error message, there should be some NaN values in the data. Could you confirm by running np.isnan(adata.X.data).sum() and checking whether it returns 0?
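
The check suggested above can be extended to cover infinite values and both dense and sparse matrices. The following is a minimal sketch, not part of CellTypist: `count_non_finite` is a hypothetical helper, and the toy matrices stand in for `adata.X`.

```python
import numpy as np
from scipy import sparse


def count_non_finite(X):
    """Count NaN or infinite entries in a dense or sparse matrix (hypothetical helper)."""
    # A sparse matrix stores only its explicit entries in .data;
    # the implicit zeros are finite, so checking .data is sufficient.
    values = X.data if sparse.issparse(X) else np.asarray(X)
    return int((~np.isfinite(values)).sum())


# Toy stand-ins for adata.X
dense = np.array([[1.0, np.nan], [np.inf, 0.0]])
sp = sparse.csr_matrix(np.array([[0.0, 2.0], [np.nan, 0.0]]))

print(count_non_finite(dense), count_non_finite(sp))
```

A zero count here only describes the matrix as stored. In the traceback posted later in this thread, the error is raised after the "Scaling input data" step, so it may still be worth checking the data that actually reaches the classifier.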

curryfly5 commented 1 year ago

I ran into the same problem with a .pkl model I trained for my subtype (Lineage). Every time I subset a specific Lineage from the whole adata and use the sub_adata to run celltypist:

predictions = celltypist.annotate(adata_query, model = '/disk212/yupf/database/scRNA-seq/scanpy_analysis/atlas/ref_il/model_from_reil_22.11.27.pkl', majority_voting = True)

I got the error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). Maybe some imputation is needed?

ChuanXu1 commented 1 year ago

@curryfly5, did you use the log-normalised expression for both training and prediction? If you could paste your code relating to model training and prediction to reproduce the above error, that will be great for us to debug.

curryfly5 commented 1 year ago

Glad to receive your prompt reply. The training process went fine and I got the model successfully, but when I use the model for prediction I get an error. Here is my code:

si=sc.read_h5ad('/disk212/yupf/database/scRNA-seq/scanpy_analysis/atlas/lineage/SI/si_bbknn.h5ad')
endo=si[si.obs['Lineage']=='Endothelial lineage']
predictions = celltypist.annotate(endo, model = '/disk212/yupf/database/scRNA-seq/scanpy_analysis/atlas/lineage/SI/ENDO/endo_human.pkl', majority_voting = True)

and the Traceback is

🔬 Input data has 1100 cells and 2352 genes
🔗 Matching reference genes in the model
🧬 135 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
ValueError                                Traceback (most recent call last)
/disk212/yupf/database/scRNA-seq/scanpy_analysis/atlas/lineage/SI/ENDO/endo.ipynb  32 in ()
      2 endo=si[si.obs['Lineage']=='Endothelial lineage']
      3 t_start = time.time()
----> 4 predictions = celltypist.annotate(endo, model = '/disk212/yupf/database/scRNA-seq/scanpy_analysis/atlas/lineage/SI/ENDO/endo_human.pkl', majority_voting = True)
      5 t_end = time.time()
      6 print(f"Time elapsed: {t_end - t_start} seconds")

File ~/miniconda3/envs/scanly/lib/python3.8/site-packages/celltypist/annotate.py:81, in annotate(filename, model, transpose_input, gene_file, cell_file, mode, p_thres, majority_voting, over_clustering, min_prop)
     79 clf = classifier.Classifier(filename = filename, model = lr_classifier, transpose = transpose_input, gene_file = gene_file, cell_file = cell_file)
     80 #predict
---> 81 predictions = clf.celltype(mode = mode, p_thres = p_thres)
     82 if not majority_voting:
     83     return predictions

File ~/miniconda3/envs/scanly/lib/python3.8/site-packages/celltypist/classifier.py:376, in Classifier.celltype(self, mode, p_thres)
    373 self.model.classifier.coef_ = self.model.classifier.coef_[:, lr_idx]
    375 logger.info("🖋️ Predicting labels")
--> 376 decision_mat, prob_mat, lab = self.model.predict_labels_and_prob(self.indata, mode = mode, p_thres = p_thres)
    377 logger.info("✅ Prediction done!")
    379 #restore model after prediction

File ~/miniconda3/envs/scanly/lib/python3.8/site-packages/celltypist/models.py:145, in Model.predict_labels_and_prob(self, indata, mode, p_thres)
    123 def predict_labels_and_prob(self, indata, mode: str = 'best match', p_thres: float = 0.5) -> tuple:
    124     """
    125     Get the decision matrix, probability matrix, and predicted cell types for the input data.
    126 
   (...)
    143         A tuple of decision score matrix, raw probability matrix, and predicted cell type labels.
    144     """
--> 145     scores = self.classifier.decision_function(indata)
    146     if len(self.cell_types) == 2:
    147         scores = np.column_stack([-scores, scores])

File ~/miniconda3/envs/scanly/lib/python3.8/site-packages/sklearn/linear_model/_base.py:429, in LinearClassifierMixin.decision_function(self, X)
    409 """
    410 Predict confidence scores for samples.
    411 
   (...)
    425     this class would be predicted.
    426 """
    427 check_is_fitted(self)
--> 429 X = self._validate_data(X, accept_sparse="csr", reset=False)
    430 scores = safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_
...
    148 # for object dtype data, we only check for NaNs (GH-13254)
    149 elif X.dtype == np.dtype("object") and not allow_nan:

ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

The problem seems to arise at the subsetting step, endo=si[si.obs['Lineage']=='Endothelial lineage'].

ChuanXu1 commented 1 year ago

@curryfly5, what is your code for training the model? Also, it seems your data has fewer genes (2352) than would be expected (~20-30k).
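
One way to quantify the concern about gene coverage is to compare the gene names the model was trained on with those present in the query. This is a sketch under assumptions: `feature_overlap` is a hypothetical helper, and the gene lists are toy values. In practice the two inputs would presumably be the model's feature list and the query's `adata.var_names`.

```python
import numpy as np


def feature_overlap(model_features, query_genes):
    """Return (number of shared genes, fraction of model features found in the query)."""
    model_features = np.asarray(model_features)
    shared = np.intersect1d(model_features, np.asarray(query_genes))
    return shared.size, shared.size / model_features.size


# Toy gene lists, for illustration only
model_features = ["PECAM1", "VWF", "CDH5", "KDR"]
query_genes = ["VWF", "ACTB", "KDR"]

n_shared, frac = feature_overlap(model_features, query_genes)
print(n_shared, frac)
```

A low fraction would be consistent with the log line "135 features used for prediction" in the traceback above, where only a small part of the model's genes were matched in the 2352-gene query.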

curryfly5 commented 1 year ago

Thanks! I will check my data. It seems to be a subset of highly_variable_genes, so I will try the raw data. Here is my training code:

sampled_cell_index = celltypist.samples.downsample_adata(ref, mode = 'each', n_cells = 1500, by = 'annotation', return_index = True)
model_fs = celltypist.train(ref[sampled_cell_index], 'annotation', n_jobs = 10, max_iter = 5, use_SGD = True)

✂️ 5481 non-expressed genes are filtered out
⚖️ Scaling input data
🏋️ Training data using SGD logistic regression
✅ Model training done!

gene_index = np.argpartition(np.abs(model_fs.classifier.coef_), -200, axis = 1)[:, -200:]
gene_index = np.unique(gene_index)

Number of genes selected: 2056
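
The two lines above pick, for each cell type, the 200 genes with the largest absolute coefficients and then take the union across cell types. A small self-contained illustration of the same indexing pattern follows. The coefficient matrix is made up, and `top_k_features_per_class` is a hypothetical wrapper name.

```python
import numpy as np


def top_k_features_per_class(coef, k):
    """Union of the column indices holding the k largest |coef| values in each row."""
    # argpartition places the k largest entries of each row in the last k positions
    # (in no particular order), which is cheaper than a full sort.
    top_idx = np.argpartition(np.abs(coef), -k, axis=1)[:, -k:]
    # np.unique flattens the index array and removes genes picked by several classes.
    return np.unique(top_idx)


# Toy coefficients: 3 classes (rows) x 6 genes (columns)
coef = np.array([
    [0.1, -2.0, 0.3, 1.5, 0.0, 0.05],
    [3.0, 0.2, -0.1, 1.0, -2.5, 0.0],
    [0.0, 0.1, 4.0, -0.2, 0.3, 0.01],
])

selected = top_k_features_per_class(coef, k=2)
print(selected)  # [0 1 2 3 4]
```

Because the union removes duplicates, the number of selected genes can be smaller than k times the number of cell types whenever cell types share top-ranked genes.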

model = celltypist.train(ref[sampled_cell_index, gene_index], 'annotation', check_expression = False, n_jobs = 10, max_iter = 300)
model.write('/disk212/yupf/database/scRNA-seq/scanpy_analysis/atlas/lineage/SI/ENDO/endo_human.pkl')

🍳 Preparing data before training
✂️ 366 non-expressed genes are filtered out
⚖️ Scaling input data
🏋️ Training data using logistic regression
✅ Model training done!

ChuanXu1 commented 1 year ago

@curryfly5, the training code looks fine. For prediction, you can try the log-normalised expression from all genes and let me know if the error persists.

curryfly5 commented 1 year ago

Thanks again! When I use the data with all genes, it works well!

endo=endo.raw.to_adata()

predictions = celltypist.annotate(endo, model = '/disk212/yupf/database/scRNA-seq/scanpy_analysis/atlas/lineage/SI/ENDO/endo_human.pkl', majority_voting = True)

🔬 Input data has 1100 cells and 19019 genes
🔗 Matching reference genes in the model
🧬 945 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 5
🗳️ Majority voting the predictions
✅ Majority voting done!
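
Since the advice in this thread was to predict on log-normalised expression from all genes, a quick sanity check on the matrix before calling annotate may help. The sketch below assumes a normalisation target of 10,000 counts per cell followed by log1p. Both `looks_log1p_normalised` and the tolerance are hypothetical choices for illustration.

```python
import numpy as np
from scipy import sparse


def looks_log1p_normalised(X, target_sum=1e4, rtol=0.01):
    """True if undoing log1p gives per-cell totals close to target_sum (assumed 10,000)."""
    # Densifying is fine for a sketch; a large sparse matrix would need a sparse-aware version.
    X = X.toarray() if sparse.issparse(X) else np.asarray(X, dtype=float)
    totals = np.expm1(X).sum(axis=1)
    return bool(np.allclose(totals, target_sum, rtol=rtol))


# Toy data: raw counts, and the same counts normalised to 10,000 per cell then log1p-transformed
counts = np.array([[5.0, 15.0, 0.0], [1.0, 0.0, 3.0]])
normalised = np.log1p(counts / counts.sum(axis=1, keepdims=True) * 1e4)

print(looks_log1p_normalised(normalised), looks_log1p_normalised(counts))
```

A matrix that was scaled or restricted to highly variable genes after normalisation would generally fail this check, which matches the situation described earlier in the thread.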

Teichlab / celltypist

Getting error when running predictions #51