Closed dadarenedo closed 1 year ago
@dadarenedo, according to the error message, there should be some NaN values in the data. Could you run np.isnan(adata.X.data).sum()
and confirm whether it returns 0?
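For anyone running this check: the error message also mentions infinity, and `adata.X` may be either dense or sparse. Here is a minimal sketch of a check that covers both cases. The helper name `count_nonfinite` is made up for illustration and is not part of celltypist or scanpy.

```python
import numpy as np
from scipy import sparse

def count_nonfinite(X):
    """Return (n_nan, n_inf) for a dense array or a SciPy sparse matrix."""
    if sparse.issparse(X):
        # Sparse matrices only store explicit entries, so checking them is enough.
        values = X.tocsr().data
    else:
        values = np.asarray(X)
    values = np.asarray(values, dtype=float)
    return int(np.isnan(values).sum()), int(np.isinf(values).sum())

# Toy matrix standing in for adata.X
dense = np.array([[0.0, 1.5], [np.nan, np.inf]])
print(count_nonfinite(dense))
print(count_nonfinite(sparse.csr_matrix(dense)))
```

On a real object this would be called as `count_nonfinite(adata.X)`; both counts should be 0 before prediction.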
I ran into the same problem when training a model (.pkl) for my subtype (Lineage). Each time, I subset a specific Lineage from the whole adata and use the sub_adata to run celltypist:
predictions = celltypist.annotate(adata_query, model = '/disk212/yupf/database/scRNA-seq/scanpy_analysis/atlas/ref_il/model_from_reil_22.11.27.pkl', majority_voting = True)
I get _ValueError: Input contains NaN, infinity or a value too large for dtype('float64')._ Does it maybe need some imputation?
@curryfly5, did you use the log-normalised expression for both training and prediction? If you could paste the code for model training and prediction that reproduces the above error, that would help us debug.
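A rough way to answer the log-normalisation question yourself, as a sketch only: CellTypist's documentation asks for log1p-normalised expression to 10,000 counts per cell, so undoing the log and summing each cell should give about 10,000. The function name, the tolerance, and the toy matrix below are assumptions for illustration. Note that a matrix subset to highly variable genes, or one that has been z-scored, will fail this check.

```python
import numpy as np

def looks_log1p_normalised(X, target_sum=1e4, rel_tol=0.01):
    """Heuristic: undo log1p and check that every cell sums to about target_sum."""
    X = np.asarray(X, dtype=float)
    if X.min() < 0:
        # Scaled (z-scored) data contains negative values, log1p data does not.
        return False
    totals = np.expm1(X).sum(axis=1)
    return bool(np.all(np.abs(totals - target_sum) <= rel_tol * target_sum))

# Toy counts: 2 cells x 3 genes
counts = np.array([[4.0, 16.0, 0.0], [1.0, 2.0, 1.0]])
lognorm = np.log1p(counts / counts.sum(axis=1, keepdims=True) * 1e4)
scaled = (lognorm - lognorm.mean(axis=0)) / lognorm.std(axis=0)

print(looks_log1p_normalised(lognorm))         # full log-normalised matrix
print(looks_log1p_normalised(scaled))          # z-scored matrix
print(looks_log1p_normalised(lognorm[:, :2]))  # gene subset
```

For a sparse `adata.X`, convert with `adata.X.toarray()` first (or run the check on a few cells only to keep memory low).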
Glad to receive your prompt reply. The training process went fine and I got the model successfully, but when I use the model for prediction I get an error. Here is my code:
si=sc.read_h5ad('/disk212/yupf/database/scRNA-seq/scanpy_analysis/atlas/lineage/SI/si_bbknn.h5ad')
endo=si[si.obs['Lineage']=='Endothelial lineage']
predictions = celltypist.annotate(endo, model = '/disk212/yupf/database/scRNA-seq/scanpy_analysis/atlas/lineage/SI/ENDO/endo_human.pkl', majority_voting = True)
And the traceback is:
🔬 Input data has 1100 cells and 2352 genes
🔗 Matching reference genes in the model
🧬 135 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
ValueError Traceback (most recent call last)
/disk212/yupf/database/scRNA-seq/scanpy_analysis/atlas/lineage/SI/ENDO/endo.ipynb 32 in ()
      2 endo=si[si.obs['Lineage']=='Endothelial lineage']
      3 t_start = time.time()
----> 4 predictions = celltypist.annotate(endo, model = '/disk212/yupf/database/scRNA-seq/scanpy_analysis/atlas/lineage/SI/ENDO/endo_human.pkl', majority_voting = True)
      5 t_end = time.time()
      6 print(f"Time elapsed: {t_end - t_start} seconds")
File ~/miniconda3/envs/scanly/lib/python3.8/site-packages/celltypist/annotate.py:81, in annotate(filename, model, transpose_input, gene_file, cell_file, mode, p_thres, majority_voting, over_clustering, min_prop)
79 clf = classifier.Classifier(filename = filename, model = lr_classifier, transpose = transpose_input, gene_file = gene_file, cell_file = cell_file)
80 #predict
---> 81 predictions = clf.celltype(mode = mode, p_thres = p_thres)
82 if not majority_voting:
83 return predictions
File ~/miniconda3/envs/scanly/lib/python3.8/site-packages/celltypist/classifier.py:376, in Classifier.celltype(self, mode, p_thres)
373 self.model.classifier.coef_ = self.model.classifier.coef_[:, lr_idx]
375 logger.info("🖋️ Predicting labels")
--> 376 decision_mat, prob_mat, lab = self.model.predict_labels_and_prob(self.indata, mode = mode, p_thres = p_thres)
377 logger.info("✅ Prediction done!")
379 #restore model after prediction
File ~/miniconda3/envs/scanly/lib/python3.8/site-packages/celltypist/models.py:145, in Model.predict_labels_and_prob(self, indata, mode, p_thres)
123 def predict_labels_and_prob(self, indata, mode: str = 'best match', p_thres: float = 0.5) -> tuple:
124 """
125 Get the decision matrix, probability matrix, and predicted cell types for the input data.
126
(...)
143 A tuple of decision score matrix, raw probability matrix, and predicted cell type labels.
144 """
--> 145 scores = self.classifier.decision_function(indata)
146 if len(self.cell_types) == 2:
147 scores = np.column_stack([-scores, scores])
File ~/miniconda3/envs/scanly/lib/python3.8/site-packages/sklearn/linear_model/_base.py:429, in LinearClassifierMixin.decision_function(self, X)
409 """
410 Predict confidence scores for samples.
411
(...)
425 this class would be predicted.
426 """
427 check_is_fitted(self)
--> 429 X = self._validate_data(X, accept_sparse="csr", reset=False)
430 scores = safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_
...
148 # for object dtype data, we only check for NaNs (GH-13254)
149 elif X.dtype == np.dtype("object") and not allow_nan:
ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
The problem seems to come from the subsetting step, endo=si[si.obs['Lineage']=='Endothelial lineage'].
@curryfly5, what is your code for training the model? Also, it seems your data has fewer genes (2352) than would be expected (~20-30k).
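To see how much a reduced gene set costs, one can compare the model's gene names with the query's gene names before predicting. In the log above only 135 features were used for prediction from the 2352-gene input. A minimal sketch with made-up gene lists follows; in practice the two lists would come from the loaded model and from `adata.var_names`, and `feature_overlap` is a hypothetical helper, not a celltypist function.

```python
import numpy as np

def feature_overlap(model_genes, query_genes):
    """Return (n_shared, n_model, missing) for two collections of gene names."""
    model_genes = np.asarray(model_genes)
    shared = np.isin(model_genes, np.asarray(query_genes))
    return int(shared.sum()), int(shared.size), model_genes[~shared]

# Toy gene lists for illustration only
model_genes = ["PECAM1", "VWF", "CDH5", "KDR"]
query_genes = ["VWF", "KDR", "ACTB"]

n_shared, n_model, missing = feature_overlap(model_genes, query_genes)
print(f"{n_shared}/{n_model} model genes found in query; missing: {list(missing)}")
```

A low shared fraction suggests the query was subset (for example to highly variable genes) and that the full gene matrix should be used instead.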
Thanks! I will check my data. It seems to be the subset of highly variable genes, so I will try the raw data. Here is my training code:
sampled_cell_index = celltypist.samples.downsample_adata(ref, mode = 'each', n_cells = 1500, by = 'annotation', return_index = True)
model_fs = celltypist.train(ref[sampled_cell_index], 'annotation', n_jobs = 10, max_iter = 5, use_SGD = True)
✂️ 5481 non-expressed genes are filtered out
⚖️ Scaling input data
🏋️ Training data using SGD logistic regression
✅ Model training done!
gene_index = np.argpartition(np.abs(model_fs.classifier.coef_), -200, axis = 1)[:, -200:]
gene_index = np.unique(gene_index)
Number of genes selected: 2056
model = celltypist.train(ref[sampled_cell_index, gene_index], 'annotation', check_expression = False, n_jobs = 10, max_iter = 300)
model.write('/disk212/yupf/database/scRNA-seq/scanpy_analysis/atlas/lineage/SI/ENDO/endo_human.pkl')
🍳 Preparing data before training
✂️ 366 non-expressed genes are filtered out
⚖️ Scaling input data
🏋️ Training data using logistic regression
✅ Model training done!
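For readers unfamiliar with the feature-selection step in the training code above: `np.argpartition` on the absolute coefficients picks the top-k genes per cell type, and `np.unique` merges them into one gene set. That is why the final count (2056 here) can be smaller than 200 times the number of cell types. A toy sketch with a random coefficient matrix standing in for `model_fs.classifier.coef_`:

```python
import numpy as np

rng = np.random.default_rng(0)
coef = rng.normal(size=(3, 10))  # toy stand-in: 3 cell types x 10 genes
top_k = 2                        # the thread uses 200

# Column indices of the top_k largest |coefficients| in each row (unordered within the row)
top_idx = np.argpartition(np.abs(coef), -top_k, axis=1)[:, -top_k:]

# Union of the per-cell-type gene indices, sorted and de-duplicated
selected = np.unique(top_idx)

print(top_idx)
print(f"Number of genes selected: {selected.size}")
```

Because genes that rank highly for several cell types are counted once, `selected.size` is at most `top_k` times the number of cell types.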
@curryfly5, the training code looks fine. For prediction, you can try the log-normalised expression from all genes and let me know if the error persists.
Thanks again! When I use the data with all genes, it works well!
endo=endo.raw.to_adata()
predictions = celltypist.annotate(endo, model = '/disk212/yupf/database/scRNA-seq/scanpy_analysis/atlas/lineage/SI/ENDO/endo_human.pkl', majority_voting = True)
🔬 Input data has 1100 cells and 19019 genes
🔗 Matching reference genes in the model
🧬 945 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 5
🗳️ Majority voting the predictions
✅ Majority voting done!
predictions = celltypist.annotate(adata, model = 'Immune_All_Low.pkl', majority_voting = True)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I checked adata.X and it has no NaN values; I also shortened the values. Not sure what else to try.