Gene names VS gene IDs in precomputed models

emdann commented 2 years ago

Hello celltypers,

While using a trained celltypist model on my data, I got this error. It took me a little while to realise it was coming from having mismatched feature names: my adata.var_names are EnsemblIDs while the model uses gene names.

predictions = celltypist.annotate(adata, model = 'Immune_All_Low.pkl', majority_voting = True)

🔬 Input data has 634000 cells and 5000 genes
🔗 Matching reference genes in the model
🧬 0 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-248-0b4cb11719f9> in <module>
----> 1 predictions = celltypist.annotate(adata, model = 'Immune_All_Low.pkl', majority_voting = True)

~/my-conda-envs/emma_env/lib/python3.7/site-packages/celltypist/annotate.py in annotate(filename, model, transpose_input, gene_file, cell_file, mode, p_thres, majority_voting, over_clustering, min_prop)
     79     clf = classifier.Classifier(filename = filename, model = lr_classifier, transpose = transpose_input, gene_file = gene_file, cell_file = cell_file)
     80     #predict
---> 81     predictions = clf.celltype(mode = mode, p_thres = p_thres)
     82     if not majority_voting:
     83         return predictions

~/my-conda-envs/emma_env/lib/python3.7/site-packages/celltypist/classifier.py in celltype(self, mode, p_thres)
    349 
    350         logger.info("🖋️ Predicting labels")
--> 351         decision_mat, prob_mat, lab = self.model.predict_labels_and_prob(self.indata, mode = mode, p_thres = p_thres)
    352         logger.info("✅ Prediction done!")
    353 

~/my-conda-envs/emma_env/lib/python3.7/site-packages/celltypist/models.py in predict_labels_and_prob(self, indata, mode, p_thres)
    118             A tuple of decision score matrix, raw probability matrix, and predicted cell type labels.
    119         """
--> 120         scores = self.classifier.decision_function(indata)
    121         probs = expit(scores)
    122         if mode == 'best match':

~/my-conda-envs/emma_env/lib/python3.7/site-packages/sklearn/linear_model/_base.py in decision_function(self, X)
    280         check_is_fitted(self)
    281 
--> 282         X = check_array(X, accept_sparse='csr')
    283 
    284         n_features = self.coef_.shape[1]

~/my-conda-envs/emma_env/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~/my-conda-envs/emma_env/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    659                              " a minimum of %d is required%s."
    660                              % (n_features, array.shape, ensure_min_features,
--> 661                                 context))
    662 
    663     if copy and np.may_share_memory(array, array_orig):

ValueError: Found array with 0 feature(s) (shape=(634000, 0)) while a minimum of 1 is required.

This made me think of two suggestions:

Could the error message for this case become a bit more informative? If the feature overlap is 0, then print a message saying "Are you using gene names in adata.var_names?" With the current message, I first thought it was triggered by having all zeros in some row or column.
Using gene names while matching info between datasets can be problematic, because of name duplication or mismatches between gene annotation databases. Would it be possible to also store unique gene IDs in the model objects (e.g. Ensembl IDs) and give an option to select the type of feature names to use in celltypist.annotate?
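
To make the first suggestion concrete, here is a minimal sketch of such a check. It is not celltypist code: the function name check_feature_overlap, the Ensembl ID pattern, and the 50% threshold are all assumptions made for illustration, and the gene lists are plain Python lists rather than adata.var_names or a model object.

```python
import re

# Assumed pattern for Ensembl gene IDs such as ENSG00000141510 or
# ENSMUSG00000059552, optionally followed by a version suffix like ".12".
ENSEMBL_PATTERN = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def check_feature_overlap(input_genes, model_genes):
    """Return the sorted shared genes, or raise a descriptive error if none overlap."""
    input_genes = list(input_genes)
    shared = sorted(set(input_genes) & set(model_genes))
    if shared:
        return shared

    # No overlap: guess whether the input uses Ensembl IDs to give a better hint.
    n_ensembl = sum(bool(ENSEMBL_PATTERN.match(gene)) for gene in input_genes)
    hint = ""
    if input_genes and n_ensembl / len(input_genes) > 0.5:
        hint = " The input features look like Ensembl IDs; does the model use gene names?"
    raise ValueError("No features overlap between the input data and the model." + hint)
```

Called with the feature names of the input data and of the model before prediction, a check like this would fail early with a message that points at the naming mismatch instead of at the array shape.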

ChuanXu1 commented 2 years ago

@emdann

information relating to no-feature overlap is logged and output after version 0.2.0 #15
good suggestion! will add this

emdann commented 2 years ago

brilliant! Thank you

ChuanXu1 commented 2 years ago

@emdann, in my tests some Ensembl IDs matched multiple gene symbols, so it is not straightforward to store Ensembl IDs alongside gene symbols in a single model. Moreover, besides Ensembl IDs, users may have other needs (HGNC IDs, old gene symbols, etc.).

Therefore, there is a convert method, which was initially designed to convert a human/mouse model to a mouse/human model by mapping orthologous genes. This method can also be used for the Ensembl ID case: the user provides a map file to convert the genes in the model to another format (Ensembl IDs, HGNC IDs, orthologous genes, ...).

model = celltypist.Model.load("some_model.pkl")
#the map file provided by the user based on what they'd like to transform the gene symbols to
model.convert(map_file = 'symbol2ID.csv')
model.write("/path/to/converted_some_model.pkl")

I think this should be a better way to deal with the case you encountered.

Also see the Usage -> Supplemental guidance -> Cross-species model conversion
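
For anyone preparing a map file like symbol2ID.csv, a rough sketch follows. The layout (two comma-separated columns, gene symbol then target ID, no header row) is an assumption here, so please check the celltypist documentation for the format convert actually expects. The helper names one_to_one_pairs and write_map_file are hypothetical. Because some Ensembl IDs can match multiple gene symbols, the sketch keeps only one-to-one pairs.

```python
import csv
from collections import Counter


def one_to_one_pairs(pairs):
    """Keep only (symbol, gene_id) pairs where both sides occur exactly once."""
    pairs = list(dict.fromkeys(pairs))  # drop exact duplicate rows, keep order
    symbol_counts = Counter(symbol for symbol, _ in pairs)
    id_counts = Counter(gene_id for _, gene_id in pairs)
    return [
        (symbol, gene_id)
        for symbol, gene_id in pairs
        if symbol_counts[symbol] == 1 and id_counts[gene_id] == 1
    ]


def write_map_file(pairs, path):
    """Write the unambiguous pairs as a header-less two-column CSV; return the row count."""
    kept = one_to_one_pairs(pairs)
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(kept)
    return len(kept)
```

The pairs themselves would come from whichever annotation source you trust (for example the gene table shipped with your alignment reference); dropping ambiguous pairs is one possible policy, not the only one.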

Teichlab / celltypist

Gene names VS gene IDs in precomputed models #26