Teichlab / celltypist

A tool for semi-automatic cell type classification
https://www.celltypist.org/
MIT License

Allow to specify the AnnData var field that has the gene_symbols instead of only relying on var_names #87

Closed by pcm32 9 months ago

pcm32 commented 9 months ago

Hi there, thanks for this great tool.

Sometimes AnnData files are indexed by Ensembl or NCBI gene identifiers rather than gene symbols. Could you add a CLI parameter for specifying which field of var holds the gene symbols? Otherwise, to use the CLI, one has to read the AnnData into memory, change the index of var (if that is possible at all), and write the AnnData back out, which can cost a lot of time and disk space. It would be much nicer if this could be handled in memory by the CLI.
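
For readers who can use the Python API instead of the CLI, the index swap described above can be done in memory without rewriting the file. The following is only a sketch: the helper name `use_symbols_as_var_names`, the var field name `gene_symbols`, and the added `original_id` column are illustrative assumptions, not part of celltypist, and `adata` is assumed to be an AnnData object.

```python
def use_symbols_as_var_names(adata, symbol_field="gene_symbols"):
    """Replace adata.var_names with the symbols stored in adata.var[symbol_field].

    Hypothetical helper; "gene_symbols" is an assumed field name.
    """
    original_ids = list(adata.var_names)
    symbols = list(adata.var[symbol_field])
    # Fall back to the original identifier when a symbol is missing or empty.
    new_names = [
        symbol if isinstance(symbol, str) and symbol else gene_id
        for symbol, gene_id in zip(symbols, original_ids)
    ]
    # Refuse to continue if two genes would end up with the same name.
    if len(set(new_names)) != len(new_names):
        raise ValueError("duplicate gene symbols; resolve them before annotating")
    # Keep the original identifiers in a new column (assumed name) for reference.
    adata.var["original_id"] = original_ids
    adata.var_names = new_names
    return adata

# Possible usage (not executed here; the file path is hypothetical):
#   import anndata, celltypist
#   adata = anndata.read_h5ad("input.h5ad")
#   predictions = celltypist.annotate(use_symbols_as_var_names(adata), model="some_model.pkl")
```

This only helps when the data are already being handled in Python; it does not add the requested CLI option.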

ChuanXu1 commented 9 months ago

@pcm32, you can provide a symbol-to-ID mapping file (one column being gene symbols and the other column being IDs), and use it to convert the model from gene symbols to Ensembl IDs.

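import celltypist

# Load a model
model = celltypist.Model.load("some_model.pkl")
# Convert the model's genes using the mapping file
model.convert("path_to_your_mapping_file")
# Predict with the converted model (input_data is your own AnnData object or file path)
predictions = celltypist.annotate(input_data, model=model)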
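
If you need to build the two-column mapping file yourself, a minimal sketch with the standard library could look like this. The helper name `write_mapping_file`, the comma separator, the absence of a header row, and the symbol-then-ID column order are all assumptions for illustration; check the tutorial for the exact layout that `model.convert` expects.

```python
import csv

def write_mapping_file(pairs, path):
    """Write (gene_symbol, gene_id) pairs to a two-column CSV file.

    Hypothetical helper; returns the number of rows written.
    """
    seen = set()
    n_written = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for symbol, gene_id in pairs:
            # Skip incomplete pairs and exact repeats.
            if not symbol or not gene_id or (symbol, gene_id) in seen:
                continue
            seen.add((symbol, gene_id))
            writer.writerow([symbol, gene_id])
            n_written += 1
    return n_written

# Illustrative pairs only; a real mapping would cover every gene in the model.
example_pairs = [
    ("TP53", "ENSG00000141510"),
    ("GAPDH", "ENSG00000111640"),
    ("ACTB", "ENSG00000075624"),
    ("TP53", "ENSG00000141510"),  # repeated pair, written only once
]
```

Calling `write_mapping_file(example_pairs, "path_to_your_mapping_file")` would then produce a file that can be passed to `model.convert`, under the assumptions above.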

In the new version (1.6.1), I also added a mapping file based on GENCODE version 44, so you can use that if you cannot find a mapping file yourself. Details are in the online tutorial (https://github.com/Teichlab/celltypist -> Usage (classification) -> Supplemental guidance -> Model conversion from gene symbols to Ensembl IDs).

pcm32 commented 9 months ago

Converting the models as suggested works, thanks!