Can not detect a neighborhood graph, will construct one before the over-clustering

Sorry for the very basic questions. I was following this tutorial to build a custom reference: https://colab.research.google.com/github/Teichlab/celltypist/blob/main/docs/notebook/celltypist_tutorial_cv.ipynb#scrollTo=therapeutic-mixture

1) When I ran predictions = celltypist.annotate(adata, model = new_model, majority_voting = True), I received the following warning: "Can not detect a neighborhood graph, will construct one before the over-clustering", and it has now been running for a long time. What did I need to do other than normalizing and log-transforming? Also, after it finished, I compared the resulting UMAP of the reference to the one from standard Scanpy preprocessing without CellTypist, and the UMAPs are different. What is going on under the hood?

2) Another question I have is whether the labels argument can take multiple labels. The reference has both a major_celltype and a more specific_celltype annotation, so how can I transfer both? new_model = celltypist.train(adata, labels = 'major_celltype', n_jobs = 10, feature_selection = True)

3) What does n_jobs mean?

4) Regarding this part: "Overall, we suggest the users to perform their own feature selection before training to alleviate the training burden." I already have a list of markers for each cell type. How do I use it? Can CellTypist be used with a marker list instead of a reference?

5) My last question is about the dot plot of original labels versus predicted or majority-voting labels. What does it mean to have a few blue dots? Does it mean the model is weak at predicting these cell types?

@Sirin24,

1) CellTypist performs an over-clustering step for majority voting. If you already have a neighborhood graph in place (i.e., from sc.pp.neighbors), CellTypist will use it; otherwise, it runs a standard Scanpy protocol to construct one, and this step is what causes the long runtime. You can set majority_voting = False to skip the majority voting step, or supply your own neighborhood graph, calculated in advance, for the over-clustering.
2) You need to train two models separately, one for each annotation.
3) n_jobs is the number of CPUs used for the one-vs-rest logistic regression training (each cell type takes one CPU).
4) You can take the union of the markers from each cell type and supply the resulting gene list for training. A similar issue is discussed here: https://github.com/Teichlab/celltypist/issues/107
5) Blue means a probability of <0.5. For example, if you use a blood reference to predict brain cells, all microglia in the brain (100%, a big dot in the dot plot) will be assigned to macrophages in the blood, as this is the best guess; however, the probability will be low.
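
For anyone landing here later, a rough sketch of how the suggestions above (precomputing the graph, one model per label column, training on the union of markers) might fit together. This is not taken from the CellTypist docs: union_markers, train_and_annotate, and the markers_by_type dict are hypothetical names made up for illustration, the obs columns are the ones from the question, and it assumes both AnnData objects hold log1p-normalized expression (10,000 counts per cell) in .X. The check_expression = False setting is an assumption based on how the tutorial trains on a gene subset.

```python
from typing import Dict, Iterable, List


def union_markers(markers_by_type: Dict[str, Iterable[str]],
                  available_genes: Iterable[str]) -> List[str]:
    """Union per-cell-type marker lists, keeping only genes present in the data."""
    available = set(available_genes)
    union = set()
    for genes in markers_by_type.values():
        union.update(genes)
    return sorted(union & available)


def train_and_annotate(ref, query, markers_by_type,
                       label_keys=("major_celltype", "specific_celltype"),
                       n_jobs=10):
    """Hypothetical helper: one model per label column, trained on the marker union.

    ref and query are AnnData objects with log1p-normalized .X.
    """
    # Imported here so the pure-Python helper above works without these packages.
    import scanpy as sc
    import celltypist

    # Build the neighborhood graph up front so CellTypist can reuse it
    # instead of constructing one before the over-clustering.
    # sc.pp.pca and sc.pp.neighbors store results without rewriting .X.
    if "connectivities" not in query.obsp:
        sc.pp.pca(query)
        sc.pp.neighbors(query)

    # Restrict the reference to the union of the user's marker genes.
    genes = union_markers(markers_by_type, ref.var_names)
    ref_sub = ref[:, genes].copy()

    predictions = {}
    for key in label_keys:
        # labels takes one obs column, so each annotation gets its own model.
        # check_expression=False is assumed to be needed because a gene subset
        # no longer sums to 10,000 counts per cell.
        model = celltypist.train(ref_sub, labels=key, n_jobs=n_jobs,
                                 feature_selection=False,
                                 check_expression=False)
        predictions[key] = celltypist.annotate(query, model=model,
                                               majority_voting=True)
    return predictions
```

The marker helper can be tried on its own, e.g. union_markers({"T": ["CD3E", "CD3D"], "B": ["CD79A", "MS4A1"]}, adata.var_names); genes missing from the data are dropped silently, so it is worth comparing the length of the result against the original lists.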


Can not detect a neighborhood graph, will construct one before the over-clustering #111