Robust number of dimension adopted in cellstate.predict

liu-xingliang commented 2 years ago

Hi team,

I noticed that the defaults for ndim and k in the cellstate.predict function are 10 and 20, respectively. My concern is that these values may not suit a large query dataset. What is the recommended way to adjust them? For example, would it be reasonable to set ndim from the knee point of Seurat::ElbowPlot (as in the Seurat workflow), computed on the "projected object" that integrates the reference and query datasets:

library(Seurat)
library(ProjecTILs)
library(magrittr)  # provides the %>% pipe
projected.obj <- make.projection(query = q.obj, ref = r.obj, filter.cells = FALSE)
projected.obj <- projected.obj %>% ScaleData() %>% RunPCA(npcs = 100)
ElbowPlot(projected.obj, ndims = 50)

Interestingly, a large projected object with more than 90k cells and 699 integration features showed a knee point at around 10 PCs in Seurat::ElbowPlot, which seems to confirm the default parameter :).
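Reading a knee point off an elbow plot by eye is subjective, so a rough numeric check can help. The sketch below is one common heuristic, not something Seurat or ProjecTILs provides: it takes the knee to be the point farthest from the straight line joining the first and last points of the curve. The helper name find_knee and the toy standard deviations are assumptions made for illustration.

```r
# Hypothetical helper (not part of Seurat or ProjecTILs): locate the knee of an
# elbow curve as the point farthest from the straight line that joins the
# first and last points of the curve.
find_knee <- function(stdev) {
  n <- length(stdev)
  stopifnot(n >= 3, max(stdev) > min(stdev))
  # Rescale both axes to [0, 1] so that neither axis dominates the distance
  xs <- (seq_len(n) - 1) / (n - 1)
  ys <- (stdev - min(stdev)) / (max(stdev) - min(stdev))
  # Perpendicular distance of every point to the line through the endpoints
  x1 <- xs[1]; y1 <- ys[1]; x2 <- xs[n]; y2 <- ys[n]
  d <- abs((y2 - y1) * xs - (x2 - x1) * ys + x2 * y1 - y2 * x1) /
    sqrt((y2 - y1)^2 + (x2 - x1)^2)
  which.max(d)
}

# Synthetic per-PC standard deviations: a steep drop followed by a flat tail
toy_stdev <- c(20, 14, 10, 7, 5, 4, 3.5, 3.2, 3.0, 2.9, 2.8, 2.75, 2.7, 2.65, 2.6)
knee <- find_knee(toy_stdev)
print(knee)
```

On a real object, the per-PC standard deviations that ElbowPlot displays can be retrieved with Stdev(projected.obj, reduction = "pca") and passed to the helper. The result depends on how many PCs are included, so it is best treated as a sanity check on the plot rather than a replacement for it.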

bless~ Xingliang

mass-a commented 2 years ago

Hello Xingliang, thanks for the message.

I would say that these parameters depend more on the reference than on the query dataset. The ndim parameter specifies the number of PCA components used for calculating neighbors; as a rule of thumb, it should reflect the complexity of the reference atlas (given that the subtypes in the reference are calculated using a limited number of PCA components). The k parameter is the number of neighbors used for assigning a cell type, and within a reasonable range (5 to 50) it does not appear to affect the prediction much.
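To make the roles of ndim and k concrete, here is a minimal base-R sketch of nearest-neighbor label transfer in PCA space. It is not the ProjecTILs implementation; the function knn_predict, the matrices ref_pca and query_pca, and the toy two-cluster data are all hypothetical and only illustrate the general idea of voting among the k nearest reference cells in the first ndim components.

```r
# Illustrative k-nearest-neighbour label transfer in PCA space (base R only).
# NOT the ProjecTILs implementation: knn_predict, ref_pca, query_pca and
# ref_labels are hypothetical names used for this sketch.
knn_predict <- function(ref_pca, ref_labels, query_pca, ndim = 10, k = 20) {
  stopifnot(ncol(ref_pca) >= ndim, ncol(query_pca) >= ndim,
            nrow(ref_pca) >= k, length(ref_labels) == nrow(ref_pca))
  # Only the first ndim components take part in the neighbour search
  ref <- ref_pca[, seq_len(ndim), drop = FALSE]
  qry <- query_pca[, seq_len(ndim), drop = FALSE]
  vapply(seq_len(nrow(qry)), function(i) {
    # Squared Euclidean distance from query cell i to every reference cell
    d2 <- rowSums(sweep(ref, 2, qry[i, ], "-")^2)
    nn <- order(d2)[seq_len(k)]
    # Majority vote among the k nearest reference cells
    votes <- table(ref_labels[nn])
    names(votes)[which.max(votes)]
  }, character(1))
}

set.seed(1)
# Toy reference: two well-separated clusters in a 10-dimensional "PCA" space
ref_pca <- rbind(matrix(rnorm(100 * 10, mean = 0), ncol = 10),
                 matrix(rnorm(100 * 10, mean = 5), ncol = 10))
ref_labels <- rep(c("A", "B"), each = 100)
query_pca <- rbind(matrix(rnorm(5 * 10, mean = 0), ncol = 10),
                   matrix(rnorm(5 * 10, mean = 5), ncol = 10))

# Sweep k over the range mentioned above while keeping ndim fixed
for (k in c(5, 20, 50)) {
  pred <- knn_predict(ref_pca, ref_labels, query_pca, ndim = 10, k = k)
  cat("k =", k, ":", pred, "\n")
}
```

When the reference subtypes are well separated in the chosen components, as in this toy data, the vote is stable across k. Changing ndim alters the space in which neighbors are searched, which is why it is tied to the complexity of the reference rather than to the size of the query.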

Best, -m

liu-xingliang commented 2 years ago

Thank you, @mass-a, I see your point. I agree that ndim should depend on the complexity of the reference dataset, so that it provides enough "resolution" to project onto.

carmonalab / ProjecTILs
