choosing the optimal number of "Ks"

JEFworks-Lab / STdeconvolve

Reference-free cell-type deconvolution of multi-cellular spatially resolved transcriptomics data

http://jef.works/STdeconvolve/


choosing the optimal number of "Ks" #40

Closed by cathalgking 10 months ago

cathalgking commented 10 months ago

What is the best way to choose the optimal number of Ks (cell types) for a dataset? Is it just by observing the plot produced by the code below? How does one know what to set the upper limit to? For instance, in the example below there could be more than 9 cell types present.

ldas <- fitLDA(t(as.matrix(cd)), Ks = seq(2, 9, by = 1))

Also, my R session often crashes when running the above code.

bmill3r commented 10 months ago

Hi @cathalgking,

Thanks again for using STdeconvolve and for your questions! To provide some context, I'll point you towards a previous GitHub response:

https://github.com/JEFworks-Lab/STdeconvolve/issues/35#issuecomment-1501948275

In the example, a max K of 9 was chosen for speed. In practice, a higher K can be used if you suspect more than 9 cell types in the data.

As for your R session crashing, what kind of errors are you seeing, if any? Are you possibly running out of memory? This has sometimes happened to me with very large datasets when fitting multiple models. I believe there is a way to change the max memory limit of R.

Let me know if you still have follow-up questions. Hope this helps,
Brendan
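One way to explore a higher upper limit for K without fitting a model at every value is to scan a coarse grid first and then refine around the most promising K. The sketch below is only an illustration in base R: the grid bounds, the step size, and the helper `refine_Ks` are hypothetical and not part of STdeconvolve, and the approach is a general suggestion rather than something recommended in this thread. The resulting vectors would be passed as the `Ks` argument of `fitLDA()`, as in the call quoted above.

```r
# Coarse scan: fewer models to fit, so less time and memory per run.
# The bounds (2 to 20) and step (3) are arbitrary example values.
coarse_Ks <- seq(2, 20, by = 3)

# Hypothetical helper: every integer K strictly between the coarse
# neighbours of the best coarse K, never going below `lower`.
refine_Ks <- function(best, step, lower = 2) {
  seq(max(lower, best - step + 1), best + step - 1, by = 1)
}

# Suppose the coarse scan suggested K = 14 (made-up value).
fine_Ks <- refine_Ks(best = 14, step = 3)

print(coarse_Ks)
print(fine_Ks)

# With STdeconvolve loaded and `cd` defined as in the original question:
# ldas_coarse <- fitLDA(t(as.matrix(cd)), Ks = coarse_Ks)
# ldas_fine   <- fitLDA(t(as.matrix(cd)), Ks = fine_Ks)
```

Fitting fewer models per call may also help if the crashes are caused by memory pressure, although that depends on the dataset.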

cathalgking commented 10 months ago

I solved this, thanks @bmill3r

cathalgking commented 10 months ago

@bmill3r I notice that the opt parameter in the optimalModel() function can take the option "min". Does this mean that it takes the lowest perplexity value where alpha <1 (not in a grey region) of the fitLDA plot? Is this the easiest way to choose K?

My 4 samples seem to vary a lot in terms of what K to choose. For instance, sample A seems to have an optimal K at 5 or 6?

While sample B seems to have an optimal K at around 16?

Other than this plot, is there any other way to ascertain the best K per sample?

bmill3r commented 10 months ago

Hi @cathalgking,

I believe that "min" just selects the model with the lowest perplexity and does not take alpha into account. In theory it is the simplest option, but because it does not account for alpha or the number of rare cell types, it might not be the best one. The options include:

"kneed" = K vs perplexity inflection point
"min" = K corresponding to minimum perplexity
"proportion" = K vs number of cell types with mean proportion < 5% inflection point

None of these currently takes alpha into account, and whether they truly identify the optimal K can be dataset dependent. So really, I would recommend using the plots to help guide selection of K.

Hope this helps,
Brendan
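To make the difference between these selection rules concrete, here is a minimal base R sketch on made-up numbers. It does not call STdeconvolve and is not the package's implementation: the perplexity and rare-cell-type values are hypothetical, and the "knee" is found with a simple farthest-from-chord heuristic chosen only for illustration.

```r
# Hypothetical values per K (not output from STdeconvolve).
Ks         <- 2:9
perplexity <- c(310, 250, 228, 224, 222, 221, 220, 222)
num_rare   <- c(0, 0, 0, 0, 0, 1, 3, 4)  # cell types with mean proportion < 5%

# "min"-style rule: the K with the lowest perplexity.
k_min <- Ks[which.min(perplexity)]

# Rescale a vector to [0, 1] so both axes are comparable.
rescale <- function(x) (x - min(x)) / (max(x) - min(x))

# Simple knee heuristic: index of the point farthest from the straight
# line joining the first and last points of the rescaled curve.
knee_index <- function(x, y) {
  x <- rescale(x)
  y <- rescale(y)
  n <- length(x)
  d <- abs((y[n] - y[1]) * x - (x[n] - x[1]) * y + x[n] * y[1] - y[n] * x[1]) /
    sqrt((y[n] - y[1])^2 + (x[n] - x[1])^2)
  which.max(d)
}

# "kneed"-style rule: knee of the K vs perplexity curve.
k_knee <- Ks[knee_index(Ks, perplexity)]

# "proportion"-style rule: knee of the K vs number-of-rare-cell-types curve.
k_rare <- Ks[knee_index(Ks, num_rare)]

print(c(min = k_min, knee = k_knee, rare = k_rare))
```

On these made-up numbers the three rules pick different Ks, which is one reason inspecting the plots remains useful alongside any automatic choice.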

cathalgking commented 10 months ago

OK, thanks @bmill3r. So would you say (just from looking at the plots) that the best K would be ~6 for the first plot and ~16 for the bottom plot?

bmill3r commented 10 months ago

Hi @cathalgking,

Yes, looking at those plots I would say those are reasonable choices of K.

Brendan