Cluster labels when using a hard-coded prior

RGLab / flowClust

Bayesian flowClust


Cluster labels when using a hard-coded prior #26

Open carter-allen opened 1 year ago

carter-allen commented 1 year ago

In the vignette section Using Priors, the pfit2 object is supposed to have $K = 2$ mixture components, but when you check table(pfit2@label) you find that all observations are assigned to one cluster. However, according to the scatterplot of pfit2, there are 2 components. Is there a reason for the discrepancy?

gfinak commented 1 year ago

You ran the example locally and got a different result? The question is: are there reasons for this? Yes, there are. flowClust is not optimally maintained. I don't have time to devote to it like I have in the past. The package has seen three different authors and maintainers in its life so far, and the prior code found little use in practice. It is, in the end, research code. Lots of it should probably be rewritten in a more modern style. The use cases where I would trust the package are identifying populations in FSC/SSC space plus a few other markers. That's what has been most used and best maintained. Some day I'll get to rewriting it.

carter-allen commented 1 year ago

Hi, thanks for the response! It is actually not a discrepancy between the vignette and the results I get locally. I am able to reproduce the vignette results exactly. However, when I check table(pfit2@label) after the final line of the vignette, I find that all observations are assigned to a single mixture component, despite plot(pfit2, data = rituximab2) displaying two mixture components.

I've found the package to work quite well for the use cases you mentioned; however, I'd like to try to incorporate prior information. Would you recommend against using any non-default prior at this time?

Thanks in advance!

gfinak commented 1 year ago

I see. That sounds like a bug. It might be simple to resolve, but it might not. I don't have a Bioconductor dev environment available to me, and I wouldn't be able to get to investigating it for some time. The flowClust fit object also has a slot that holds the probability of each cell belonging to each component. The row-wise argmax of that can give you cell-level assignments, but it wouldn't account for outliers, as the label slot is supposed to, I believe.
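For anyone who wants to try the workaround, here is a minimal sketch of the row-wise argmax in base R. It assumes the posterior membership probabilities are available as an n-by-K matrix; in flowClust I believe this is the `z` slot (so `pfit2@z`), but that slot name is an assumption here and worth checking with `slotNames(pfit2)`. The helper name `argmax_labels` and the toy matrix are hypothetical, made up for illustration only.

```r
# Hypothetical helper: assign each row (cell) to the component with the
# highest posterior probability. Rows containing NA (e.g. cells that were
# filtered out before fitting) get an NA label.
argmax_labels <- function(z) {
  stopifnot(is.matrix(z))
  vapply(
    seq_len(nrow(z)),
    function(i) {
      p <- z[i, ]
      if (anyNA(p)) NA_integer_ else which.max(p)
    },
    integer(1)
  )
}

# Toy posterior matrix (made-up numbers, not flowClust output):
# 4 cells, K = 2 components.
z_toy <- rbind(
  c(0.9, 0.1),
  c(0.2, 0.8),
  c(NA,  NA),
  c(0.3, 0.7)
)

labels_toy <- argmax_labels(z_toy)

# Count cells per component, keeping NA rows visible.
counts_toy <- table(labels_toy, useNA = "ifany")

# With a fitted object this would be something like the following
# (assumes the probabilities live in the `z` slot):
# labels <- argmax_labels(pfit2@z)
# table(labels, useNA = "ifany")
```

As noted above, labels obtained this way ignore outlier handling, so they would not be expected to match what the label slot is meant to contain for cells flagged as outliers.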