Open ejarmand opened 2 days ago
Hi, I assume it’s easy to implement for all methods. It would just be a second loop. I won’t have the bandwidth to do it this month. For now, my recommendation would be to subset the genes beforehand, disable hvg selection, and run each gene set separately. I’m a bit confused though. How high is the expression for this single gene? How many cells of that type would you expect to have zero observed expression given Poisson sampling? I guess it might be that this single gene is an actual marker gene, but other differences in expression allow those cells to cluster distinctly. Does this make sense?
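To make the Poisson question concrete: if a gene's expected count in a cell is λ, the probability of observing zero counts is exp(-λ). A minimal sketch, assuming pure Poisson sampling with no overdispersion; the function name and the `depth_scale` parameter are hypothetical and not part of any package:

```python
import numpy as np

def expected_zero_fraction(mean_count_per_cell, depth_scale=1.0):
    """Fraction of cells expected to show zero counts for a gene.

    Assumes counts ~ Poisson(lam) with lam = mean_count_per_cell * depth_scale,
    so P(count == 0) = exp(-lam). Real data is often overdispersed, which
    would give more zeros than this.
    """
    lam = np.asarray(mean_count_per_cell, dtype=float) * depth_scale
    return np.exp(-lam)

# Compare a few illustrative mean expression levels (made-up values).
for lam in (0.1, 1.0, 3.0):
    print(lam, expected_zero_fraction(lam))
```

If the observed zero fraction within the cluster is far from this value, sampling noise alone is less likely to explain the split.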
Hi Can, I totally understand your thoughts on the single-gene clustering. This has come up a couple of times with collaborators, usually when sub-clustering a largely homogeneous cell type (I think it can also be exacerbated by choices in dimensionality reduction, and I have seen it enhanced by certain residual normalization procedures). It is probably not a realistic example when reference mapping an entire dataset at once. Sometimes there are reasonable correlates (e.g. sequencing depth) and sometimes there aren't.
Regardless, that was mostly meant as an unambiguous example of gene-selection effects rather than the primary use case.
I work primarily in brain tissues, where annotating subclusters is pretty common and seems to be even more sensitive to gene panel selection.
Description of feature
One of the primary drivers of sc analysis results is often marker gene selection. I would expect this to have a larger impact than algorithm choice in most cases. Sampling across the space of possible gene sets for integration would be very interesting and useful (I've seen multiple cases of clusters driven by a single gene).
For unsupervised methods in particular it should be pretty easy to implement.
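A rough sketch of what that outer loop could look like, assuming each method can be called on a cells x genes matrix and returns one label per cell. All names here (`sample_gene_sets`, `run_over_gene_sets`, `toy_method`) are hypothetical illustrations, not the package's API, and the random count matrix is synthetic:

```python
import numpy as np

def sample_gene_sets(n_genes, set_size, n_sets, seed=0):
    """Draw random gene subsets (column indices) without replacement."""
    rng = np.random.default_rng(seed)
    return [np.sort(rng.choice(n_genes, size=set_size, replace=False))
            for _ in range(n_sets)]

def run_over_gene_sets(X, method, gene_sets):
    """Outer loop: apply `method` to the matrix restricted to each gene set."""
    return [np.asarray(method(X[:, genes])) for genes in gene_sets]

def toy_method(X_sub):
    # Hypothetical stand-in for an unsupervised method:
    # split cells by the sign of their first principal component score.
    centered = X_sub - X_sub.mean(axis=0)
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    return (u[:, 0] > 0).astype(int)

# Synthetic counts: 50 cells x 200 genes.
rng = np.random.default_rng(1)
X = rng.poisson(2.0, size=(50, 200)).astype(float)

gene_sets = sample_gene_sets(X.shape[1], set_size=40, n_sets=5)
labelings = run_over_gene_sets(X, toy_method, gene_sets)
```

Comparing the resulting labelings across gene sets (for example with an adjusted Rand index) would then show how much the output depends on the gene panel.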
Edited: many -> multiple, swapped words