cole-trapnell-lab / garnett

Automated cell type classification
MIT License

Marker list complexity for Garnett classifier #60

Closed (hbandukw closed this issue 2 years ago)

hbandukw commented 2 years ago

Hello,

I am currently working on a marker list for generating and training a classifier with a dataset of ~350,000 cells. My marker list currently contains definitions for 25 cell types. For each cell type, I am leveraging a list of DEGs to define the set of genes that are expressed and not expressed. As you can imagine, the list can get quite long, so I was wondering: how complex can a cell type definition be?

Is it adequate for a few (<5) high-confidence genes to define a cell-type? Is there an upper limit on how long the list of genes (that describe a cell type) can be? In other words, what is the cost associated with having very fine-grained cell-type definitions?

hpliner commented 2 years ago

Hello, sorry for the delay. I have found that the algorithm works much better with a few very good markers (1-5) than with many weaker ones. There is no limit to the complexity or number of genes that can be included, but you'll likely have better luck on the smaller end.
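As a rough illustration of the "few very good markers" advice, here is a minimal Python sketch that trims a DEG table to the strongest few genes per cell type and prints them in the `>cell type` / `expressed:` marker file layout. The function names (`top_markers`, `format_marker_file`), the `(cell_type, gene, score)` input shape, and the example scores are all hypothetical and not part of Garnett; how to rank DEGs (fold change, specificity, and so on) is left to the user.

```python
from collections import defaultdict


def top_markers(deg_rows, max_per_type=3):
    """Keep the highest-scoring genes per cell type.

    deg_rows: iterable of (cell_type, gene, score) tuples, where a higher
    score means a stronger marker. This input shape is an assumption made
    for the sketch, not a Garnett format.
    """
    by_type = defaultdict(list)
    for cell_type, gene, score in deg_rows:
        by_type[cell_type].append((score, gene))

    selected = {}
    for cell_type, scored in by_type.items():
        # Sort by descending score; break ties alphabetically by gene name.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        selected[cell_type] = [gene for _, gene in scored[:max_per_type]]
    return selected


def format_marker_file(selected):
    """Render the selection as marker-file text, one block per cell type."""
    blocks = []
    for cell_type, genes in selected.items():
        blocks.append(f">{cell_type}\nexpressed: {', '.join(genes)}\n")
    return "\n".join(blocks)


# Illustrative rows only; the scores are made up for the example.
example_rows = [
    ("B cells", "MS4A1", 4.2),
    ("B cells", "CD79A", 3.9),
    ("B cells", "HLA-DRA", 1.1),
    ("T cells", "CD3D", 5.0),
    ("T cells", "CD3E", 4.7),
]

print(format_marker_file(top_markers(example_rows, max_per_type=2)))
```

Starting with a small `max_per_type` and raising it only when classification is poor mirrors the "start small and build up" approach.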

The main cost of lots of markers is that they may add noise, especially if there's overlap in expression between cell types. I've generally had the best luck starting small and building up to something more complex.
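One cheap sanity check before training is to look for genes that appear in the marker lists of more than one cell type. The sketch below is a hypothetical helper (`shared_markers` is not a Garnett function) and only inspects the lists themselves; it says nothing about overlap in actual expression, which has to be assessed against the data.

```python
from collections import defaultdict


def shared_markers(markers):
    """Return genes listed as markers for more than one cell type.

    markers: dict mapping cell type name -> list of marker genes
    (a hypothetical in-memory form of the marker list).
    Result: dict mapping each shared gene -> sorted list of cell types.
    """
    owners = defaultdict(set)
    for cell_type, genes in markers.items():
        for gene in genes:
            owners[gene].add(cell_type)
    return {gene: sorted(types) for gene, types in owners.items() if len(types) > 1}


# Illustrative marker lists only.
example_markers = {
    "CD4 T cells": ["CD3D", "CD4"],
    "CD8 T cells": ["CD3D", "CD8A"],
    "B cells": ["MS4A1"],
}

for gene, types in shared_markers(example_markers).items():
    print(f"{gene} is listed for: {', '.join(types)}")
```

A shared gene is not necessarily a mistake (closely related types can legitimately share a marker), but each hit is worth a second look when lists are built from long DEG tables.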

In your case, it sounds like you already have a dataset with cell type definitions. If that's the case, I recommend also giving the 'marker-free' version of Garnett a try. See here: https://cole-trapnell-lab.github.io/garnett/docs_m3/#1c-train-a-marker-free-classifier