cole-trapnell-lab / garnett

Automated cell type classification
MIT License

low cell count #16

Closed justinesjw closed 5 years ago

justinesjw commented 5 years ago

Hi!

Thanks for developing Garnett. It has been a great help to my research.

While using Garnett, I have been running into this error:

training_sample
     DC Macro_A Macro_B Unknown 
     87     500     489      50 
Model training finished.
training_sample
  HEP_1   HEP_2   HEP_3   HEP_4   HEP_5   HEP_6   Hep_7 Unknown 
      8      36      16       3      14      12       4      50 
The following cell types have few training examples. Be careful with interpretation
[1] "HEP_4" "Hep_7"
<simpleError in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,     nvars, jd, vp, cl, ne, nx, nlam, flmin, ulam, thresh, isd,     intr, vnames, maxit, kopt, family): one multinomial or binomial class has 1 or 0 observations; not allowed>
GLMNET failed, excluding low count cell type: HEP_4 and trying again
The following cell types have few training examples. Be careful with interpretation
[1] "Hep_7"
<simpleError in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,     nvars, jd, vp, cl, ne, nx, nlam, flmin, ulam, thresh, isd,     intr, vnames, maxit, kopt, family): one multinomial or binomial class has 1 or 0 observations; not allowed>
GLMNET failed, excluding low count cell type: Hep_7 and trying again
Model training finished.

I understand that this is caused by low cell counts, and it was easily fixable by merging some groups that have similar DE genes. However, looking at my marker ambiguity plot, the number of cells captured by the marker list is higher than the number that made it into training, and the markers do not overlap with those of other subclusters.

[Four images attached]

I was wondering if you can help me understand more about this problem?

Any help will be much appreciated.

Thanks, Justine

hpliner commented 5 years ago

Hi Justine,

Sorry for the confusion. The number of candidate cells reported by check_markers differs from the number actually used in training because check_markers uses a heuristic to estimate it. Some details: to identify training cells, Garnett calculates an aggregate marker score for each cell, i.e. a score for each cell type based on all of that cell type's marker genes. It then chooses cells that are at or above the 75th percentile of aggregate score for exactly one cell type. check_markers does the first part of this (calculating the aggregate marker score), both with and without each gene, as a heuristic measure of the effect of including each gene in the marker file, but it does not do the final step of choosing cells that have uniquely high marker scores.

Often, when cell types are very similar, many cells have high aggregate scores for multiple cell types, so they are not chosen for training even though they look like candidates in the check_markers plot. If you believe there should be more good candidate cells for your smaller populations (HEP_4 and Hep_7), one potential solution is to look for additional marker genes to add for them. This will sometimes raise a cell's aggregate score for one cell type enough for that cell to be included.
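
To illustrate the difference between the two counts, here is a minimal sketch of the selection rule described above. It is not Garnett's code: it is written in Python rather than R, the scores are made-up numbers rather than Garnett's real aggregate marker scores, and the names candidate_cells, training_cells, type_A and type_B are hypothetical. It also assumes that a "candidate" is simply a cell at or above the 75th percentile for a cell type, which is a simplification of what check_markers reports.

```python
from statistics import quantiles


def candidate_cells(scores_by_type):
    """Cells at or above the 75th percentile of aggregate score, per cell type."""
    candidates = {}
    for cell_type, scores in scores_by_type.items():
        # Third quartile cut point = 75th percentile of this type's scores.
        cutoff = quantiles(scores, n=4, method="inclusive")[2]
        candidates[cell_type] = {i for i, s in enumerate(scores) if s >= cutoff}
    return candidates


def training_cells(scores_by_type):
    """Keep only candidates that are high-scoring for exactly one cell type."""
    candidates = candidate_cells(scores_by_type)
    training = {}
    for cell_type, cells in candidates.items():
        # Cells that are also candidates for any other cell type are ambiguous.
        others = set().union(
            *(c for t, c in candidates.items() if t != cell_type)
        )
        training[cell_type] = cells - others
    return training


# Made-up aggregate scores for 8 cells and two similar cell types.
scores = {
    "type_A": [9, 8, 8, 1, 1, 1, 1, 1],
    "type_B": [9, 8, 1, 1, 1, 1, 1, 7],
}

candidates = candidate_cells(scores)
training = training_cells(scores)

print({t: len(c) for t, c in candidates.items()})  # {'type_A': 3, 'type_B': 2}
print({t: len(c) for t, c in training.items()})    # {'type_A': 1, 'type_B': 0}
```

In this toy example, cells 0 and 1 score highly for both types, so they count as candidates for each but are dropped from training. That leaves type_B with no training cells even though it appeared to have two candidates.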

Hope this helps. If you have further questions, feel free to reopen!