Irrationone / cellassign

Automated, probabilistic assignment of cell types in scRNA-seq data

Need more information on inputs to cellassign #50

Closed olechnwin closed 4 years ago

olechnwin commented 4 years ago

Hi @kieranrcampbell ,

I have the same error as in #44: "Error in cellassign(exprs_obj = sce, marker_gene_info = bone.markers.mat, : nrow(rho) == G is not TRUE".

I read the paper and I feel like I misunderstood the method. Can you please elaborate on what the input to cellassign should be? I have done pre-processing and clustering of cells using Seurat.

  1. Regarding marker_gene_info, the genes listed should come from prior knowledge of marker genes for each cell type that you expect to see in your data, correct?

  2. What should be in exprs_obj? Does it contain only the genes specified in marker_gene_info, with their raw counts? I don't get how cellassign can determine the cell types from just this small subset of genes, without taking into account the expression of other genes in the cells.

  3. Do we need to run computeSumFactors on exprs_obj or the entire gene expression counts?

Thanks in advance for your clarification.

kieranrcampbell commented 4 years ago

> Regarding marker_gene_info, the genes listed should come from prior knowledge of marker genes for each cell type that you expect to see in your data, correct?

Correct

> What should be in exprs_obj? Does it contain only the genes specified in marker_gene_info, with their raw counts? I don't get how cellassign can determine the cell types from just this small subset of genes, without taking into account the expression of other genes in the cells.

exprs_obj should correspond only to the marker genes listed in marker_gene_info

> Do we need to run computeSumFactors on exprs_obj or the entire gene expression counts?

Best to run it on the entire gene expression matrix before subsetting to just the marker genes for input to cellassign.

I'm going to

  1. Update the vignette to clarify these points as they're important
  2. Make that error more informative

Thanks
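The order of operations described above (size factors from the full matrix first, then subsetting to the marker genes) can be sketched as follows. This is a language-agnostic illustration in plain Python, not cellassign or scran code: the gene names, counts, and helper functions are hypothetical, and a simple library-size factor stands in for what computeSumFactors actually computes. The final check mirrors the `nrow(rho) == G` condition from the error message.

```python
# Toy raw-count matrix: rows are genes, columns are cells (all values hypothetical).
counts = {
    "CD3E":  [9, 0, 1],
    "CD19":  [0, 7, 0],
    "ACTB":  [50, 40, 30],
    "GAPDH": [41, 33, 29],
}

# Marker table (the role played by marker_gene_info):
# gene -> {cell type: 1 if the gene is a marker for that type, else 0}.
markers = {
    "CD3E": {"T cell": 1, "B cell": 0},
    "CD19": {"T cell": 0, "B cell": 1},
}


def library_size_factors(counts):
    """Per-cell totals over ALL genes, scaled to mean 1.

    A simple stand-in for computeSumFactors, used only for illustration.
    """
    n_cells = len(next(iter(counts.values())))
    totals = [sum(row[j] for row in counts.values()) for j in range(n_cells)]
    mean_total = sum(totals) / n_cells
    return [t / mean_total for t in totals]


def subset_to_markers(counts, markers):
    """Keep only marker-gene rows, in the same order as the marker table."""
    missing = [g for g in markers if g not in counts]
    if missing:
        raise ValueError(f"marker genes missing from the count matrix: {missing}")
    return {g: counts[g] for g in markers}


# Step 1: size factors from the full matrix, before any subsetting.
size_factors = library_size_factors(counts)

# Step 2: subset the raw counts to the marker genes only.
marker_counts = subset_to_markers(counts, markers)

# Analogue of nrow(rho) == G: one marker row per gene in the expression input.
assert list(marker_counts) == list(markers)
```

Computing the size factors after subsetting would base them on a handful of marker genes only, which is why the full matrix is used in step 1.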

olechnwin commented 4 years ago

@kieranrcampbell,

Thank you so much for your quick response. The exprs_obj should be raw counts, correct? Also, can you please give a brief intuition as to how the algorithm is able to determine the cell types based only on the expression of marker genes?

kieranrcampbell commented 4 years ago

Yup, exprs_obj should be raw counts.

In the methods section of the publication there's actually a paragraph on the intuition behind the model: "The intuition is that if gene g is a marker for cell type c,..." Let me know if anything is still unclear.

olechnwin commented 4 years ago

Thanks, @kieranrcampbell! Reading it a second time makes it clearer :-) So the expression of a marker gene g is compared across cells, and it should be highly expressed in the cells it is a marker of and not in other cells. In the beginning, I mistakenly thought that to determine whether or not gene g is highly expressed you need to know the expression of other genes in the same cell. BTW, congrats on publishing a great paper, and thanks again for responding to questions!

kieranrcampbell commented 4 years ago

Yes, exactly. But the beauty of probabilistic (generative) modelling is that you can write down a model of how you expect the world to work (in this case: if I'm a cell of a given cell type, I'm going to over-express the marker genes for that type), and then performing statistical inference automatically tells you which cells are of which cell types. So in a sense there's no actual "comparison".

Hope that helps
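To make the generative idea concrete, here is a deliberately simplified toy in plain Python. It is not the cellassign model: it assumes Poisson counts, a fixed base rate, a fixed marker fold change, and a uniform prior over cell types, all chosen here purely for illustration (the model in the paper is richer and learns its parameters from the data). The point is only that a cell's type falls out of Bayes' rule once you state how each type is expected to express the marker genes.

```python
import math

# Binary marker matrix (hypothetical): gene -> cell type -> 1 if marker, else 0.
rho = {
    "CD3E": {"T cell": 1, "B cell": 0},
    "CD19": {"T cell": 0, "B cell": 1},
}
cell_types = ["T cell", "B cell"]

BASE_RATE = 1.0    # assumed expected count of a non-marker gene at size factor 1
FOLD_CHANGE = 8.0  # assumed over-expression of a marker in its own cell type


def poisson_log_pmf(k, mean):
    """Log-probability of observing count k under a Poisson with the given mean."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def cell_type_posterior(cell_counts, size_factor):
    """P(cell type | marker counts) for one cell, under a uniform prior."""
    log_lik = {}
    for c in cell_types:
        total = 0.0
        for gene, k in cell_counts.items():
            # Generative assumption: markers of type c are over-expressed in type c.
            mean = size_factor * BASE_RATE * (FOLD_CHANGE if rho[gene][c] else 1.0)
            total += poisson_log_pmf(k, mean)
        log_lik[c] = total
    # Normalise in log space for numerical stability.
    m = max(log_lik.values())
    weights = {c: math.exp(v - m) for c, v in log_lik.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}


# One cell with raw counts on the two marker genes and its size factor.
post = cell_type_posterior({"CD3E": 9, "CD19": 0}, size_factor=1.25)
best = max(post, key=post.get)
```

Only marker-gene counts and the size factor enter the calculation; no other genes in the cell are consulted, which matches the point made in the thread.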

kieranrcampbell commented 4 years ago

Updated examples and docs to make it obvious only markers should be used in 176cb645e449f2110109c7f515e9756de3477027