Open dvenprasad opened 1 month ago
I'm not quite sure how much of a priority this should be. Part of including both methods and the report is actually to encourage users to validate the methods before blindly trusting the annotations we provide. The methods we are using are far from perfect at this point.
Also, I'm not sure which top genes they are referring to. Is it the top genes used to identify each of the cell types in the first place? For CellAssign, this would mean plotting the marker genes used, which for some cell types is 100s of genes. For SingleR, this would mean pulling out the gene lists from the SingleR reference object. Again, the number of genes could be in the 100s.
Alternatively, it could mean taking all of the cells assigned to a specific cell type, performing marker gene analysis to identify the genes associated with those cells vs. all other cells in the dataset, and then plotting the top genes from that analysis. If you have knowledge about the cell types, you could then check whether the genes showing up match the marker genes you expect for that cell type. I think this is much more reasonable and feasible to implement. However, it still relies on the user to know which genes are expected and to explore the data to validate cell types on their own. I think it's something we could add, but I also don't think it's a priority.
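To make the one-vs-rest idea concrete, here is a minimal sketch in Python. It is not what the pipeline does and not a proposal for the implementation: it assumes a log-normalized expression matrix and one cell type label per cell, and it ranks genes by a plain difference in mean expression rather than a proper statistical test. The function name `top_genes_per_cell_type` and the toy data are made up for illustration.

```python
import numpy as np


def top_genes_per_cell_type(logcounts, cell_types, gene_names, n_top=10):
    """Rank genes per cell type by one-vs-rest difference in mean expression.

    logcounts:  (n_cells, n_genes) array of log-normalized expression (assumed)
    cell_types: length n_cells sequence of cell type labels
    gene_names: length n_genes sequence of gene identifiers
    """
    logcounts = np.asarray(logcounts, dtype=float)
    cell_types = np.asarray(cell_types)
    gene_names = np.asarray(gene_names)

    top = {}
    for cell_type in np.unique(cell_types):
        in_group = cell_types == cell_type
        if in_group.all():
            continue  # no "other" cells to compare against
        # Mean expression in this cell type minus mean expression in all other cells
        diff = logcounts[in_group].mean(axis=0) - logcounts[~in_group].mean(axis=0)
        # Indices of the n_top genes with the largest positive difference
        order = np.argsort(diff)[::-1][:n_top]
        top[str(cell_type)] = gene_names[order].tolist()
    return top


# Toy data (hypothetical): 40 cells, 6 genes, two annotated cell types
rng = np.random.default_rng(0)
genes = [f"gene{i}" for i in range(6)]
labels = np.array(["T cell"] * 20 + ["B cell"] * 20)
expr = rng.normal(1.0, 0.1, size=(40, 6))
expr[:20, 0] += 3.0  # gene0 is elevated in the cells labeled "T cell"
expr[20:, 3] += 3.0  # gene3 is elevated in the cells labeled "B cell"

result = top_genes_per_cell_type(expr, labels, genes, n_top=2)
print(result)
```

The returned lists are what would be plotted for each cell type. A real version would likely use an established marker detection method and account for effect size and significance, and the user would still need to know which genes to expect for each cell type.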
@dvenprasad The science team discussed this and we agree that this will require more analysis and exploration and doesn't quite make sense at this point. So for now, we will hold off on implementing this.
Yes, that sounds reasonable to me. These questions came up with the more computationally savvy folks and they also have the skills to extract the genes themselves, so this isn't "hindering" anyone.
To clarify your question:
"Also, I'm not sure which top genes they are referring to. Is it the top genes used to identify each of the cell types in the first place?"
This came up when they were looking at the cell types on the UMAP. They want to know what the top genes are for the data points in each of the cell types colored on the UMAP. So yes, I think we are talking about the same thing.
Context
During eval, users looking at the SingleR and CellAssign cell type predictions were confused because the two sets of predictions differed so much. They were trying to figure out which one they could trust. Some even said they would just run it with their own cell typing method.
Problem or idea
They said they would like to see the top 10/15 genes for each cell type cluster on the UMAP, so they can validate the calls made by the cell typing methods.
Solution or next step
Tagging @allyhawkins / @jashapiro for feasibility and a go/no-go decision