AlexsLemonade / scpca-portal

Single-cell Pediatric Cancer Atlas Portal is a growing database of uniformly processed single-cell data from pediatric cancer tumors and model systems
https://scpca.alexslemonade.org
BSD 3-Clause "New" or "Revised" License

Gene Lists from cell type clusters #760

Open dvenprasad opened 1 month ago

dvenprasad commented 1 month ago

Context

During eval, users looking at the SingleR and CellAssign cell type predictions were confused because the two sets of predictions differed so much. They were trying to figure out which one they could trust. Some even said they would just run their own cell typing method instead.

Problem or idea

They said they would like to see the top 10/15 genes for each cell type cluster on the UMAP, so they can validate the calls made by the cell typing methods.

Solution or next step

Tagging @allyhawkins / @jashapiro for feasibility and a go/no-go decision.

allyhawkins commented 1 month ago

They said they would like to see the top 10/15 genes for each cell type cluster on the UMAP, so they can validate the calls made by the cell typing methods.

I'm not quite sure how much of a priority this should be. Part of including both methods and the report is actually to encourage users to validate the methods before blindly trusting the annotations we provide. The methods we are using are far from perfect at this point.

Also, I'm not sure which top genes they are referring to. Is it the top genes used to identify each of the cell types in the first place? For CellAssign this would mean plotting the marker genes used, which for some cell types is 100s of genes. For SingleR this would mean pulling out the gene lists from the SingleR reference object. Again, the number of genes could be in the 100s.

Alternatively, it could mean taking all of the cells assigned to a specific cell type, performing marker gene analysis to identify the genes associated with those cells vs. all other cells in the dataset, and then plotting the top genes from that analysis. If you have knowledge about the cell types, you could then check whether the genes that show up match the marker genes you expect for that cell type. I think this is much more reasonable and feasible to implement; however, it still relies on the user knowing which genes are expected and exploring the data to validate cell types on their own. I think it's something we could add, but I also don't think it's a priority.
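
As a rough illustration of the one-vs-rest idea described above, here is a minimal Python sketch. It is not the portal's pipeline: the function name `top_marker_genes`, the toy expression values, and the simple "difference in mean expression" score are all assumptions made for illustration. A real analysis would more likely use an established marker detection function (for example `findMarkers` in scran or `rank_genes_groups` in Scanpy) with proper statistical testing.

```python
from collections import defaultdict
from statistics import mean


def top_marker_genes(expression, labels, n_top=10):
    """Rank genes per cell type by a simple one-vs-rest score.

    expression: dict mapping gene name -> list of expression values, one per cell
    labels: list of cell type labels, one per cell, in the same cell order
    Score (an illustrative assumption): mean expression in the cell type
    minus mean expression in all other cells.
    """
    # Group cell indices by their assigned cell type
    cells_by_type = defaultdict(list)
    for idx, label in enumerate(labels):
        cells_by_type[label].append(idx)

    markers = {}
    for cell_type, in_idx in cells_by_type.items():
        in_set = set(in_idx)
        out_idx = [i for i in range(len(labels)) if i not in in_set]
        if not out_idx:
            continue  # one-vs-rest is undefined when there is only one cell type

        scores = []
        for gene, values in expression.items():
            diff = mean(values[i] for i in in_idx) - mean(values[i] for i in out_idx)
            scores.append((gene, diff))

        # Highest score first, then keep the top n genes
        scores.sort(key=lambda pair: pair[1], reverse=True)
        markers[cell_type] = scores[:n_top]
    return markers


# Hypothetical toy data: 4 genes x 6 cells, values are made up
expression = {
    "CD3E": [5.0, 4.5, 0.1, 0.0, 0.2, 0.1],
    "CD19": [0.0, 0.2, 4.8, 5.1, 0.1, 0.0],
    "LYZ": [0.1, 0.0, 0.2, 0.1, 6.0, 5.5],
    "ACTB": [3.0, 3.1, 2.9, 3.0, 3.1, 3.0],
}
labels = ["T cell", "T cell", "B cell", "B cell", "Monocyte", "Monocyte"]

markers = top_marker_genes(expression, labels, n_top=2)
for cell_type, genes in markers.items():
    print(cell_type, [gene for gene, _ in genes])
```

With the toy data, the top-ranked gene for each label is the one a user would expect for that cell type, which is the kind of sanity check the request is after. It still depends on the user knowing which genes to expect.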

allyhawkins commented 1 month ago

@dvenprasad The science team discussed this and we agree that this will require more analysis and exploration and doesn't quite make sense at this point. So for now, we will hold off on implementing this.

dvenprasad commented 1 month ago

Yes, that sounds reasonable to me. These questions came up with the more computationally savvy folks and they also have the skills to extract the genes themselves, so this isn't "hindering" anyone.

To clarify your question:

Also, I'm not sure which top genes they are referring to. Is it the top genes used to identify each of the cell types in the first place?

This came up when they were looking at the cell types on the UMAP. They want to know what the top genes are for the data points in each of the cell types colored on the UMAP. So yes, I think we are talking about the same thing.