Test Cell Assign against known cell types in sarcoma

allyhawkins commented 1 year ago

Closes #218

This PR does some additional testing regarding CellAssign to specifically look at how CellAssign performs in a non blood cell tissue type at assigning cell types. To address this, I started with a library from the Rhabdoymyosarcoma project where we have defined cell type annotations from the submitter. I wanted to compare the cell type annotations that we have to those assigned using CellAssign with two different marker gene lists.

I used the following marker gene lists as references:

CellMarker2.0 - In looking at the methods of the manuscript, the marker list here was compiled based on manual literature searches. Here I used the Human cell markers file since it also contains the contents of the single cell only markers. The one thing I did like about this list is that we have cell ontology ID's for the cell types.
PanglaoDB - Here I downloaded using the get tsv file button which grabs the entire marker gene list. This list appears less comprehensive and we only have gene symbols. Also it hasn't been updated since 2020 so that's a downfall.

To use CellAssign, the marker gene lists need to be transformed into a gene by cell type binary matrix to indicate which genes belong to which cell type. I did this for both lists and then also added an "other" column where no genes are found. Any cells assigned to "other" would be the same as saying none of these cell types fit so we can't assign it.

After running CellAssign with both references I compared the new cell annotations to the original using heatmaps. I did one where I grouped all tumor together and one where I broke up the tumor because I did notice that CellAssign was struggling with the tumor cells and does much better with normal cell types. This doesn't really surprise me though. With CellMarker2.0 the tumor cells are assigned as macrophages though and none of those genes are really expressed in the dataset, so that does worry me.

The last thing I did was definitely a bit circular, but my hope was that it would show me strong agreement between the assignments from CellAssign and original assignments. I used scran::FindMarkers() to identify the marker genes for each cell type assigned in our original dataset. Then I took the top 10 for each cell type to create the reference gene by cell type binary matrix. When I used that as a reference with CellAssign I saw much more consistent agreement between the cell type annotations, particularly with the normal cell types. However, I still noticed that CellAssign struggled with the tumor cell types and classified most of them as other when they should have been Tumor_myoblast.

A few things I learned from this and where I think we should go next:

CellAssign does like to not assign cell types when it doesn't think any of the cell types available are appropriate BUT this is still dependent on having a good quality reference.
Classifying tumor cells is hard (should we be using CNV inference...)
The time it takes to run CellAssign increases with the number of marker genes used/ size of the reference. In our initial testing we had ~ 5-10 genes and 4 cell types it took 2 minutes to run. These marker gene matrices had up to 260 genes which took an hour to run 😬 I was running on just 1 cpu though so I would assume it gets faster as we increase that.
No matter how great the method, if the reference is bad, the assignment is bad. I personally feel comfortable enough to continue with implementing CellAssign and SingleR with reporting two annotations. I think where we are going to need to do some tweaking is on what the appropriate references will be for each dataset. I also think the lists available on either CellMarker/ Panglao are good starting points but could be pared down and catered to each of our projects. Regardless of this... tumor cells are just going to be difficult...

One small note is that when doing this I turned the cell-assign.py into a real script with arguments and changed how we run it in the previous notebook as well. I also have the python calls commented out in this notebook for rendering because it takes a long time to run locally.

Here's the report: cell-assign-sarcoma.nb.html.zip

allyhawkins commented 1 year ago

My other questions are a bit more about what the references look like before we try to use them for cell typing. How many markers are we using for some of these? Can that explain our lack of power? Making some tables of the cell types & numbers of markers from the marker DBs could be informative here, I think.

To address this I added tables to show the number of markers for each cell type in each of the databases. The number of markers seems to be as low as 1 and as high as 180, depending on the cell type and database. It looks like overall PangaloDB as more marker genes per cell type, at least for the muscle and connective tissues.

One further question: if you look at the cellontology_ids for the cancer types within muscle, do they show up anywhere else? As in, are we losing data by filtering to muscle first, when other datasets might give different markers for the same cell ontology id? This seems less likely, but it could be a concern.

I also added tables that might clarify the inclusion of the cancer types in the marker gene database for CellMarker. I looked at all the cancer types that contain the word "sarcoma" and made a table of the cell types and associated marker genes. We do see that some of the cell types are duplicated across cancers and have different numbers of marker genes. This tells me that the marker genes used for a myoblast in Osteosarcoma are not the same as in RMS, which concerns me.

Can you label the "macrophages" on this plot? I kind of doubt they are where this gene is expressed. It's all very odd.

I updated the plot showing the expression of the macrophage marker genes to be split by cell type. In addition to that, I chose to look at some of the marker genes for Myoblasts, since that cell type is also included in the CellMarker reference we used but no cells are assigned to it. I took a look at the paper with the original RMS data and they had pulled out MSC, MYF5, and PAX7 as myoblast markers in their dataset. These markers are also included in the marker gene list for myoblasts from CellMarker. I took a look at the expression of those genes and noticed that the expression seemed fairly low in this particular sample. Maybe we should consider running this on other samples to see how consistent these results are.

I might also move the retyping from submitter cell types to a different notebook: I feel like the exploration there is of a different character than the two published databases.

I did break this up into two notebooks and I'm going to file a separate PR for using the marker genes as a reference. In doing this, I needed to break out the functions for getting the cell assignments from the CellAssign predictions to a separate file I can source into both notebooks.

Attached is an updated version of the notebook: cell-assign-sarcoma.nb.html.zip

allyhawkins commented 1 year ago

I made a few more changes based on your most recent comments so this is ready for another round of review.

I provided detailed information on the marker gene lists for each database, including how they were created and organized. Or at least I filled in as much as I could gather from the websites and papers.
Instead of UMAPs, I utilized scater::plotExpression to display the gene expression of marker genes among different cell types.
I added a conclusion section with some general comments on the findings. I, too am worried about the macrophage assignment and find it a little concerning. If we combined all the macrophage markers from all tissue types in CellMarker2.0, would CellAssign do a better job? But I worry we would get the same result as we did with PanglaoDB, where most of the cells are designated as "other".

Attached is the updated notebook: cell-assign-sarcoma.nb.html.zip

allyhawkins commented 1 year ago

I checked the order before adding the cell type assignments to the processed_sce and re-ran the notebook. We are still getting the same results, so unfortunately, I don't think that was it.

I wonder if this is something we can try to work around? Can we use the tissue type to get our cell type list, but then use all markers for that cell type regardless of tissue? That seems potentially dicey, but also not crazy to me. I don't know how to evaluate it, except by feel.

I don't think this idea is that wild, but I'm unsure how we would evaluate it (like you said). I think I like the approach of combining organs from PanglaoDb rather than altering the marker gene lists. I don't have a strong rationale, but not altering the lists seems safer, considering we don't fully understand how they were created. We would lose out on the ontology ID benefit of CellMarker, which is unfortunate.

jashapiro commented 1 year ago

We would lose out on the ontology ID benefit of CellMarker, which is unfortunate.

I actually think we can probably do a pretty good mapping to cell ontology if we want to do this. We might have to do some manual checking, but I bet we can make much of it pretty automatic with a little effort

AlexsLemonade / sc-data-integration

Test Cell Assign against known cell types in sarcoma #221