AlexsLemonade / sc-data-integration


Dive into SingleR and comparing references with and without immune cells #210

Closed by allyhawkins 1 year ago

allyhawkins commented 1 year ago

Begins to address #204

This PR begins some exploration of cell type assignment with SingleR. Specifically, I spent some time comparing cell type assignments for some AML samples between references with and without immune cells. In doing this, I also looked at the statistics that SingleR reports and how we might use them to compare across samples, specifically the score and the delta.

Some things to keep in mind as you go through this notebook:

  1. Other than the initial stacked bar charts, where I looked at the overall cell type labels assigned across each reference for each library we have, I mainly focused on the Blueprint and HPCA references. The reason is that those are the ones for which we have a corresponding reference where we removed immune cell types before building/training the model and running SingleR (see the sketch after this list). I think the main question this notebook serves to answer is how SingleR annotates cells when we have a negative or incorrect reference. There are follow-up questions, such as which of the immune cell type references is the correct or best one to use, but that is not what I'm after here. You will see that each reference gives a slightly different annotation for the most part; that is due to differences in the original references, so I didn't want to get too far into that here.
  2. For some plots I looked at all libraries, using purrr to generate the same plot for each library, but for a few I made the plot for a single library only, mostly because the notebook was getting long and similar patterns were seen across all libraries.
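For the curious, here is a minimal sketch of the with/without-immune comparison (not the exact notebook code; `sce` is a placeholder for a normalized SingleCellExperiment, and the immune label list is illustrative, not the set we actually removed):

```r
library(celldex)
library(SingleR)

ref <- celldex::BlueprintEncodeData()

# Hypothetical set of immune labels to drop; the real list used in the
# notebook may differ.
immune_labels <- c("B-cells", "CD4+ T-cells", "CD8+ T-cells", "NK cells",
                   "Monocytes", "Neutrophils", "Macrophages", "Eosinophils")
ref_no_immune <- ref[, !ref$label.main %in% immune_labels]

# `sce` is assumed to be a normalized SingleCellExperiment for one library.
full_result <- SingleR(test = sce, ref = ref, labels = ref$label.main)
no_immune_result <- SingleR(test = sce, ref = ref_no_immune,
                            labels = ref_no_immune$label.main)

# Cross-tabulate assignments to see where cells go when immune labels
# are unavailable.
table(full = full_result$labels, no_immune = no_immune_result$labels)
```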

Some overall conclusions and things that I learned here:

  1. SingleR really wants to make an assignment, regardless of whether it's right. If there are cell types in the reference, and those cell types have marker genes that are correlated with cells in our test dataset, then SingleR will assign them. Also, the score metric is not easily compared across samples, as it is strongly influenced by batch effects. As noted in the annotation diagnostics section of the SingleR book, the magnitude of the differences between scores has no direct interpretation; only the pre-tuned scores are comparable, and only across labels, not across samples.
  2. The magnitude of the delta, i.e., the difference between the score for the assigned label pre-fine-tuning and the median score across all labels, is less sensitive to batch effects and might be a good metric for comparing cell type annotation across samples to identify a "good" annotation (and maybe for the label shuffling test); see the sketch after this list. However, with a more granular reference like HPCA, which seemed to toggle cells between CMP/GMP progenitor states, the delta was much lower because cells got high scores for both of those cell types. So again... the reference is really important.
  3. This may seem obvious, but I think the biggest influence on cell type assignment is the marker gene lists and how they change from reference to reference. I noticed that removing immune cells actually changed the marker gene lists for the non-immune cells: the adipocyte and endothelial cell marker gene lists now contained genes that were previously markers for HSCs or monocytes when immune cells were included, which probably explains why SingleR now classifies cells as these other non-immune cell types.
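Here is a minimal sketch of the delta computation from point 2 (`result` is assumed to be a SingleR result DataFrame):

```r
library(matrixStats)

# Cells x labels matrix of pre-fine-tuning scores.
scores <- result$scores

# Score of the label assigned before fine-tuning, for each cell.
assigned_score <- scores[cbind(seq_len(nrow(scores)),
                               match(result$first.labels, colnames(scores)))]

# The delta: assigned score minus the median score across all labels.
delta_median <- assigned_score - matrixStats::rowMedians(scores)
summary(delta_median)
```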

One big thought I had for a future analysis is to try some of the methods for combining references suggested by SingleR. That may help us identify a consensus for immune cell types, rather than dealing with 4 separate references that all have slightly different levels of annotation.
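SingleR supports this directly by passing lists of references and labels; a minimal sketch with two celldex references (`sce` is again a placeholder for a normalized SingleCellExperiment):

```r
library(SingleR)
library(celldex)

blueprint <- celldex::BlueprintEncodeData()
hpca <- celldex::HumanPrimaryCellAtlasData()

# Passing lists makes SingleR score cells against each reference and then
# pick the best label per cell across references.
combined <- SingleR(
  test = sce,
  ref = list(Blueprint = blueprint, HPCA = hpca),
  labels = list(blueprint$label.main, hpca$label.main)
)
table(combined$labels)
```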

Here's a copy of the current rendered notebook for review: SingleR-non-immune-ref-comparison.nb.html.zip

sjspielman commented 1 year ago

@allyhawkins Just want to give you a sense of my time here: I am almost certainly not going to be able to get to this until after Research Focus next week, if that changes anything for you (though I don't know that others' availability is necessarily better!)

allyhawkins commented 1 year ago

I went ahead and requested a review from @jashapiro as well, but really, whoever is able to review is fine with me.

allyhawkins commented 1 year ago

@jashapiro Thanks so much for the feedback and for the helpful discussion last week. I also discussed with Jackie what kinds of things we would like to look at and made changes accordingly. In general, the goal here is to think through what we can show in a QC report to help users evaluate whether the reference(s) we used are appropriate.

I did a pretty big revamp of this based on feedback and made the following changes:

Here's a copy of the latest rendered notebook: SingleR-non-immune-ref-comparison.nb.html.zip

allyhawkins commented 1 year ago

Thanks for the comments @jashapiro. I made the smaller changes but didn't make any real plot changes, because I think some of them are minor comments we could address when including those plots in the QC report. For the heatmap comment: if we are going to include a heatmap of scores in the QC report, I think we would want to create our own custom heatmap so we can make sure the scales are the same.
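To make that concrete, here's a rough sketch using pheatmap (an assumption; we might pick a different package) with fixed breaks so every heatmap shares the same color scale:

```r
library(pheatmap)

# Cells x labels score matrix from a SingleR result (`result` is assumed).
scores <- result$scores

# Fix the breaks to a common range so color is comparable across heatmaps;
# the [0, 1] range is an assumption about where the scores fall.
breaks <- seq(0, 1, length.out = 101)

pheatmap(t(scores),
         color = colorRampPalette(c("white", "darkblue"))(100),
         breaks = breaks,
         show_colnames = FALSE)  # hide per-cell column names
```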

Regarding the comments about looking into the delta.next and why it might be higher than expected: I did a little digging into the codebase, but I'm still a little confused, partly because the actual code that calculates and returns the delta.next is C++, which is not my forte. I can dig in more if we need to, but I think we've made the decision to focus on the delta median distribution (or other similar distributions). For your reference, though, this is the part of the classifySingleR code where the delta is calculated, and it refers to a function found in the RcppExports.R file. Looking at the actual function, I think this line might be responsible for getting the delta.next, but I would honestly have to do a lot more googling about C++ to figure that out for sure: https://github.com/LTLA/SingleR/blob/5e4daf8b3db5068d8ed317ec42d5d13df1edd7ca/src/run.cpp#L28
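For what it's worth, here is a plain-R sketch of what we'd expect delta.next to mean if it were computed from a cells x labels score matrix (the real value is computed during fine-tuning in C++, so this is the expected definition, not a re-implementation):

```r
# `scores` is assumed to be the cells x labels score matrix from a SingleR
# result. For each cell: top score minus the second-highest score.
expected_delta_next <- apply(scores, 1, function(x) {
  sorted <- sort(x, decreasing = TRUE)
  sorted[1] - sorted[2]
})
summary(expected_delta_next)
```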

The last thing I wanted to address is your comment about looking at different ways of computing a delta score. I think we do want to do this, but maybe outside this PR, in a separate notebook.

jashapiro commented 1 year ago

> Regarding the comments about looking into the delta.next and why it might be higher than expected: I did a little digging into the codebase, but I'm still a little confused, partly because the actual code that calculates and returns the delta.next is C++, which is not my forte. I can dig in more if we need to, but I think we've made the decision to focus on the delta median distribution (or other similar distributions).

Yeah, I just tried to look at that code, and it is pretty opaque! Finding where the delta code actually lives among a mess of C++ references is not easy at all! Not to mention this is not bare C++ but Rcpp, which has its own little nuances.

But even worse, I think the real code lives in another repo: https://github.com/LTLA/singlepp/blob/99b5c62d9c65703e30b29feb449e07dcde8519ff/tests/src/naive_method.h#L50-L51

(Which does seem to be computing the delta.next as expected.) So I'm not sure why the median we are getting is smaller. I think I might look at some of the cases where delta.next is particularly large and see what the distribution of scores looks like for those cases.
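A quick sketch of what that check could look like (`result` is assumed to be a SingleR result DataFrame; the 0.99 quantile cutoff is arbitrary):

```r
library(ggplot2)

# Cells in the top 1% of delta.next values.
big_delta <- which(result$delta.next >
                     quantile(result$delta.next, 0.99, na.rm = TRUE))

# Long-format data frame: every label's score for each selected cell.
score_df <- data.frame(
  cell = factor(rep(big_delta, each = ncol(result$scores))),
  score = as.vector(t(result$scores[big_delta, , drop = FALSE]))
)

# One boxplot of scores per cell, to see how spread out the scores are
# when delta.next is large.
ggplot(score_df, aes(x = cell, y = score)) +
  geom_boxplot() +
  labs(x = "cell index", y = "SingleR score")
```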