Trying to reproduce some figures

Dr. Yang (@PYangLab),

I was presenting on the CiteFuse paper for a journal club, and had a question about some of the figures. In particular, the UMAP embeddings from the joint graph.

In the paper, there is a comparison of the effects of doublet detection on the UMAP embeddings and clustering of the dataset. Something that caught my attention here was how different the UMAP plots generated following CiteFuse doublet detection looked compared to those from any other method. One of the most striking differences here was how a group of CD4+ cells was completley seperate in the CiteFuse plots, but not for any other method (other than no-filtering). Additionly, all the other methods' plots look pretty equivalent. This appears in Supp. fig 4:

In my experience, removing a few more cells shouldn't have this strong of an effect on the embedding. So, I tried to reproduce these figures using the code in this repo. I couldn't find the exact code that was used to make the joint graphs for each filtered dataset, so I used code from your vignette. I've included what I ran in the collapsed section below. Due to processing time, I only ran generated fused graphs for 3 of the datasets.

Code to reproduce

```r # Setup load("data/doublet_labels.RData") load("data/Unfiltered.RData") library(CiteFuse) library(scater) library(SingleCellExperiment) library(DT) library(gridExtra) library(purrr) # Make SingleCellExperiment object datasets = list( RNA=as.matrix(rna_mat_control), ADT=as.matrix(adt_control), HTO=as.matrix(hto_control) ) sce = preprocessing(datasets) # DoubletFinder has it's values stored as strings, so I convert it to bool doublet.labels = lapply(doublet.labels, as.logical) # Make subsets, only used some since the fusion takes a while sce_subsets = lapply( doublet.labels[c("Unfiltered", "CiteFuse", "DoubletFinder")], function (labels) {sce[, !labels]} ) ``` ```r # Fuse graphs and compute UMAP embeddings compute_umap = function(sce) { sce <- scater::logNormCounts(sce) sce <- normaliseExprs(sce, altExp_name = "ADT", transform = "log") print("Starting to compute citefuse...") print(system.time(sce <- CiteFuse(sce))) print("starting to compute umap") sce = reducedDimSNF(sce, method = "UMAP", dimNames = "UMAP") sce } sce_processed = lapply(sce_subsets, compute_umap) ```

Output from that process

``` [1] "Starting to compute citefuse..." Calculating affinity matrix Performing SNF user system elapsed 5605.562 22.790 5638.442 [1] "starting to compute umap" [1] "Starting to compute citefuse..." Calculating affinity matrix Performing SNF user system elapsed 3762.252 16.996 3787.311 [1] "starting to compute umap" [1] "Starting to compute citefuse..." Calculating affinity matrix Performing SNF user system elapsed 3658.019 17.448 3683.157 [1] "starting to compute umap" ```

Now generating plots: ```r plots = imap( sce_processed, function (sce, k) { p = visualiseDim( sce, dimNames = "UMAP", colour_by = "CD4", data_from = "altExp", altExp_assay_name = "logcounts", ) + labs(title=k) } ) grid.arrange(plots$Unfiltered, plots$DoubletFinder, plots$CiteFuse, ncol=3) ```

Here are the plots I get:

Since it's been a while since I've used R heavily, I couldn't figure out how to deal with the overplotting, so some points are hidden here. Hovever, I think I can say these plots show different results to what's in the paper. These show a pretty similar plots being generated by CiteFuse and DoubletFinder, with no clear differences in the seperation of CD4+ cells. I'd also note that the embedding which changed the most is the CiteFuse one, which is important since it features heavily in the rest of the paper.

I would like to see your comments on this. How did our plots turn out so different? Was there a signifigant part of the process I changed? In addition, do you think this would have implications for any of the other analyses in the paper - e.g. the clustering analysis?

In case these help with diagnoses on your end:

Here's a link to the SingleCellExperiment objects as an .rds file. And here's some info about the environment I used:

sessionInfo

``` R version 4.0.0 (2020-04-24) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Catalina 10.15.4 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] stringr_1.4.0 purrr_0.3.4 gridExtra_2.3 DT_0.13 [5] scater_1.16.1 ggplot2_3.3.0 SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.1 [9] DelayedArray_0.14.0 matrixStats_0.56.0 Biobase_2.48.0 GenomicRanges_1.40.0 [13] GenomeInfoDb_1.24.0 IRanges_2.22.2 S4Vectors_0.26.1 BiocGenerics_0.34.0 [17] CiteFuse_1.0.0 loaded via a namespace (and not attached): [1] ggbeeswarm_0.6.0 Rtsne_0.15 colorspace_1.4-1 ggsignif_0.6.0 ellipsis_0.3.1 [6] rio_0.5.16 ggridges_0.5.2 XVector_0.28.0 BiocNeighbors_1.6.0 rstudioapi_0.11 [11] ggpubr_0.3.0 farver_2.0.3 graphlayouts_0.7.0 ggrepel_0.8.2 RSpectra_0.16-0 [16] splines_4.0.0 knitr_1.28 heatmap.plus_1.3 polyclip_1.10-0 alluvial_0.1-2 [21] broom_0.5.6 kernlab_0.9-29 pheatmap_1.0.12 uwot_0.1.8 ggforce_0.3.1 [26] ExPosition_2.8.23 compiler_4.0.0 dqrng_0.2.1 prettyGraphs_2.1.6 backports_1.1.7 [31] assertthat_0.2.1 Matrix_1.2-18 limma_3.44.1 tweenr_1.0.1 BiocSingular_1.4.0 [36] htmltools_0.4.0 tools_4.0.0 rsvd_1.0.3 igraph_1.2.5 gtable_0.3.0 [41] glue_1.4.1 GenomeInfoDbData_1.2.3 reshape2_1.4.4 dplyr_0.8.5 Rcpp_1.0.4.6 [46] carData_3.0-4 cellranger_1.1.0 vctrs_0.3.0 nlme_3.1-147 DelayedMatrixStats_1.10.0 [51] ggraph_2.0.3 xfun_0.14 openxlsx_4.1.5 lifecycle_0.2.0 irlba_2.3.3 [56] statmod_1.4.34 rstatix_0.5.0 edgeR_3.30.0 zlibbioc_1.34.0 MASS_7.3-51.6 [61] scales_1.1.1 tidygraph_1.2.0 hms_0.5.3 rhdf5_2.32.0 RColorBrewer_1.1-2 [66] SNFtool_2.3.0 yaml_2.2.1 curl_4.3 segmented_1.1-0 stringi_1.4.6 [71] randomForest_4.6-14 scran_1.16.0 zip_2.0.4 BiocParallel_1.22.0 rlang_0.4.6 [76] pkgconfig_2.0.3 bitops_1.0-6 lattice_0.20-41 Rhdf5lib_1.10.0 labeling_0.3 [81] htmlwidgets_1.5.1 cowplot_1.0.0 tidyselect_1.1.0 plyr_1.8.6 magrittr_1.5 [86] R6_2.4.1 generics_0.0.2 withr_2.2.0 pillar_1.4.4 haven_2.3.0 [91] foreign_0.8-80 mixtools_1.2.0 survival_3.1-12 abind_1.4-5 RCurl_1.98-1.2 [96] tibble_3.0.1 crayon_1.3.4 car_3.0-8 viridis_0.5.1 locfit_1.5-9.4 [101] grid_4.0.0 readxl_1.3.1 data.table_1.12.8 propr_4.2.6 forcats_0.5.0 [106] digest_0.6.25 tidyr_1.1.0 dbscan_1.1-5 munsell_0.5.0 beeswarm_0.2.3 [111] viridisLite_0.3.0 vipor_0.4.5 ```

Please let me know if you'd like any more information from my end.

Best,

–Isaac

Dear Isaac,

Thank you for your interest in CiteFuse and for raising an important point!

This has recently been been brought to our attention by another reader of our manuscript. We thank both readers for highlighting this apparent discrepancy.

1. I spent some time to find the source of this discrepancy during my correspondence with the first reader who raised this issue. I was able to pinpoint the discrepancy to the way we select select highly variable genes using the scran package.

Between our preprint and the publication of our manuscript, our package was under active review. During this process, we updated our code to use the new hvg selection function selectHVG in scran, replacing the deprecated functions that we originally used for the manuscript trendVar and decomposeVar. Even when the same FDR (0.05) and bio (0.01) values are used, selectHVG now seems to select much fewer hvgs than before.

For example, for the Citefuse-filtered dataset, the number of selected genes are now 228 compared to 680 genes. The reason why your version of UMAP now appears to show similar plots between the different filtered matrices may be because only a third of the hvg are used for the embedding.

To reproduce our findings, please replace the hvg selection method in CiteFuse() with the version used in the manuscript. However, please note these HVG functions have now been defunct in scran pacakge.

version in package

            hvg <- selectHVG(sce,
                             FDR = FDR,
                             bio = bio)

version used in manuscript

            fit <- trendVar(sce, parametric=TRUE, use.spikes = FALSE)
            decomp <- decomposeVar(sce, fit)
            decomp <- decomp[order(decomp$bio, decreasing=TRUE),]
            hvg <- rownames(decomp)[decomp$FDR < FDR & decomp$bio > bio]

Now with more hvgs genes used in the embedding, the separation between the two CD4+ clusters become clearer. I assume the extra 400 genes provide additional information to discriminate these clusters. If you refer to Supplementary Figure 4, and particularly the CD27 expression, the two CD4+ clusters are clearly distinguished by presence and absence of CD27 expression.

I would also like to note that some of the default parameters in the current Biocoundcor version of the package and the ones used in this data are not the same. And for the purposes of demonstrating the effect of hvg selection on UMAP embedding, I have changed only the hvg selection method.

2. You also raised the point that "removing a few more cells shouldn't have this strong of an effect on the embedding." I wonder how many is few cells?

Please see the difference between the total number of filtered cells between the matrices (Supplementary Fig 3b; last set of barplots). I would like to add that, although the overall difference between number of cells between matrices does not appear large (within 100 cells when we look at the difference between CiteFuse vs HTODemux and DoubletFinder), the difference in the actual cells that are filtered (please see Supplementary Figure 3c) are very different. When we observe the venn diagram, it becomes clear that a large number of cells are different between the sets (~nearly 200 cells are different between CiteFuse and HTODemux and more than 350 cells are different between CiteFuse and DoubletFinder, although in total they select comparable amounts). Given that we have a total of ~4100 cells in this dataset and have clusters ranging between 25 and 550 cells, the difference of 200-300 cells (in this case potential doublets that might introduce noise into the data) may be enough to significantly affect the embedding.

Whilst I don't agree with your statement that there are only a few cells that are different between the datasets, I do find that there is much more room for further investigation here.

I hope these points have addressed your questions. Again, we thank you for your interest in our package and for bringing up this point! Please let us know if you any further questions and we would be more than happy to hear about more feedback on our package.

Best regards,

Hani

SydneyBioX / CiteFuse2020Kim

Trying to reproduce some figures #2

version in package

version used in manuscript