If you are filing this issue based on a specific GitHub Discussion, please link to the relevant Discussion.
#424
Also related to #425
Describe the goals of the changes to the analysis module.
This issue tracks implementing the first steps in the existing (but largely empty!) doublet-detection module. The specific goal is to compare performance on a few "ground truth" datasets used in previous benchmarking studies (including this one), available from this Zenodo repository.
We'll aim to use these four datasets, chosen due to their varying library sizes and putative cell types of origin:
hm-6k (N=6806): Mixture of human HEK293T and mouse NIH3T3 cells
Previous benchmarking showed excellent performance on this dataset, likely because ground-truth doublet annotation focused on species differences
Although this is cell line data, which may not be directly comparable to our ScPCA data, it does provide a "best case scenario"
pdx-MULTI (N=10296): PDX of human breast cancer, with mouse immune cells
HMEC-orig-MULTI (N=26426): Human primary mammary epithelial cells
pbmc-1B-dm (N=3790): PBMCs from patient with systemic lupus erythematosus
Previous benchmarking showed poorer performance on this dataset, possibly because the ground-truth doublet annotation considered homotypic doublets
What will your pull request contain?
This issue encompasses two main goals, each of which is expected to be its own PR:
Run doublet detection methods on each dataset using an R script and a Python script. We'll use three methods, each of which operates on a raw counts matrix:
scDblFinder (R)
cxds (R)
Specifically, scDblFinder calculates a version of this score that is more robust to low sparsity, which we'll use here. The main reason we'll use this score is that it's normalized to [0,1], which is much more interpretable than the unbounded scores reported by the scds::cxds() function.
scrublet (python)
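Since the scDblFinder score is already on [0,1] while raw cxds scores are unbounded, one simple way to put them on a comparable scale for plotting is a min-max rescale. A minimal sketch (the helper name is hypothetical, not part of either package):

```python
import numpy as np

def minmax_rescale(scores):
    """Rescale an unbounded score vector onto [0, 1] via min-max."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo)

# Toy unbounded scores, as an unnormalized method might report them
raw = [2.5, 10.0, 7.5, 0.0]
print(minmax_rescale(raw))  # [0.25 1.   0.75 0.  ]
```

Note this only makes scores comparable within a dataset; it does not make thresholds transferable across methods.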
(Noting this could be >1 PR) Explore doublet inferences:
Explore distribution of scores and how they relate to the applied threshold (provides insight into the threshold itself)
Measure balanced accuracy for each method at a given chosen threshold
Visualize singlet/doublet calls in PC space
Compare doublet inferences to one another, e.g. with Upset plots and/or Jaccard similarity
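Two of the planned comparisons can be sketched directly from binary call vectors. Below is a minimal, self-contained illustration of balanced accuracy against ground truth and Jaccard similarity between two methods' calls (toy data; function names are hypothetical, not from any of the packages above):

```python
import numpy as np

def balanced_accuracy(truth, calls):
    """Mean of sensitivity (doublet recall) and specificity (singlet recall)."""
    truth, calls = np.asarray(truth, bool), np.asarray(calls, bool)
    sensitivity = (truth & calls).sum() / truth.sum()
    specificity = (~truth & ~calls).sum() / (~truth).sum()
    return (sensitivity + specificity) / 2

def jaccard(calls_a, calls_b):
    """Jaccard similarity between two sets of doublet calls."""
    a, b = np.asarray(calls_a, bool), np.asarray(calls_b, bool)
    return (a & b).sum() / (a | b).sum()

truth   = [1, 1, 0, 0, 0, 0]  # first two droplets are true doublets
method1 = [1, 0, 0, 0, 0, 0]  # misses one doublet
method2 = [1, 1, 1, 0, 0, 0]  # one false positive
print(balanced_accuracy(truth, method1))  # 0.75 (sensitivity 0.5, specificity 1.0)
print(jaccard(method1, method2))          # 1/3: one shared call out of three total
```

Balanced accuracy is the relevant summary here because doublets are a small minority class, so plain accuracy would reward calling everything a singlet.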
Will you require additional software beyond what is already in the analysis module?
Yes - to use scrublet, we'll need a conda environment with this package installed.
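A possible shape for that environment file (names, versions, and the pip install route are illustrative assumptions, not a committed spec):

```yaml
# Hypothetical environment.yml sketch; pin versions as appropriate
name: doublet-detection
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pip
  - pip:
      - scrublet
```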
Will you require different computational resources beyond what the analysis module already uses?
I anticipate that this can be run on a laptop, but if I learn otherwise I will note it in the PR.
If known, when do you expect to file the pull request?
First PR expected this week!