If you are filing this issue based on a specific GitHub Discussion, please link to the relevant Discussion.
#424
Also related to #425
Describe the goals of the changes to the analysis module.
This issue tracks implementing the first steps in the existing (but largely empty!) doublet-detection module. The specific goal is to compare performance on a few "ground truth" datasets used in previous benchmarking studies (including this one), available from this Zenodo repository.
We'll aim to use these four datasets, chosen due to their varying library sizes and putative cell types of origin:
hm-6k (N=6806): Mixture of human HEK293T and mouse NIH3T3 cells
Previous benchmarking showed excellent performance on this dataset, likely because ground-truth doublet annotation focused on species differences
Although this is cell line data, which may not be directly comparable to our ScPCA data, it does provide a "best case scenario"
pdx-MULTI (N=10296): PDX of human breast cancer, with mouse immune cells
HMEC-orig-MULTI (N=26426): Human primary mammary epithelial cells
pbmc-1B-dm (N=3790): PBMCs from patient with systemic lupus erythematosus
Previous benchmarking showed poorer performance on this dataset, possibly because the ground-truth doublet annotation considered homotypic doublets
What will your pull request contain?
This issue encompasses two main goals, each of which is expected to be its own PR:
Run doublet detection methods on each dataset using an R script and a Python script. We'll use three methods, each of which operates on a raw counts matrix:
scDblFinder (R)
cxds (R)
Specifically, scDblFinder calculates a version of this score that is more robust to low sparsity, which we'll use here. The main reason we'll use this score is that it's normalized to [0,1], which is much more interpretable than the unbounded scores reported by the scds::cxds() function.
scrublet (python)
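Since the scDblFinder score is already on [0,1] while raw cxds scores are unbounded, one simple way to put them on a comparable scale for plotting is a min-max rescale. A minimal sketch (the helper name is hypothetical, not part of either package):

```python
import numpy as np

def minmax_rescale(scores):
    """Rescale an unbounded score vector onto [0, 1] via min-max."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo)

# Toy unbounded scores, as an unnormalized method might report them
raw = [2.5, 10.0, 7.5, 0.0]
print(minmax_rescale(raw))  # [0.25 1.   0.75 0.  ]
```

Note this only makes scores comparable within a dataset; it does not make thresholds transferable across methods.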
(Noting this could be >1 PR) Explore doublet inferences:
Explore distribution of scores and how they relate to the applied threshold (provides insight into the threshold itself)
Measure balanced accuracy for each method at a given chosen threshold
Visualize singlet/doublet calls in PC space
Compare doublet inferences to one another, e.g. with Upset plots and/or Jaccard similarity
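Two of the planned comparisons can be sketched directly from binary call vectors. Below is a minimal, self-contained illustration of balanced accuracy against ground truth and Jaccard similarity between two methods' calls (toy data; function names are hypothetical, not from any of the packages above):

```python
import numpy as np

def balanced_accuracy(truth, calls):
    """Mean of sensitivity (doublet recall) and specificity (singlet recall)."""
    truth, calls = np.asarray(truth, bool), np.asarray(calls, bool)
    sensitivity = (truth & calls).sum() / truth.sum()
    specificity = (~truth & ~calls).sum() / (~truth).sum()
    return (sensitivity + specificity) / 2

def jaccard(calls_a, calls_b):
    """Jaccard similarity between two sets of doublet calls."""
    a, b = np.asarray(calls_a, bool), np.asarray(calls_b, bool)
    return (a & b).sum() / (a | b).sum()

truth   = [1, 1, 0, 0, 0, 0]  # first two droplets are true doublets
method1 = [1, 0, 0, 0, 0, 0]  # misses one doublet
method2 = [1, 1, 1, 0, 0, 0]  # one false positive
print(balanced_accuracy(truth, method1))  # 0.75 (sensitivity 0.5, specificity 1.0)
print(jaccard(method1, method2))          # 1/3: one shared call out of three total
```

Balanced accuracy is the relevant summary here because doublets are a small minority class, so plain accuracy would reward calling everything a singlet.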
Will you require additional software beyond what is already in the analysis module?
Yes - to use scrublet, we'll need a conda environment with this package installed.
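A possible shape for that environment file (names, versions, and the pip install route are illustrative assumptions, not a committed spec):

```yaml
# Hypothetical environment.yml sketch; pin versions as appropriate
name: doublet-detection
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pip
  - pip:
      - scrublet
```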
Will you require different computational resources beyond what the analysis module already uses?
I anticipate that this can be run on a laptop, but if I learn otherwise I will note it in the PR.
If known, when do you expect to file the pull request?
First PR expected this week!