AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal

Explore doublet detection methods #446

Closed sjspielman closed 4 months ago

sjspielman commented 6 months ago

If you are filing this issue based on a specific GitHub Discussion, please link to the relevant Discussion.

#424

Also related to #425

Describe the goals of the changes to the analysis module.

This issue tracks implementing the first steps in the existing (but largely empty!) doublet-detection module. The specific goal is to compare performance on a few "ground truth" datasets used in previous benchmarking studies (including this one), available from this Zenodo repository.

We'll aim to use these four datasets, chosen due to their varying library sizes and putative cell types of origin:

What will your pull request contain?

This issue encompasses two main goals, each of which is expected to be its own PR:

  1. Run doublet detection methods on each dataset using an R script and a Python script. We'll use three methods, each of which operates on a raw counts matrix:
    • scDblFinder (R)
    • cxds (R)
      • Specifically, scDblFinder calculates a version of this score that is more robust to low sparsity, which we'll use here. The main reason we'll use this version is that it's normalized to [0,1], which is much more interpretable than the unbounded scores reported by the scds::cxds() function.
    • scrublet (Python)
  2. (Noting this could be >1 PR) Explore doublet inferences:
    • Explore distribution of scores and how they relate to the applied threshold (provides insight into the threshold itself)
    • Measure balanced accuracy for each method at a given chosen threshold
    • Visualize singlet/doublet calls in PC space
    • Compare doublet inferences to one another, e.g. with Upset plots and/or Jaccard similarity
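
As a rough illustration of the threshold, balanced accuracy, and Jaccard steps listed above, here is a minimal Python sketch. It is not the module's code: the function names (`call_doublets`, `balanced_accuracy`, `jaccard`), the 0.5 threshold, and the toy scores and labels are all hypothetical, chosen only to show how the quantities could be computed from per-cell scores.

```python
def call_doublets(scores, threshold):
    """Call a cell a doublet when its score meets or exceeds the threshold."""
    return [s >= threshold for s in scores]


def balanced_accuracy(truth, calls):
    """Mean of sensitivity (doublet recall) and specificity (singlet recall)."""
    tp = sum(t and c for t, c in zip(truth, calls))
    tn = sum((not t) and (not c) for t, c in zip(truth, calls))
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("truth must contain both singlets and doublets")
    return 0.5 * (tp / n_pos + tn / n_neg)


def jaccard(calls_a, calls_b):
    """Jaccard similarity between the sets of cells two methods call doublets."""
    set_a = {i for i, c in enumerate(calls_a) if c}
    set_b = {i for i, c in enumerate(calls_b) if c}
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: no doublets in either set -> 0.0
        return 0.0
    return len(set_a & set_b) / len(union)


# Toy data (made up for illustration): True marks a ground-truth doublet.
truth = [True, True, False, False, False, False]
scores_a = [0.9, 0.4, 0.6, 0.1, 0.2, 0.3]  # hypothetical scores, method A
scores_b = [0.8, 0.7, 0.2, 0.1, 0.6, 0.3]  # hypothetical scores, method B

threshold = 0.5  # assumed threshold, not one prescribed by the issue
calls_a = call_doublets(scores_a, threshold)
calls_b = call_doublets(scores_b, threshold)

acc_a = balanced_accuracy(truth, calls_a)
acc_b = balanced_accuracy(truth, calls_b)
overlap = jaccard(calls_a, calls_b)
print(acc_a, acc_b, overlap)
```

Balanced accuracy is a natural fit here because doublets are usually a small minority of cells, so plain accuracy would reward a method that calls everything a singlet. The same Jaccard function could be applied pairwise across all three methods to complement an UpSet plot.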

Will you require additional software beyond what is already in the analysis module?

Yes - to use scrublet, we'll need a conda environment with this package installed.

Will you require different computational resources beyond what the analysis module already uses?

I anticipate that this can be run on a laptop, but if I learn new things I will indicate in the PR.

If known, when do you expect to file the pull request?

First PR expected this week!

sjspielman commented 4 months ago

Closed by #499