scAdvanced intro & cell type dataset selection

jashapiro commented 2 years ago

The introductory module is intended to review importing preprocessed data, followed by filtering and normalization. For simplicity, it seems logical that we might use the same dataset for the following module, where we demonstrate methods for cell type assignment. To prepare for implementing these modules, we will first need to select a dataset to use.

The main dataset used in this module should probably be the output of Cell Ranger (unfiltered) from a public data set. We could use an alevin-fry output, but I expect that Cell Ranger would be more broadly useful. We can then start with the DropletUtils::read10xCounts() function.

To complete this issue, we should create a notebook and/or scripts in the scRNA-seq-advanced/setup directory that includes the following steps for the chosen dataset:

Downloading the data from a public source. Ideally this would include the code for download, but a link to a webpage where the data can be downloaded may be sufficient.
Importing the data into a SingleCellExperiment object
Filtering with emptyDropsCellRanger and miQC
PCA (with HVG modeling) & UMAP
Plotting the UMAP
Some form of cell type identification, as an initial test. This does not need to be finalized at this point, but we do want to verify for any selected dataset that we can perform cell type identification.
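
As a rough sketch of how the import, filtering, and dimension reduction steps above might fit together, something like the following could be a starting point. Everything here is an assumption rather than a decision: `data_dir` is a hypothetical path to an unfiltered Cell Ranger output directory, the `^MT-` pattern assumes human gene symbols, the FDR cutoff and number of HVGs are placeholders, and a normalization step is included only because PCA needs log-normalized values. Cell type identification is left out since the method has not been chosen yet.

```r
# Sketch only, not the final notebook. `data_dir` is a hypothetical path to an
# unfiltered (raw) Cell Ranger output directory for whichever dataset is chosen.
process_dataset <- function(data_dir, fdr_cutoff = 0.01, n_hvg = 2000) {
  # Import raw counts into a SingleCellExperiment, using barcodes as column names
  sce <- DropletUtils::read10xCounts(data_dir, col.names = TRUE)

  # Drop barcodes with zero total counts before testing for empty droplets
  total_counts <- Matrix::colSums(SingleCellExperiment::counts(sce))
  sce <- sce[, total_counts > 0]

  # Empty droplet filtering; barcodes that were not tested have an NA FDR,
  # so which() is used to drop them along with the non-significant ones
  droplet_stats <- DropletUtils::emptyDropsCellRanger(
    SingleCellExperiment::counts(sce)
  )
  sce <- sce[, which(droplet_stats$FDR <= fdr_cutoff)]

  # Per-cell QC stats, with a mitochondrial subset (assumes human gene symbols)
  mito_genes <- grep("^MT-", SummarizedExperiment::rowData(sce)$Symbol)
  sce <- scuttle::addPerCellQC(sce, subsets = list(mito = mito_genes))

  # miQC: fit the mixture model, then filter likely compromised cells
  miqc_model <- miQC::mixtureModel(sce)
  sce <- miQC::filterCells(sce, miqc_model)

  # Normalize (needed for the variance modeling and PCA below)
  sce <- scuttle::logNormCounts(sce)

  # Model gene variance and pick highly variable genes
  gene_variance <- scran::modelGeneVar(sce)
  hvgs <- scran::getTopHVGs(gene_variance, n = n_hvg)

  # PCA on the HVGs, then UMAP from the PCA results
  sce <- scater::runPCA(sce, subset_row = hvgs)
  sce <- scater::runUMAP(sce, dimred = "PCA")

  sce
}
```

With a processed object in hand, `scater::plotUMAP(sce)` would cover the plotting step. If the chosen dataset includes antibody capture features, those rows would probably need to be separated from the gene expression counts before filtering, which this sketch does not do.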

Following the review of this notebook and the selection of the dataset, we will separate out the various steps, adding detail and commentary for instruction that this initial notebook does not need to include.

Note: We may want to break up this issue into sub-issues. For tracking purposes (as this issue blocks others), we may want to keep this as a meta-issue and make any sub-issues blockers for it.

allyhawkins commented 1 year ago

The table that I mentioned in our meeting that lists publicly available datasets is from an integration review. It looks like the table links to the papers that published each of the datasets, so it may not be as useful as I originally thought, but here it is just in case: https://www.nature.com/articles/s41587-021-00895-7/tables/2

I also want to point out that of the ones that say CITE-seq is included (if we want to use the same dataset here and for #564), the only one that is both RNA-seq and CITE-seq (rather than ATAC and CITE) takes you to the weighted nearest neighbors paper, which is an option.

Alternatively, we could use the publicly available datasets on 10X. They don't have cell types associated with them, but they do have a few datasets that include a panel of CITE-seq antibodies. Here's one example: https://www.10xgenomics.com/resources/datasets/10-k-pbm-cs-from-a-healthy-donor-gene-expression-and-cell-surface-protein-3-standard-3-0-0

sjspielman commented 1 year ago

@jashapiro

For simplicity, it seems logical that we might use the same dataset for the following module, where we demonstrate methods ~for dataset selection.~

---> for cell-type annotation, right? :) My brain first went to "we're not planning to show them how to navigate databases" before I realized this is probably a typo!

jashapiro commented 1 year ago

---> for cell-type annotation

Yes, correct! Updated the text.

sjspielman commented 1 year ago

Noting that from https://www.nature.com/articles/s41587-021-00895-7/tables/2, there are two references with both RNA and surface proteins:

Ref 11 data is hashed, which I think we'd like to avoid
Ref 74's data is available either as pre-filtered sparse count matrices in GEO or as BAM files in SRA, neither of which is a great starting point.

sjspielman commented 1 year ago

Noting from 10x -

https://www.10xgenomics.com/resources/datasets/10-k-pbm-cs-from-a-healthy-donor-gene-expression-and-cell-surface-protein-3-standard-3-0-0 has over 6 million cells and filtering attempts are way too much for my computer to handle, so this one is out.
https://www.10xgenomics.com/resources/datasets/10-k-cells-from-a-malt-tumor-gene-expression-and-cell-surface-protein-3-standard-3-0-0 similar situation with seriously excessive runtime and RStudio-crashing for filtering. I'm seeing similar sizes for other ADT-containing datasets. The filtered versions of these data are already down to <10,000 cells, so if we start with one of those then there would not be much/any filtering for them to do.

EDIT: This is 10000% because I forgot to remove total==0 cells first! Filtering is much friendlier to my computer now, as expected :).
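
For the record, the fix amounts to something like the toy example below. The matrix values are made up for illustration; with a real unfiltered matrix the same column-sum filter would be applied to the counts before any empty droplet testing.

```r
# Toy counts matrix: rows are genes, columns are barcodes (values are made up).
# matrix() fills column by column, so barcode1 and barcode3 are all zeros.
toy_counts <- matrix(
  c(0, 0, 0,
    5, 1, 0,
    0, 0, 0,
    2, 0, 3),
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("barcode", 1:4))
)

# Total counts per barcode
barcode_totals <- colSums(toy_counts)

# Keep only barcodes with at least one count;
# drop = FALSE keeps the result a matrix even if a single column remains
nonzero_counts <- toy_counts[, barcode_totals > 0, drop = FALSE]

dim(nonzero_counts)
```

For the sparse matrices that come out of a raw 10x import, `Matrix::colSums()` should do the same job without converting to a dense matrix.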

sjspielman commented 1 year ago

I wonder if we might consider using a dataset (already SCE objects) in the scRNAseq package https://bioconductor.org/packages/release/data/experiment/vignettes/scRNAseq/inst/doc/scRNAseq.html#available-data-sets.

A couple of these datasets, based on some ctrl+F'ing of the manual, have CITE-seq data and can be accessed with these functions. The first two seem like better places to start looking.

KotliarovPBMCData()

Kotliarov, Y., R. Sparks, A. Martins, M. Mulè, Y. Lu, M. Goswami, L. Kardava, et al. 2020. “Broad Immune Activation Underlies Shared Set Point Signatures for Vaccine Responsiveness in Healthy Individuals and Disease Activity in Patients with Lupus.” Nat. Med. 26 (4): 618–29.

MairPBMCData()

Mair, F., J. R. Erickson, V. Voillet, Y. Simoni, T. Bi, A. J. Tyznik, J. Martin, R. Gottardo, E. W. Newell, and M. Prlic. 2020. “A Targeted Multi-omic Analysis Approach Measures Protein Expression and Low-Abundance Transcripts on the Single-Cell Level.” Cell Rep 31 (1): 107499.

StoeckiusHashingData() (beware: hashing, and only "mostly human")

Stoeckius, M., S. Zheng, B. Houck-Loomis, S. Hao, B. Z. Yeung, W. M. Mauck, P. Smibert, and R. Satija. 2018. “Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics.” Genome Biol. 19 (1): 224.
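
A quick way to size up one of these candidates might look like the sketch below. `inspect_candidate()` is a hypothetical helper, not anything in the scRNAseq package, and it assumes the loader accepts `mode = c("rna", "adt")`, which I believe is the case for the first two functions above but have not checked against the current release.

```r
# Hypothetical helper for a first look at a candidate dataset.
# `loader` is assumed to be an scRNAseq dataset function that accepts a `mode`
# argument for requesting both RNA and ADT counts.
inspect_candidate <- function(loader = scRNAseq::KotliarovPBMCData) {
  # Downloads (or loads from cache) the dataset as a SingleCellExperiment
  sce <- loader(mode = c("rna", "adt"))

  list(
    # Number of features and cells in the main experiment
    dimensions = dim(sce),
    # Names of any alternative experiments, where ADT counts would be expected
    alt_experiments = SingleCellExperiment::altExpNames(sce),
    # Column metadata fields, to see whether cell type labels are provided
    coldata_columns = colnames(SummarizedExperiment::colData(sce))
  )
}
```

Calling `inspect_candidate()` would check the Kotliarov data, and `inspect_candidate(scRNAseq::MairPBMCData)` the Mair data. One thing to keep in mind is that these objects may already be filtered, in which case they would not give us much to do in the filtering steps.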

AlexsLemonade / training-modules

scAdvanced intro & cell type dataset selection #563