AlexsLemonade / training-modules

A collection of modules that are combined into 1-5 day workshops on computational topics for the childhood cancer research community.

scAdvanced intro & cell type dataset selection #563

Closed jashapiro closed 1 year ago

jashapiro commented 2 years ago

The introductory module is intended to review importing preprocessed data, followed by filtering and normalization. For simplicity, it seems logical that we might use the same dataset for the following module, where we demonstrate methods for cell type assignment. To prepare for implementing these modules, we will first need to select a dataset to use.

The main dataset used in this module should probably be the output of Cell Ranger (unfiltered) from a public data set. We could use an alevin-fry output, but I expect that Cell Ranger would be more broadly useful. We can then start with the DropletUtils::read10xCounts() function.

To complete this issue, we should create a notebook and/or scripts in the scRNA-seq-advanced/setup directory that includes the following steps for the chosen dataset:

Following the review of this notebook and the selection of the dataset, we will separate out the various steps, adding detail and commentary for instruction that this initial notebook does not need to include.

Note: We may want to break up this issue into sub-issues. For tracking purposes (as this issue blocks others), we may want to keep this as a meta-issue and make any sub-issues blockers for it.

allyhawkins commented 1 year ago

The table that I mentioned in our meeting that lists publicly available datasets is from an integration review. It looks like the table links to the papers that published each of the datasets, so it may not be as useful as I originally thought, but here it is just in case: https://www.nature.com/articles/s41587-021-00895-7/tables/2

I also want to point out that of the ones that say CITE-seq is included (if we want to use the same dataset here and for #564), the only one that is both RNA-seq and CITE-seq (rather than ATAC and CITE) takes you to the weighted nearest neighbors paper, which is an option.

Alternatively, we could use the publicly available datasets on 10X. They don't have cell types associated with them, but a few of them do include a panel of CITE-seq antibodies. Here's one example: https://www.10xgenomics.com/resources/datasets/10-k-pbm-cs-from-a-healthy-donor-gene-expression-and-cell-surface-protein-3-standard-3-0-0

sjspielman commented 1 year ago

@jashapiro

For simplicity, it seems logical that we might use the same dataset for the following module, where we demonstrate methods ~for dataset selection.~

---> for cell-type annotation, right? :) My brain first went to "we're not planning to show them how to navigate databases" before I realized this is probably a typo!

jashapiro commented 1 year ago

---> for cell-type annotation

Yes, correct! Updated the text.

sjspielman commented 1 year ago

Noting that from https://www.nature.com/articles/s41587-021-00895-7/tables/2, there are two references with both RNA and surface proteins:

sjspielman commented 1 year ago

Noting from 10x -

EDIT: This is 10000% because I forgot to remove total==0 cells first! Filtering is much friendlier to my computer now, as expected :).
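The fix described in the edit above (dropping barcodes whose total count is zero before any further filtering) can be sketched as follows. This is a minimal illustration in plain Python rather than the R/Bioconductor tooling discussed in this thread; the function name `drop_empty_barcodes`, the toy matrix, and the toy barcodes are all hypothetical and not taken from any of the datasets mentioned here.

```python
def drop_empty_barcodes(counts, barcodes):
    """Remove barcodes (columns) whose total count across all genes is zero.

    counts   -- list of rows, one row per gene, one value per barcode
    barcodes -- list of barcode labels, one per column of `counts`
    """
    # Column sums: the total count observed for each barcode.
    totals = [sum(column) for column in zip(*counts)]
    # Indices of barcodes with at least one count.
    keep = [i for i, total in enumerate(totals) if total > 0]
    filtered = [[row[i] for i in keep] for row in counts]
    kept_barcodes = [barcodes[i] for i in keep]
    return filtered, kept_barcodes


# Hypothetical toy data: 3 genes x 4 barcodes, two of which are empty.
toy_counts = [
    [0, 3, 0, 2],
    [0, 1, 0, 5],
    [0, 0, 0, 1],
]
toy_barcodes = ["AAAC", "AAAG", "AAAT", "AACA"]

filtered_counts, kept = drop_empty_barcodes(toy_counts, toy_barcodes)
print(kept)
```

Unfiltered droplet output typically contains far more empty barcodes than real cells, so removing the all-zero columns first shrinks the matrix that any later filtering step has to process.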

sjspielman commented 1 year ago

I wonder if we might consider using a dataset (already SCE objects) in the scRNAseq package https://bioconductor.org/packages/release/data/experiment/vignettes/scRNAseq/inst/doc/scRNAseq.html#available-data-sets.

A couple of these datasets, based on some Ctrl+F'ing of the manual, have CITE-seq data and can be accessed with these functions. The first two seem like better places to start looking.