AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal

Doublets: Write script to process ScPCA samples #554

Open sjspielman opened 6 days ago

sjspielman commented 6 days ago

If you are filing this issue based on a specific GitHub Discussion, please link to the relevant Discussion.

364

Describe the goals of the changes to the analysis module.

Our next step in the doublet-detection module will be to run doublet detection on ScPCA data. Specifically, we have discussed running scDblFinder on ScPCA data with the following approach:

What will your pull request contain?

Two scripts: the R script to run scDblFinder, and the shell script to call it.

The overall run module shell script will also be modified so that you can indicate whether the module should be run in "benchmark" or "process" (?) mode, where the former will run benchmarking scripts/notebooks, and the latter will run doublet detection on ScPCA libraries.
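A minimal sketch of how such a mode switch might look. The script names `run_benchmark.sh` and `run_scpca.sh` and the helper `select_script` are hypothetical placeholders, not files or functions known to exist in the module, and the sketch only prints which script it would call rather than running anything.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: map a mode name to the script that would be run.
# Both script names are placeholders for illustration only.
select_script() {
    local mode="$1"
    case "$mode" in
        benchmark) echo "run_benchmark.sh" ;;
        process)   echo "run_scpca.sh" ;;
        *)
            echo "Unknown mode: $mode (expected 'benchmark' or 'process')" >&2
            return 1
            ;;
    esac
}

# Example: resolve the script for "process" mode.
# Assigning first means an unknown mode stops the script under `set -e`.
script="$(select_script process)"
echo "Would run: $script"
```

An unrecognized mode returns a non-zero status, so a typo in the mode name would fail loudly instead of silently skipping the analysis.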

Will you require additional software beyond what is already in the analysis module?

There will be no changes in dependencies.

Will you require different computational resources beyond what the analysis module already uses?

There will be no changes in compute - it can still be run on a laptop.

If known, when do you expect to file the pull request?

I expect to start this ~next sprint, which starts July 15th. So, we can expect a PR in probably 3ish weeks.~ towards the end of this sprint. We can expect a PR in 2ish weeks.

jashapiro commented 6 days ago

> The overall run module shell script will also be modified so that you can indicate whether the module should be run in "benchmark" or "process" (?) mode, where the former will run benchmarking scripts/notebooks, and the latter will run doublet detection on ScPCA libraries.

I would probably recommend keeping this as two separate scripts. You can call both from the test action.

sjspielman commented 6 days ago

Noting that for multiplexed libraries, the script will need to take multiple samples into consideration: https://bioconductor.org/packages/release/bioc/vignettes/scDblFinder/inst/doc/scDblFinder.html#multiple-samples

jashapiro commented 6 days ago

> Noting that for multiplexed libraries, the script will need to take multiple samples into consideration: https://bioconductor.org/packages/release/bioc/vignettes/scDblFinder/inst/doc/scDblFinder.html#multiple-samples

Given that we are no good at defining sample of origin for the current multiplexed samples in the portal, I would see what happens if we just run it straight.

allyhawkins commented 6 days ago

From the link you posted:

> If you have multiple samples (understood as different cell captures), then it is preferable to look for doublets separately for each sample (for multiplexed samples with cell hashes, this means for each batch).

My interpretation of this is that you should not indicate multiple samples and should not need to do anything differently. I believe a batch here refers to a single library that contains the multiplexed samples. A main source of doublets is the barcoding process during sequencing prep, which is done at the library level, not the sample level. I would have this run on each library that we have.
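Running on each library could be sketched roughly as below. This is a dry run under stated assumptions: that each library is stored as a file ending in `_processed.rds`, and that the R script is named `run_scdblfinder.R` and accepts `--input_sce_file` and `--output_file` flags. The script name, flags, output naming, and the `print_library_commands` helper are all hypothetical. The function only prints the commands it would run.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print one Rscript command per library file found under a data directory.
# Assumptions (hypothetical, for illustration only):
#   - each library is stored as a file ending in `_processed.rds`
#   - the R script is named `run_scdblfinder.R` and accepts these two flags
print_library_commands() {
    local data_dir="$1"
    local results_dir="$2"
    local sce_file library_id

    # Sort so that libraries are listed in a stable order
    find "$data_dir" -type f -name "*_processed.rds" | sort | while IFS= read -r sce_file; do
        # Strip the directory and the suffix to recover the library ID
        library_id="$(basename "$sce_file" _processed.rds)"
        echo "Rscript run_scdblfinder.R --input_sce_file $sce_file --output_file $results_dir/${library_id}_doublets.tsv"
    done
}
```

Calling `print_library_commands <data directory> <results directory>` lists one command per library, with multiplexed libraries treated the same as any other library and no sample information passed along.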