AlexsLemonade / alsf-scpca

Management and analysis tools for ALSF Single-cell Pediatric Cancer Atlas data.
BSD 3-Clause "New" or "Revised" License
0 stars 1 forks source link

Add cellsnp module to genetic demux #152

Closed jashapiro closed 2 years ago

jashapiro commented 2 years ago

This PR adds two steps to the genetic demux pseudo-workflow (#127). The first is taking the output from STARsolo and mpileup to call genotypes for each cell barcode with cellsnp-lite. The second is to use those genotype calls to assign samples for each cell barcode with vireo.

The first step had some challenges, which I ultimately realized were due to the fact that I was trying to call SNPs for all 10X barcodes. So In this iteration I am using the calls from STARsolo.

Based on the previous STARsolo options, this means that we are using the emptydrops_CR setting to call cells, which should be fine, but we might want to adjust later. We could potentially use a low threshold to include more cells in that set, but we would have to get quite lucky for a low count cell to have sufficient SNPs to identify the sample of origin, so that might not be necessary. As it stands, there were ~13,000 cells identified in the sample that I looked at, and cellsnp/vireo assigned 7,751 (58%) of them as single cells with assigned samples. (201 were assigned as doublets). There is certainly the chance that there would be some changes there with different thresholds, but I have not explored that yet (I seem to recall that the "standard" emptydrops threshold had many more cells, but I would have to look at that)

A few other notable points:

Next steps

Following this PR, I plan to go in two directions: One is to look more closely at the results as they stand to see how they might compare to previous demultiplexing with Seurat (since that made more calls), and generally look at the questions of which samples are getting called vs. not having enough information. I also plan integrate all steps into a single workflow, where we can start with a run or library id, find the relevant bulk samples, then do all mapping and genotyping to calling cells from start to finish. There may need to be a bit more planning on what we want the outputs to look like, and which options to use, but I think we now have all the components we need.

jashapiro commented 2 years ago

In your PR you mentioned that you are using emptydrops_CR call cells and that you are getting ~13,000 cells that are identified in the sample and then cellsnp/vireo is assigning 7,751 of them. I definitely remember you identifying upwards of 50,000 cells using Alevin-fry + filtering with empty Drops or am I interpreting something incorrectly here? I'm just curious as to why this may be giving such drastic differences in results just with identifying the cells.

Yeah, I'm not sure! It could be a difference since these are nuclei, and default emptydrops might not be performing as well, and/or an edge case where there is really a large difference between emptydrops and emptydrops_CR. Or it could be a more general divergence between alevin-fry and starsolo. It is something I may look into a bit more as part of #155.