AlexsLemonade / sc-data-integration


Script to normalize CITE-Seq data #184

Closed sjspielman closed 2 years ago

sjspielman commented 2 years ago

Related to #181 Closes #185

This PR adds a script 00e-process-scpca-citeseq.R to filter and normalize ADT counts before proceeding to the downstream analyses pipeline. As part of this review, we should discuss both logistical approaches (namely how I handled directories) and the actual steps taken to process the ADT counts.

I updated scripts/README.md to explain that this script should be run after 00d-obtain-scpca-sce.R but before the downstream analyses pipeline. The script reads in all files from the data/scpca/filtered_sce/ directory and checks whether they have CITE-seq. If they do, the data is filtered and normalized, and ultimately saved in results/scpca/filtered_sce/, which should now be used as the input directory for downstream analyses. If they do not have CITE-seq, we still export the unprocessed SCE to results/scpca/filtered_sce/ for consistency. A different approach would be to integrate the CITE-seq processing into preprocess-sce.R, which would remove the need for this stand-alone script entirely, along with the extra results directories. However, this step may be specific to ScPCA, so it may not make sense to include it there. Alternatively, we could add a has_CITEseq column to the library file and process only those libraries, which would remove the need to read in every file and check for CITE-seq.
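A rough sketch of the flow described above (filter_and_normalize_adt() is a hypothetical stand-in for the actual ADT processing steps; the "CITEseq" altExp name is taken from later comments in this thread):

```r
# Sketch only, not the actual 00e-process-scpca-citeseq.R script
library(SingleCellExperiment)

input_dir  <- "data/scpca/filtered_sce"
output_dir <- "results/scpca/filtered_sce"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

sce_files <- list.files(input_dir, pattern = "\\.rds$", full.names = TRUE)

for (sce_file in sce_files) {
  sce <- readr::read_rds(sce_file)
  # "CITEseq" is the assumed name of the altExp holding ADT counts
  if ("CITEseq" %in% altExpNames(sce)) {
    sce <- filter_and_normalize_adt(sce)  # hypothetical helper
  }
  # export processed and unprocessed SCEs alike, for a consistent input dir
  readr::write_rds(sce, file.path(output_dir, basename(sce_file)))
}
```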

For the actual ADT processing, I followed OSCA when possible and went rogue otherwise!

Note that my repeat logic at the bottom maybe could use some help...

Finally, there is one additional incidental change in 00d-obtain-scpca-sce.R: I think there's a bug there, so I swapped out a variable to fix it. Can you confirm?

sjspielman commented 2 years ago

So I think we actually want to be working with objects that go through downstream analyses first (meaning bad cells are removed and RNA-seq assay is normalized), then those libraries that have corresponding CITE-seq go through further processing.

This makes sense, and bonus is also not annoying! ;) I'll see how the ADT count situation changes with the appropriately processed input files, stay tuned.

sjspielman commented 2 years ago

@allyhawkins As I revisit this, I'm curious to hear your thoughts about structuring directories. Part of me desperately wants to overwrite files saved in scpca-downstream-analyses/, while part of me sees this as a horrible plan 😂 Currently I'm sketching this out to write new files to a results directory scpca-downstream-analyses-citeseq/, where files are either simply copied (if no citeseq) or updated and written out (with citeseq). But! If, as we've lightly discussed over in #183, the sce_dir argument may be going away, then maybe it's very fair to not copy files that don't need further processing, since directories for where filtered libraries are stored will be going into metadata?

Please also note there are a lot of things called "filtered" at this point and my brain is starting to crack, so maybe I'm just confusing myself with paths. 🥴

allyhawkins commented 2 years ago

Currently I'm sketching this out to write new files to a results directory scpca-downstream-analyses-citeseq/, where files are either simply copied (if no citeseq) or updated and written out (with citeseq). But! If, as we've lightly discussed over in https://github.com/AlexsLemonade/sc-data-integration/pull/183, the sce_dir argument may be going away, then maybe it's very fair to not copy files that don't need further processing, since directories for where filtered libraries are stored will be going into metadata?

I definitely don't think we need to be copying any files that don't have CITE-seq, that seems like a waste of space that we should be able to avoid. Given that we are going to be adding the input directory to be required in the metadata file, I think we can specifically store the SCEs with CITE-seq that have been further processed in a new directory and specify using that for input to integration.

sjspielman commented 2 years ago

@allyhawkins I'm coming back to revise this shortly, and want to note that in https://github.com/AlexsLemonade/sc-data-integration/pull/184/commits/f449ad39c90c8841232c00749062b530485ad7e2 I ended up refactoring the code a bit while merging in main to address conflicts. Specifically, I updated the new function I added, find_downstream_sce_files(), so that it can handle multiple SCE directories, and it is now incorporated into 02-prepare-merged-sce.R. I also moved your small helper function perform_file_search() into integration-helpers.R. Currently it is only called from find_downstream_sce_files(), but I imagine it could be helpful to have around in case other use cases come up! Please let me know if you have feedback on these choices when I re-request review (not ready yet I think).
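For reference, a hypothetical sketch of the refactored shape; the real implementations live in integration-helpers.R, and only the function names are taken from this thread:

```r
# Hypothetical helper: recursively search one directory for matching files
perform_file_search <- function(sce_dir, pattern) {
  list.files(sce_dir, pattern = pattern, recursive = TRUE, full.names = TRUE)
}

# Accept one or more SCE directories and return all matching SCE file paths
find_downstream_sce_files <- function(sce_dirs, pattern = "_sce\\.rds$") {
  unlist(lapply(sce_dirs, perform_file_search, pattern = pattern))
}
```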

allyhawkins commented 2 years ago

Please let me know if you have feedback on these choices when I re-request review (not ready yet I think).

Okay sounds good! I'll wait for the review re-request to take a look at everything then.

sjspielman commented 2 years ago

@allyhawkins ready for a look now!

The metadata slot now includes a column citeseq_percent_cells_removed, and I now print out a warning if the while loop (which is still present, for now) is entered. Regarding your comment:

So from this maybe we should be removing low abundant ADTs not cells with size factor of 0 to combat the problem if we move forward with the while loop. It looks like we could use librarySizeFactors, but there is an additional note about working with ADT data that recommends against using that.

I feel like they make a good argument about not using librarySizeFactors due to biases, so I'm hesitant to go that direction. Something we might consider is a threshold for throwing an error if "too many" size factors are 0 after the filtering step, but I don't have a clear sense of how to choose this in a principled manner besides just saying "eh how about 1% of cells"...
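A minimal sketch of such a guard, where the 1% default is just the arbitrary placeholder floated above, not a principled choice:

```r
# Error out if more than some fraction of cells have a zero ADT size factor
check_adt_size_factors <- function(size_factors, max_zero_fraction = 0.01) {
  zero_fraction <- mean(size_factors == 0)
  if (zero_fraction > max_zero_fraction) {
    stop(sprintf(
      "%.2f%% of cells have zero ADT size factors (allowed: %.2f%%)",
      100 * zero_fraction, 100 * max_zero_fraction
    ))
  }
  invisible(size_factors)
}
```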

sjspielman commented 2 years ago

@allyhawkins I've implemented this ADT check, and sadly it doesn't help. I included a command line argument for adt_threshold and played around with values of 1, 2, 10, 20, 30, 40, and 50. Thresholds <=40 still leave 2 cells in SCPCL000296 with 0 size factors, and at 50 there are no more zero size factors, but we lose a lot of ADTs!

Currently the script will entirely error out if 0's are encountered. Now I'm thinking the way to go here could be, rather than iteratively calculating size factors with the while loop until all are >0, to just do the calculation once and remove the cells with size factors of 0.
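A sketch of that "calculate once, then drop" alternative: compute median-based ADT size factors a single time and remove any cells whose size factor comes out as exactly zero, rather than looping until none remain. This assumes `filtered_sce` already has cells and ADTs filtered, with ADT counts in its altExp:

```r
# Sketch only: calculate size factors once, then drop zero-size-factor cells
library(SingleCellExperiment)

baseline <- DropletUtils::ambientProfileBimodal(altExp(filtered_sce))
sf_amb <- scuttle::medianSizeFactors(altExp(filtered_sce), reference = baseline)

# remove cells whose ADT size factor is exactly zero, instead of iterating
keep_cells <- sf_amb > 0
filtered_sce <- filtered_sce[, keep_cells]
sizeFactors(altExp(filtered_sce)) <- sf_amb[keep_cells]
altExp(filtered_sce) <- scater::logNormCounts(altExp(filtered_sce))
```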

allyhawkins commented 2 years ago

@allyhawkins I've implemented this ADT check, and sadly it doesn't help. I included a command line argument for adt_threshold and played around with different values 1, 2, 10, 20, 30, 40, 50. <=40 still have 2 cells in SCPCL000296 with 0 size factors, and at 50 there are no more 0 size factors but we lose a lot of ADTs!

You are using the output from downstream analyses, right? I'm confused about why you are seeing failures. Running the code below works for me without errors, and only removes 1 ADT that has very low expression (mean 0.005).

```r
sce_dir <- "results/scpca/scpca-downstream-analyses"
sce <- readr::read_rds(file.path(sce_dir, "SCPCS000222", "SCPCL000296_miQC_processed_sce.rds"))

# cell barcodes to keep, based on ADT QC
retain_barcodes <- DropletUtils::cleanTagCounts(altExp(sce)) %>%
  tibble::as_tibble(rownames = "cell_barcode") %>%
  dplyr::filter(discard == FALSE) %>%
  dplyr::pull(cell_barcode)

# ADTs to keep, based on detection and mean expression
retain_adts <- as.data.frame(rowData(altExp(sce))) %>%
  tibble::rownames_to_column("ADT") %>%
  dplyr::filter(detected > 1 & mean > 1) %>%
  dplyr::pull(ADT)

# first filter sce
filtered_sce <- sce[, retain_barcodes]
altExp(filtered_sce) <- altExp(filtered_sce)[retain_adts, ]

# compute ambient-based size factors and normalize
baseline <- DropletUtils::ambientProfileBimodal(altExp(filtered_sce))
sf_amb <- scuttle::medianSizeFactors(altExp(filtered_sce), reference = baseline)
sizeFactors(altExp(filtered_sce)) <- sf_amb
sce <- scater::logNormCounts(altExp(filtered_sce), size.factors = sf_amb)
```

sjspielman commented 2 years ago

You are using the output from downstream analyses right? I'm confused why you are seeing failures?

I definitely am, so now we're all confused!! I honestly can't find the discrepancy between our code, other than my directly referencing the CITEseq altExp name (but this shouldn't matter for our data!). I've run your code through several times, and mine through several times... there's something I'm just not seeing, because sometimes I get 0 and sometimes I get 2 size factors equal to 0! Hm!

sjspielman commented 2 years ago

@allyhawkins Ok, I think I've found and addressed the fiddly logic. Now I have no zero size factors! I think we're about there...

sjspielman commented 2 years ago

One final question: how do we feel about the file name? Should it be less prescriptive about pipeline order, like add-rms-celltypes.R is, without 01a?

sjspielman commented 2 years ago

@allyhawkins Also note the change in https://github.com/AlexsLemonade/sc-data-integration/pull/184/commits/667fb30805ca3287295d35a7e52f7a1512d1c9f5. I was starting to run the Gawad data through when I realized these filenames could be a bit more informative, analogous to how you named the celltype SCEs.