Closed by sjspielman 2 years ago
So I think we actually want to be working with objects that go through downstream analyses first (meaning bad cells are removed and RNA-seq assay is normalized), then those libraries that have corresponding CITE-seq go through further processing.
This makes sense, and bonus is also not annoying! ;) I'll see how the ADT count situation changes with the appropriately processed input files, stay tuned.
@allyhawkins As I revisit this, I'm curious to hear your thoughts about structuring directories. Part of me desperately wants to overwrite files saved in `scpca-downstream-analyses/`, while part of me sees this as a horrible plan 😂

Currently I'm sketching this out to write new files to a results directory `scpca-downstream-analyses-citeseq/`, where files are either simply copied (if no CITE-seq) or updated and written out (with CITE-seq).
But! If, as we've lightly discussed over in #183, the `sce_dir` argument may be going away, then maybe it's very fair not to copy files that don't need further processing, since the directories where filtered libraries are stored will be going into the metadata?
Please also note there are a lot of things called "filtered" at this point and my brain is starting to crack, so maybe I'm just confusing myself with paths. 🥴
I definitely don't think we need to be copying any files that don't have CITE-seq, that seems like a waste of space that we should be able to avoid. Given that we are going to be adding the input directory to be required in the metadata file, I think we can specifically store the SCEs with CITE-seq that have been further processed in a new directory and specify using that for input to integration.
@allyhawkins I'm coming back to revise this shortly, and want to note that in https://github.com/AlexsLemonade/sc-data-integration/pull/184/commits/f449ad39c90c8841232c00749062b530485ad7e2 I ended up refactoring the code a bit when merging in `main` to address conflicts. Specifically, I updated the new function I added, `find_downstream_sce_files()`, so it can handle multiple SCE directories, and it is now incorporated into `02-prepare-merged-sce.R`. I also moved your small helper function `perform_file_search()` into `integration-helpers.R`. Currently it is only called from `find_downstream_sce_files()`, but I imagine it could be helpful for other use cases that come up! Please let me know if you have feedback on these choices when I re-request review (not ready yet, I think).
Okay sounds good! I'll wait for the review re-request to take a look at everything then.
@allyhawkins ready for a look now!
The metadata slot now includes a column `citeseq_percent_cells_removed`, and I now print a warning if the `while` loop (which is still present, for now) is entered. Regarding your comment:
> So from this maybe we should be removing low-abundance ADTs, not cells with a size factor of 0, to combat the problem if we move forward with the `while` loop. It looks like we could use `librarySizeFactors`, but there is an additional note about working with ADT data that recommends against using that.
I feel like they make a good argument against using `librarySizeFactors` due to biases, so I'm hesitant to go that direction. Something we might consider is a threshold for throwing an error if "too many" size factors are 0 after the filtering step, but I don't have a clear sense of how to choose this in a principled manner besides just saying "eh, how about 1% of cells"...
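A minimal sketch of what such a guard could look like, in case it helps the discussion. The function name `check_zero_size_factors` and the default threshold of 1% are illustrative, not anything we've settled on:

```r
# Hypothetical guard: error out if "too many" ADT size factors are zero.
# `sf` is a numeric vector of size factors; `max_zero_frac` is an assumed
# threshold (e.g. 1% of cells), chosen here only for illustration.
check_zero_size_factors <- function(sf, max_zero_frac = 0.01) {
  zero_frac <- mean(sf == 0)
  if (zero_frac > max_zero_frac) {
    stop(sprintf(
      "%.1f%% of cells have a size factor of 0, exceeding the %.1f%% threshold.",
      100 * zero_frac, 100 * max_zero_frac
    ))
  }
  zero_frac
}
```

For example, `check_zero_size_factors(c(1.2, 0.8, 0, 1.1, 0.9))` would throw, since 20% of the cells have a zero size factor.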
@allyhawkins I've implemented this ADT check, and sadly it doesn't help. I included a command-line argument for `adt_threshold` and played around with different values: 1, 2, 10, 20, 30, 40, 50. At <=40 there are still 2 cells in `SCPCL000296` with 0 size factors, and at 50 there are no more 0 size factors but we lose a lot of ADTs!
Currently the script errors out entirely if 0's are encountered. Now I'm thinking the way to go here could be, rather than iteratively calculating size factors with `while` until all are > 0, to just do the calculation once and remove cells with a size factor of 0.
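To illustrate the one-pass idea on a toy matrix (the real script would use `scuttle::medianSizeFactors()` on the `altExp`; the library-size proxy below is only a stand-in so the example is self-contained):

```r
# Toy illustration of the one-pass approach: compute size factors once,
# then drop cells whose size factor is 0, rather than looping.
counts <- matrix(
  c(5, 0, 3,
    2, 0, 4,
    1, 0, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("ADT", 1:3), paste0("cell", 1:3))
)
# Stand-in size factors; cell2 has all-zero counts, so its factor is 0
sf <- colSums(counts) / mean(colSums(counts))
keep <- sf > 0
filtered <- counts[, keep, drop = FALSE]
sf_kept <- sf[keep]
```

After the single pass, `cell2` is gone and every remaining cell has a positive size factor, with no second round of recomputation.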
You are using the output from downstream analyses right? I'm confused why you are seeing failures? Running the below code on my end works for me without errors... and only removes 1 ADT that has very low expression (mean 0.005).
```r
library(SingleCellExperiment)
library(magrittr)

sce_dir <- "results/scpca/scpca-downstream-analyses"
sce <- readr::read_rds(file.path(sce_dir, "SCPCS000222", "SCPCL000296_miQC_processed_sce.rds"))

# identify barcodes to keep based on ADT QC
retain_barcodes <- DropletUtils::cleanTagCounts(altExp(sce)) %>%
  tibble::as_tibble(rownames = "cell_barcode") %>%
  dplyr::filter(discard == FALSE) %>%
  dplyr::pull(cell_barcode)

# identify ADTs with sufficient expression to keep
retain_adts <- as.data.frame(rowData(altExp(sce))) %>%
  tibble::rownames_to_column("ADT") %>%
  dplyr::filter(detected > 1 & mean > 1) %>%
  dplyr::pull(ADT)

# first filter sce
filtered_sce <- sce[, retain_barcodes]
altExp(filtered_sce) <- altExp(filtered_sce)[retain_adts, ]

# compute ADT size factors against an ambient baseline and normalize
baseline <- DropletUtils::ambientProfileBimodal(altExp(filtered_sce))
sf_amb <- scuttle::medianSizeFactors(altExp(filtered_sce), reference = baseline)
sizeFactors(altExp(filtered_sce)) <- sf_amb
sce <- scater::logNormCounts(altExp(filtered_sce), size.factors = sf_amb)
```
> You are using the output from downstream analyses right? I'm confused why you are seeing failures?
I definitely am, so now we're all confused!! I honestly can't find the discrepancy between our code, other than my direct reference to the CITE-seq altExp name (but this won't matter for our data!). I've run your code through several times, and mine through several times... and there's something I'm just not seeing, because sometimes I get 0 and sometimes I get 2 size factors equal to 0! Hm!
@allyhawkins Ok, I think I've found and addressed the fiddly logic. Now I'm at no 0 size factors! I think we're about there..
One final question I have is how we feel about the file name. Should the name be less prescriptive about pipeline order, like `add-rms-celltypes.R` is, without the `01a` prefix?
@allyhawkins Also note the change as well in https://github.com/AlexsLemonade/sc-data-integration/pull/184/commits/667fb30805ca3287295d35a7e52f7a1512d1c9f5. I was starting to run the Gawad data through when I realized these filenames could be a bit more informative, analogous to how you named the celltype SCEs.
Related to #181
Closes #185
This PR adds a script, `00e-process-scpca-citeseq.R`, to filter and normalize ADT counts before proceeding to the downstream analyses pipeline. As part of this review, we should discuss both logistical approaches (namely how I handled directories) and the actual steps taken to process the ADT counts.

I updated `scripts/README.md` to explain that this script should be run after `00d-obtain-scpca-sce.R` but before the downstream analyses pipeline. The script will read in all files from the `data/scpca/filtered_sce/` directory and check if they have CITE-seq. If they do, the data is filtered and normalized, and ultimately saved in `results/scpca/filtered_sce/`, which should now be used as the input directory for downstream analyses. If they do not have CITE-seq, we still export the unprocessed SCE to `results/scpca/filtered_sce/` for consistency. A different approach here could be to integrate the CITE-seq processing into `preprocess-sce.R`, which would remove the need for this stand-alone script wholesale, as well as the need for extra results directories. However, this step may be specific to ScPCA, so it may not make sense to include it there. Alternatively, we could have a column in the library file, `has_CITEseq`, and process only those libraries, which would remove the need to read in every file and check for CITE-seq.

For the actual ADT processing, I followed OSCA when possible and went rogue otherwise!

- `DropletUtils::cleanTagCounts()` to identify cells to remove.
- A `while` loop, for the first time in forever, to keep purging cells with 0 size factors until we aren't 0 anymore. This is my rogue coding! Extremely open for discussion!
- Processed SCEs are saved to `results/scpca/filtered_sce/`, which will serve as input for downstream analyses.

Note that my repeat logic at the bottom maybe could use some help...
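For reference, the iterative purge can be sketched roughly like this on a toy matrix. The size-factor computation here is a simple library-size proxy standing in for `scuttle::medianSizeFactors()`, and `purge_zero_sf_cells` is a hypothetical name, not a function in the script:

```r
# Toy sketch of the while-loop logic: recompute size factors and drop
# zero-size-factor cells until none remain.
toy_size_factors <- function(mat) colSums(mat) / mean(colSums(mat))

purge_zero_sf_cells <- function(mat) {
  sf <- toy_size_factors(mat)
  while (any(sf == 0)) {
    mat <- mat[, sf > 0, drop = FALSE]
    sf <- toy_size_factors(mat)  # recompute on the reduced matrix
  }
  list(mat = mat, size_factors = sf)
}

counts <- matrix(c(4, 0, 2,
                   3, 0, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("ADT1", "ADT2"), paste0("cell", 1:3)))
res <- purge_zero_sf_cells(counts)
```

Here `cell2` is dropped on the first pass and the loop exits after one recomputation; with real ADT data, removing cells changes the size factors, which is what makes the loop (and its stopping behavior) worth discussing.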
Finally, there is one additional incidental change in `00d-obtain-scpca-sce.R`. I think there's a bug there, so I swapped out a variable to fix it. Confirm?