Add functions to generate QC metrics for benchmarking

allyhawkins commented 3 years ago

While working on #92 and the scpcaTools package, I also started working on ways to grab the output data from S3, import it using scpcaTools, calculate QC metrics, and create the dataframes needed to plot the comparisons we are interested in for benchmarking. We have done this a few different ways now and had previously been using the code in 01-import-quant-data.Rmd. That notebook imports data for each unique tool and parameter combination into a separate list of SingleCellExperiments, then combines the colData or rowData from all of them into one dataframe used for generating the plots of interest.

As we have now repeated some of the tasks in that code a few times, it made sense to me to start splitting out the more specific benchmarking and internal data-wrangling steps into functions. So here I am adding a group of benchmarking functions:

aws_copy_samples.R: This function takes a list of samples, a location on S3, and the tools used, then grabs all the relevant samples from S3 and copies them to a local directory.
quant_info_table.R: This creates a table with relevant information about each sample that was run and the tool configuration that was used. It also creates the columns needed to run scpcaTools::import_quant_data() based on which tool and parameters were used.
make_sce_list.R: This function applies scpcaTools::import_quant_data() to a set of samples to create a list of SingleCellExperiments, then adds QC metrics to both the colData and rowData using scater::addPerCellQC and scater::addPerFeatureQC.
coldata_to_df and rowdata_to_df: These take the colData or rowData, respectively, of a SingleCellExperiment object and create a dataframe.
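As a rough illustration of the table-building step, here is a minimal base-R sketch of how a per-run information table could be assembled from sample IDs and tool names. The function name `build_quant_info`, the column names, the `other-tool` placeholder, and the directory layout are all assumptions made for illustration. They are not the actual quant_info_table.R code, nor the arguments that scpcaTools::import_quant_data() expects.

```r
# Hypothetical sketch: one row per sample/tool combination.
# Column names and directory layout are illustrative assumptions.
build_quant_info <- function(sample_ids, tools, data_dir) {
  # expand.grid() produces every sample/tool pairing
  info <- expand.grid(
    sample_id = sample_ids,
    tool = tools,
    stringsAsFactors = FALSE
  )
  # Assumed local layout: <data_dir>/<tool>/<sample_id>
  info$quant_dir <- file.path(data_dir, info$tool, info$sample_id)
  info
}

quant_info <- build_quant_info(
  sample_ids = c("sampleA", "sampleB"),
  tools = c("alevin-fry", "other-tool"),
  data_dir = "data/quants"
)
```

Keeping one row per run means downstream steps can iterate over the rows of a single table instead of tracking parallel vectors of samples and tools.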

Using these functions (some of which may be useful additions to scpcaTools), I then have a main script, benchmarking_generate_qc_df.R, that takes a list of samples and tools used, creates lists of SingleCellExperiment objects and saves them as .rds files, and then writes out the quant_info table, the coldata dataframe, and the rowdata dataframe for all samples used in the benchmarking experiment. These output files are similar to the tables that were the inputs for all of the plots generated in 04-cell-level-benchmarking-metrics.Rmd and 05-gene-level-benchmarking-metrics.Rmd. My reason for making this a script is so that we can create these same dataframes and .rds files for any grouping of samples and tools that we might be interested in looking at.
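To make the "combine and write out" step concrete, here is a small base-R sketch that stacks per-run QC tables into one dataframe and writes it as a .tsv. The toy data frames stand in for tables derived from colData, and the function name `combine_qc_tables` and the `run_id` column are hypothetical. The actual script may structure this differently.

```r
# Hypothetical sketch: toy data frames stand in for per-cell QC tables.
combine_qc_tables <- function(qc_list) {
  # qc_list is a named list of data frames, one per sample/tool run
  labelled <- lapply(names(qc_list), function(run_id) {
    # Prepend the run name so rows stay identifiable after stacking
    cbind(run_id = run_id, qc_list[[run_id]], stringsAsFactors = FALSE)
  })
  combined <- do.call(rbind, labelled)
  rownames(combined) <- NULL
  combined
}

qc_list <- list(
  "sampleA-alevin-fry" = data.frame(cell_id = c("c1", "c2"), sum = c(1200, 950)),
  "sampleB-alevin-fry" = data.frame(cell_id = "c3", sum = 400)
)
coldata_df <- combine_qc_tables(qc_list)

# Write the combined table to a temporary .tsv file
out_file <- tempfile(fileext = ".tsv")
write.table(coldata_df, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```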

In addition to getting feedback on overall flow and structure of the main script interacting with the additional functions, I have a few more specific questions for reviewers:

For alevin-fry, I chose to break up the importing by configuration of alevin-fry, something you don't need for other tools - do we think this is necessary or could we treat it all as one and save one large .rds file?
Are there any places where things don't belong here but belong in scpcaTools instead?
Are any functions extraneous, and should any be removed entirely?

allyhawkins commented 3 years ago

Thanks for taking the time to go through this @jashapiro, I really appreciate it and apologize for making it into such a large PR with so many moving pieces. I think I was still working through how everything would fit together and was having trouble explaining one piece without the other, probably because of my lack of comments!

I'll go through and address your comments, move some functions over to scpcaTools, add a lot more comments to the code, and clean up the for loops. I need to get more comfortable using purrr and all of its capabilities!

allyhawkins commented 3 years ago

I have gone through and taken out the functions that calculate per-cell QC with a mito subset and the functions that grab colData and rowData and put them into dataframes, and transferred those over to scpcaTools. I have simplified the remaining functions, added documentation using the roxygen2 skeleton to keep it consistent, and added more comments. I have also tried to simplify the main script, removing all of the for loops with counters. Saving the .rds files is now optional and is controlled by the --save flag in the main R script. As input, the script now takes two .tsv files: one is a list of sample IDs and the other is a list of tools used. I have also tried to incorporate more checks along the way in both the main script and the functions.
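For reviewers who want a feel for the new input handling, here is a minimal base-R sketch of a script that accepts two .tsv files plus an optional --save flag and checks its inputs. The helper names `read_id_list` and `parse_script_args`, the positional argument order, and the headerless single-column file format are assumptions. The real script may parse its arguments differently.

```r
# Hypothetical sketch of the script's input handling (base R only).
read_id_list <- function(path) {
  # Assumes a headerless, single-column .tsv
  ids <- read.delim(path, header = FALSE, colClasses = "character")[[1]]
  if (length(ids) == 0) stop("No entries found in ", path)
  unique(ids)
}

parse_script_args <- function(args) {
  # Assumed usage: <samples.tsv> <tools.tsv> [--save]
  save_rds <- "--save" %in% args
  files <- args[args != "--save"]
  if (length(files) != 2) {
    stop("Expected two .tsv files: a sample list and a tool list")
  }
  missing_files <- files[!file.exists(files)]
  if (length(missing_files) > 0) {
    stop("Input file not found: ", paste(missing_files, collapse = ", "))
  }
  list(
    samples = read_id_list(files[1]),
    tools = read_id_list(files[2]),
    save_rds = save_rds
  )
}

# Temporary files stand in for real inputs
samples_file <- tempfile(fileext = ".tsv")
tools_file <- tempfile(fileext = ".tsv")
writeLines(c("sampleA", "sampleB"), samples_file)
writeLines("alevin-fry", tools_file)

opts <- parse_script_args(c(samples_file, tools_file, "--save"))
```

In an actual script, the vector passed to `parse_script_args()` would come from `commandArgs(trailingOnly = TRUE)`.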

In reviewing, please let me know if there are still places that are unclear, and/or places that could still be further simplified or done in a more efficient manner.

AlexsLemonade / alsf-scpca

Add functions to generate QC metrics for benchmarking #99