Closed allyhawkins closed 3 years ago
Thanks for taking the time to go through this @jashapiro, I really appreciate it and apologize for making it into such a large PR with so many moving pieces. I think I was still working through how everything would fit together and was having trouble explaining one piece without the other, probably because of my lack of comments!
I'll go through and incorporate your comments, move some functions over to `scpcaTools`, add in a lot more comments, and clean up the `for` loops. I need to get more comfortable using `purrr` and all of its capabilities!
I have gone through and taken out the functions that calculate per-cell QC with a mito subset and the functions that grab `colData` and `rowData` and put them into dataframes, and transferred those over to `scpcaTools`. I have simplified the remaining functions, added documentation using the `roxygen2` skeleton to keep documentation consistent, and added more comments. I have also tried to simplify the main script, removing all of the `for` loops with counters. Saving the `.rds` files is now optional, controlled by the `--save` flag in the main R script. As input, the script now takes two `.tsv` files: one is a list of sample IDs and the other is a list of tools used. I have also tried to incorporate more checks along the way in both the main script and the functions.
In reviewing, please let me know if there are still places that are unclear, and/or places that could still be further simplified or done in a more efficient manner.
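As a rough sketch of the kind of change described above (replacing a counter-based `for` loop with `purrr`), here is a minimal example. The sample IDs and `import_one_sample()` are hypothetical stand-ins, not the actual import code in this PR:

```r
library(purrr)

# Hypothetical sample IDs; the real script reads these from a .tsv file
sample_ids <- c("sample_A", "sample_B", "sample_C")

# Hypothetical stand-in for the real per-sample import step
import_one_sample <- function(sample_id) {
  list(id = sample_id, n_chars = nchar(sample_id))
}

# Before: for loop with an explicit counter
results_loop <- vector("list", length(sample_ids))
for (i in seq_along(sample_ids)) {
  results_loop[[i]] <- import_one_sample(sample_ids[i])
}
names(results_loop) <- sample_ids

# After: name the vector by its own values, then map over it;
# map() carries the names through to the output list
results_map <- map(set_names(sample_ids), import_one_sample)
```

The `map()` version avoids pre-allocating the list and tracking an index, and the output is named by sample ID without a separate `names()` call.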
In working on #92 and the `scpcaTools` package, I also started working on ways to grab the output data from S3, import it using `scpcaTools`, calculate QC metrics, and create the dataframes needed for plotting the comparisons we are interested in for benchmarking. We have done this a few different ways now and previously had been using the code in `01-import-quant-data.Rmd`. In that script we import data for each unique tool and parameter combination to create a separate list of `SingleCellExperiments` before combining the `colData` or `rowData` into one dataframe used for generating plots of interest.

As we have now repeated some of the tasks in that code a few times, it made sense to me to start splitting out some of the more specific benchmarking and internal data wrangling steps into functions. So here I am adding a group of benchmarking functions:
- `aws_copy_samples.R`: This function takes a list of samples, a location on S3, and the tools used, then grabs all the relevant samples from S3 and copies them to a local directory.
- `quant_info_table.R`: This creates a table with relevant information about each sample that was run and the tool configuration that was used. It also creates the columns needed to run `scpcaTools::import_quant_data()` based on which tool and parameters were used.
- `make_sce_list.R`: This function applies `scpcaTools::import_quant_data()` to a set of samples, creates a list of `SingleCellExperiments`, and then adds both `colData` and `rowData` using `scater::addPerCellQC` and `scater::addPerFeatureQC`.
- `coldata_to_df` and `rowdata_to_df`: Both take the `colData` or `rowData` of a `SingleCellExperiment` object and create a dataframe.

Using these 4 functions (some may or may not be useful things we could think about adding to scpcaTools), I then have a main script, `benchmarking_generate_qc_df.R`, that takes a list of samples and tools used, creates lists of `SingleCellExperiment` objects and saves them as `.rds` files, and then writes out the `quant_info` table, the coldata dataframe, and the rowdata dataframe for all samples used in the benchmarking experiment. These output files are similar to the tables that were the inputs for all of the plots generated in `04-cell-level-benchmarking-metrics.Rmd` and `05-gene-level-benchmarking-metrics.Rmd`. My thought process behind making this a script is to be able to create these same dataframes and `.rds` files for any groupings of samples and tools that we might be interested in looking at.

In addition to getting feedback on the overall flow and structure of the main script interacting with the additional functions, I have a few more specific questions for reviewers:
`.rds` file?
`scpcaTools` instead?
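To make the `coldata_to_df` step from the function list above more concrete, here is a minimal sketch of what such a helper could look like. This is not the code in this PR: the function name `coldata_to_df_sketch`, the `cell_id` column name, and the toy data are all assumptions for illustration.

```r
library(SingleCellExperiment)

# Hypothetical sketch of a colData-to-dataframe helper; not the PR's code
coldata_to_df_sketch <- function(sce) {
  df <- as.data.frame(colData(sce))
  # keep the cell names as a regular column (column name is an assumption)
  df$cell_id <- colnames(sce)
  rownames(df) <- NULL
  df
}

# Toy example: 3 genes x 2 cells
counts <- matrix(
  1:6,
  nrow = 3,
  dimnames = list(c("gene1", "gene2", "gene3"), c("cell1", "cell2"))
)
sce <- SingleCellExperiment(assays = list(counts = counts))
sce$sample_id <- rep("sample_A", ncol(sce))

coldata_df <- coldata_to_df_sketch(sce)
```

A `rowData` version would follow the same pattern with `rowData(sce)` and `rownames(sce)`.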