Motivation

Currently, it takes around 110 minutes to aggregate all 9 SQLite single-cell plate files (130.07 GB) into aggregated profiles. However, we can improve the computational time by splitting the job across multiple cores.

This PR demonstrates multiprocessing support in `aggregate_cells.py` in order to improve computational performance.
Approach
Changes were applied to `aggregate_cells.py` and to `preprocess.smk` under the `aggregate` rule. Since this step is an iterative process, we can assign each input to an individual core to conduct the aggregation process.

However, this requires some tweaks to how the parameters are passed in order to conduct multiprocessing:
```python
import itertools
import multiprocessing as mp

# transforming snakemake objects into python standard datatypes
sqlfiles = [str(sqlfile) for sqlfile in snakemake.input["sql_files"]]
cell_count_out = [str(out_name) for out_name in snakemake.output["cell_counts"]]
aggregate_profile_out = [
    str(out_name) for out_name in snakemake.output["aggregate_profile"]
]

# static inputs are repeated so zip() pairs them with every plate
meta_data_dir = itertools.repeat(str(snakemake.input["metadata"]))
barcode_map = itertools.repeat(str(snakemake.input["barcodes"]))
config_path = itertools.repeat(str(snakemake.params["aggregate_config"]))

inputs = list(
    zip(
        sqlfiles,
        meta_data_dir,
        barcode_map,
        cell_count_out,
        aggregate_profile_out,
        config_path,
    )
)

# init multiprocessing pool; cap the core count at the number of inputs
n_cores = int(snakemake.threads)
if n_cores > len(inputs):
    print(
        f"WARNING: number of specified cores exceeds number of inputs, "
        f"defaulting to {len(inputs)}"
    )
    n_cores = len(inputs)

with mp.Pool(processes=n_cores) as pool:
    pool.starmap(aggregate, inputs)
    pool.close()
    pool.join()
```
- `sqlfiles`, `cell_count_out`, `aggregate_profile_out`, `meta_data_dir`, and `barcode_map` are inputs obtained from `snakemake`
- `itertools.repeat` is essential for the `zip` function in order to generate static inputs
- without it, `zip` iterates over the path string character by character, so single characters end up inside the `inputs` list (see the sketch after this list)
- `pool` instantiates the worker pool with the number of cores given by the `snakemake.threads` rule parameter
- `pool.starmap` takes the function name and the `inputs` iterator and spreads the work across the assigned cores
- `pool.close()` and `pool.join()` are required to indicate that all jobs are complete and to close the multiprocessing scheduler
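To make the `itertools.repeat` point concrete, here is a minimal, self-contained sketch (the file names are invented for illustration):

```python
import itertools

sqlfiles = ["plate1.sqlite", "plate2.sqlite"]  # hypothetical plate files
metadata = "metadata/"

# Without repeat: zip iterates over the string character by character,
# so each plate is paired with a single character of the path.
print(list(zip(sqlfiles, metadata)))
# [('plate1.sqlite', 'm'), ('plate2.sqlite', 'e')]

# With repeat: every plate is paired with the full, static path.
print(list(zip(sqlfiles, itertools.repeat(metadata))))
# [('plate1.sqlite', 'metadata/'), ('plate2.sqlite', 'metadata/')]
```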
The `aggregate` rule uses the `threads` rule attribute as a parameter to specify the number of cores used in the aggregation process. The number of threads is set in CytoPipe's general config file, allowing users to easily access and change the number of threads used in the preprocessing step. (NOTE: multiprocessing is only implemented in the `aggregate` rule, not in the subsequent rules.)
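For reference, a minimal sketch of what the `aggregate` rule in `preprocess.smk` could look like; the paths, config keys, and the `PLATES` list here are illustrative assumptions, not the exact values in the repo:

```python
# illustrative sketch only; PLATES and all paths/keys are assumed
rule aggregate:
    input:
        sql_files=expand("data/{plate}.sqlite", plate=PLATES),
        metadata="data/metadata/",
        barcodes="data/barcode_platemap.csv",
    output:
        cell_counts=expand("results/{plate}_cell_counts.tsv", plate=PLATES),
        aggregate_profile=expand("results/{plate}_aggregate.csv.gz", plate=PLATES),
    params:
        aggregate_config="configs/aggregate_configs.yaml",
    threads: config["aggregate"]["threads"]  # set in the general config file
    script:
        "scripts/aggregate_cells.py"
```

Pulling the thread count from the config file keeps the parallelism setting in one user-facing place instead of hard-coding it in the rule.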
Implementation Diagram
Below is a simple diagram of how the multiprocessing was implemented.

Simple demonstration of the multiprocessing diagram: the array above is the set of input parameters required for the function. The multiprocessing module reserves memory for each parameter tuple and executes it on a CPU core. Each core runs the aggregate step independently until the parameter list is exhausted.
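In case the diagram does not render, the pattern it depicts boils down to a few lines; this is a toy stand-in for the real `aggregate` function, with invented file names:

```python
import multiprocessing as mp
import os

def aggregate(sqlfile, metadata, barcode, cell_count_out, profile_out, config):
    # toy stand-in: each worker process handles one parameter tuple
    print(f"[pid {os.getpid()}] aggregating {sqlfile}")

if __name__ == "__main__":
    inputs = [
        (f"plate{i}.sqlite", "meta/", "barcodes.csv",
         f"counts{i}.tsv", f"profile{i}.csv", "config.yaml")
        for i in range(9)
    ]
    with mp.Pool(processes=4) as pool:
        # each tuple is unpacked into aggregate()'s arguments on a free core
        pool.starmap(aggregate, inputs)
```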
Performance Results
To find the optimal number of cores for aggregation, multiple tests were conducted, increasing the core count by 2 with each test. The time was measured with the `time` command in the terminal:

```
time snakemake -c 10 --use-conda -r aggregate
```
The multiprocessing implementation does increase aggregation performance; however, there is still a major caveat: the largest file will bottleneck the performance.

For the aggregate step to be considered "complete", all cores need to be finished. This raises an issue with larger files, since the largest file takes the most time to process; the pool must wait for that specific process to finish. A really good explanation of this behavior can be found here.

Unfortunately, this is an I/O-bound issue, which means that the computer spends a lot of time reading from the hard disk before conducting any processing. I/O-bound processes are a bottleneck because transferring data from the HDD to RAM is very slow.
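To make the straggler caveat concrete, here is a back-of-the-envelope sketch; the per-plate times below are invented for illustration, not measured values:

```python
# hypothetical per-plate aggregation times in minutes (invented numbers)
plate_times = [8, 9, 10, 10, 11, 12, 12, 14, 35]

serial_minutes = sum(plate_times)    # one core: ~121 min total
parallel_minutes = max(plate_times)  # nine cores, one plate each: ~35 min

# even with a core per plate, wall time is lower-bounded by the largest
# plate: the other eight cores finish early and sit idle waiting for it
print(f"serial:   ~{serial_minutes} min")
print(f"parallel: ~{parallel_minutes} min (bounded by the largest plate)")
```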