Question regarding parallelization and sampling

ccruizm commented 3 months ago

Good day!

I am excited to test your new version of CYTOTRACE! I am running it on my data (it has not finished yet) and already have a couple of questions:

I set ncores=16 but still see that some sections only use one core. Is that because those sections can only run using one core, or is there an issue with parallelizing?
I have ~250K cancer cells for which I want stemness/differentiation scores. I have subsampled them to ~30% of the cells and set batch_size = NULL and smooth_batch_size = NULL. Are there significant differences between subsampling and using all cells for the predictions? Is there a proportion you recommend for the subsampling? The defaults are 10K and 1K, respectively, but in vignette one you mention that the values used to reproduce the results were 10x higher than the defaults (or maybe it is a typo 😅).

Below is the output I have so far.

cytotrace2_result <- cytotrace2(malignant_subset,
                                species = "human",
                                is_seurat = TRUE,
                                slot_type = "counts",
                                full_model = TRUE,
                                batch_size = NULL,
                                smooth_batch_size = NULL,
                                parallelize_models = TRUE,
                                parallelize_smoothing = TRUE,
                                ncores = 16,
                                max_pcs = 200,
                                seed = 14)

cytotrace2: Started loading data

Warning message in asMethod(object):
“sparse->dense coercion: allocating vector of size 11.0 GiB”
Dataset contains 19248 genes and 76456 cells.

Please consider reducing the batch_size to 10000 for runtime and memory efficiency.

cytotrace2: Running on 1 subsample(s) approximately of length 76456

cytotrace2: Started running on subsample(s). This will take a few minutes.

cytotrace2: Started preprocessing.

14091 input genes mapped to model genes.

cytotrace2: Started prediction.

This section will run using  1/128 core(s).

cytotrace2: Started postprocessing.

Please consider reducing the smooth_batch_size to a number in range 1000 - 3000 for runtime and memory efficiency.

cytotrace2: Running with fast mode (subsamples are processed in parallel)

This section will run on 1 sub-sample(s) of approximately 76456 cells each using 1/128 core(s).

Thanks in advance!

ccruizm commented 3 months ago

Update: I just ran it using the default downsampling and full_model=FALSE, and that one did parallelize the jobs. What could be the reason for that behavior? Thanks again!

cytotrace2_result <- cytotrace2(malignant_subset,
                                species = "human",
                                is_seurat = TRUE,
                                slot_type = "counts",
                                full_model = FALSE,
                                batch_size = 10000,
                                smooth_batch_size = 1000,
                                parallelize_models = TRUE,
                                parallelize_smoothing = TRUE,
                                ncores = 40,
                                max_pcs = 200,
                                seed = 14)

cytotrace2: Started loading data

Warning message in asMethod(object):
“sparse->dense coercion: allocating vector of size 11.0 GiB”
Dataset contains 19248 genes and 76456 cells.

cytotrace2: Running on 8 subsample(s) approximately of length 10000

cytotrace2: Started running on subsample(s). This will take a few minutes.

cytotrace2: Started preprocessing.

14091 input genes mapped to model genes.

cytotrace2: Started prediction.

This section will run using  8/128 core(s).

cytotrace2: Started postprocessing.

cytotrace2: Running with fast mode (subsamples are processed in parallel)

This section will run on 10 sub-sample(s) of approximately 956 cells each using 10/128 core(s).

cytotrace2: Started preprocessing.

14091 input genes mapped to model genes.

cytotrace2: Started prediction.

This section will run using  8/128 core(s).

cytotrace2: Started postprocessing.

cytotrace2: Running with fast mode (subsamples are processed in parallel)

This section will run on 10 sub-sample(s) of approximately 956 cells each using 10/128 core(s).

cytotrace2: Started preprocessing.

14091 input genes mapped to model genes.

cytotrace2: Started prediction.

This section will run using  8/128 core(s).

cytotrace2: Started postprocessing.

cytotrace2: Running with fast mode (subsamples are processed in parallel)

This section will run on 10 sub-sample(s) of approximately 956 cells each using 10/128 core(s).

cytotrace2: Started preprocessing.

14091 input genes mapped to model genes.

cytotrace2: Started prediction.

This section will run using  8/128 core(s).

cytotrace2: Started postprocessing.

cytotrace2: Running with fast mode (subsamples are processed in parallel)

This section will run on 10 sub-sample(s) of approximately 956 cells each using 10/128 core(s).

cytotrace2: Started preprocessing.

14091 input genes mapped to model genes.

cytotrace2: Started prediction.

This section will run using  8/128 core(s).

cytotrace2: Started postprocessing.

cytotrace2: Running with fast mode (subsamples are processed in parallel)

This section will run on 10 sub-sample(s) of approximately 956 cells each using 10/128 core(s).

cytotrace2: Started preprocessing.

14091 input genes mapped to model genes.

cytotrace2: Started prediction.

This section will run using  8/128 core(s).

cytotrace2: Started postprocessing.

cytotrace2: Running with fast mode (subsamples are processed in parallel)

This section will run on 10 sub-sample(s) of approximately 956 cells each using 10/128 core(s).

cytotrace2: Started preprocessing.

14091 input genes mapped to model genes.

cytotrace2: Started prediction.

This section will run using  8/128 core(s).

cytotrace2: Started postprocessing.

cytotrace2: Running with fast mode (subsamples are processed in parallel)

This section will run on 10 sub-sample(s) of approximately 956 cells each using 10/128 core(s).

cytotrace2: Started preprocessing.

14091 input genes mapped to model genes.

cytotrace2: Started prediction.

This section will run using  8/128 core(s).

cytotrace2: Started postprocessing.

cytotrace2: Running with fast mode (subsamples are processed in parallel)

This section will run on 10 sub-sample(s) of approximately 956 cells each using 10/128 core(s).

cytotrace2: Finished

savagyan00 commented 3 months ago

Thank you for your interest in the tool and for your feedback!

Regarding the parallelization behavior you observed with parallelize_smoothing = TRUE: smoothing is parallelized across batches of cells, and the number of batches depends on smooth_batch_size. Passing NULL, as you did in your first run, disables batching (and, as a result, the associated parallelization), whereas passing a value, as you did in your second run, enables it. We will update the documentation to clarify this!
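
To make the relationship between batch size and core usage concrete, here is a minimal sketch in R. The helper names `n_batches` and `cores_used` are hypothetical, and both the rounding rule and the cap at `ncores` are assumptions inferred from the log output in this thread, not taken from the CytoTRACE 2 source.

```r
# Hypothetical helper: number of batches for n_cells given a batch size.
# Assumption: NULL disables batching (a single batch); otherwise the count is
# n_cells / size rounded to the nearest integer, with a minimum of one.
n_batches <- function(n_cells, size = NULL) {
  if (is.null(size)) return(1L)
  max(1L, as.integer(round(n_cells / size)))
}

# Hypothetical helper: one worker per batch, capped by ncores (assumption).
cores_used <- function(n_cells, size = NULL, ncores = 1L) {
  min(n_batches(n_cells, size), as.integer(ncores))
}

n_cells <- 76456  # cells in the dataset from this thread

# First run: smooth_batch_size = NULL, ncores = 16 -> one batch, one core
first_run <- cores_used(n_cells, NULL, ncores = 16)

# Second run: batch_size = 10000 splits the data into 8 subsamples ...
n_sub <- n_batches(n_cells, 10000)
per_subsample <- ceiling(n_cells / n_sub)  # roughly 9557 cells each

# ... and smooth_batch_size = 1000 splits each subsample into 10 batches
second_run <- cores_used(per_subsample, 1000, ncores = 40)
```

Under these assumptions the sketch gives 1 core for the first run and 8 subsamples with 10 smoothing batches each for the second, which matches the "1/128", "8/128" and "10/128" lines in the logs above.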

Following your comment, we have updated the internal parallelization logic so that the other type of parallelization in the code (parallelize_models) is fully independent of the smoothing parallelization described above. Ensemble model predictions are now parallelized even when smoothing parallelization is disabled, as long as parallelize_models = TRUE.

Please note that the full_model argument determines the ensemble size for the core model prediction step of CytoTRACE 2 and is not related to the parallelization logic.

Finally, regarding the batch size arguments: the defaults we provide and use for the tutorial vignettes are indeed smaller than the values used to reproduce the manuscript results. Smaller values improve runtime and memory usage, and we find that results across batch sizes are highly correlated in practice. For final results, we recommend larger batch sizes (e.g., batch_size = 100000 and smooth_batch_size = 10000) when system resources permit.
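
As a rough guide to the memory side of this trade-off, the size of a dense expression matrix can be estimated with simple arithmetic. The sketch below assumes double-precision values (8 bytes each); `dense_gib` is a hypothetical helper, not part of CytoTRACE 2, and whether the package holds only one batch in dense form at a time is not stated in this thread.

```r
# Hypothetical helper: size in GiB of a dense genes x cells matrix,
# assuming 8 bytes per value (R's double type).
dense_gib <- function(n_genes, n_cells, bytes_per_value = 8) {
  n_genes * n_cells * bytes_per_value / 2^30
}

# Whole dataset from this thread: 19248 genes x 76456 cells
full_gib <- dense_gib(19248, 76456)   # about 10.96 GiB

# For scale: a dense matrix holding 10,000 cells
batch_gib <- dense_gib(19248, 10000)  # about 1.43 GiB
```

The first value rounds to the 11.0 GiB reported in the "sparse->dense coercion" warning above.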

Let us know if there are any further questions!

digitalcytometry / cytotrace2 #1