digitalcytometry / cytotrace2

CytoTRACE 2 is an interpretable AI method for predicting cellular potency and absolute developmental potential from scRNA-seq data.

CytoTRACE 2 on large dataset (> 1 million cells) #13

Closed ttszen closed 4 months ago

ttszen commented 5 months ago

Hello! I am trying to run CytoTRACE 2 on a large dataset of over a million cells. I am running it on an HPC cluster, but the job times out after more than 12 hours, having used the maximum available memory (3 TB).

Sample identity is the main driver of variation in my dataset; cells from the same sample cluster closely together on a UMAP. Would you recommend splitting the dataset and running on each sample separately?

Alternatively, there are technical batches as well, since the samples were sequenced across a couple of sequencing runs. Would you recommend splitting by sequencing run instead?

It'd be great to get your thoughts and I'd appreciate any recommendations. Thanks again for a wonderful tool!

savagyan00 commented 5 months ago

Hi and thank you for using CytoTRACE 2!

We generally recommend running CytoTRACE 2 separately for each batch. Although raw predictions are made per cell independently, the postprocessing step refines predictions using information from other cells in the dataset, which could be influenced by batch effects. For more details, please see item 4 in our FAQ.

Given your situation, you can subdivide your data further, either arbitrarily or by sample, into smaller, more manageable subsets. How big, typically, are your samples within each technical batch? We suggest limiting these subsets to under 100,000 cells each for computational efficiency; doing so should not significantly affect the results compared to running on the entire dataset at once.
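As a minimal sketch of the splitting step described above (pure pandas/NumPy; the batch labels and metadata layout are assumptions, and the CytoTRACE 2 call on each resulting subset is omitted), each batch's cells can be chunked so that no subset exceeds 100,000 cells:

```python
import numpy as np
import pandas as pd

MAX_CELLS = 100_000  # suggested upper bound per subset

def split_into_subsets(cell_meta: pd.DataFrame, batch_col: str,
                       max_cells: int = MAX_CELLS):
    """Group cell indices by batch, then chunk any batch larger than max_cells."""
    subsets = []
    for _, group in cell_meta.groupby(batch_col):
        idx = group.index.to_numpy()
        # split oversized batches into roughly equal chunks under max_cells
        n_chunks = int(np.ceil(len(idx) / max_cells))
        subsets.extend(np.array_split(idx, n_chunks))
    return subsets

# toy example: 250k cells across two sequencing runs
meta = pd.DataFrame({"batch": ["run1"] * 150_000 + ["run2"] * 100_000})
subsets = split_into_subsets(meta, "batch")

# every subset stays under the 100k-cell limit
assert all(len(s) <= MAX_CELLS for s in subsets)
```

Each index array in `subsets` would then be used to subset the expression matrix before running CytoTRACE 2 on that chunk independently.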

Please let us know if this approach works for you or if you need further assistance!

ttszen commented 4 months ago

Hi! Thank you so much for your help and clarification. Apologies for the delayed response; I was trying a few different things. To provide some context: there were 2 sequencing batches (~600k cells each), and within each sequencing batch there are organoid samples treated under different conditions across 3 time points. The experiments were done with 2 biological replicates.

To run CytoTRACE 2, I have defined each batch as one organoid at one time point for one replicate. This split gives me between 20,000 and 80,000 cells per batch and seems to produce reasonable results. Does this split seem appropriate to you?
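For reference, a batch label of this kind (one per organoid × time point × replicate combination) can be built directly from the cell metadata; the column names below are hypothetical, not taken from the thread:

```python
import pandas as pd

# Hypothetical per-cell metadata (column names are illustrative)
meta = pd.DataFrame({
    "organoid":  ["O1", "O1", "O2", "O2"],
    "timepoint": ["T1", "T2", "T1", "T2"],
    "replicate": ["R1", "R1", "R2", "R2"],
})

# One batch label per organoid x time point x replicate combination,
# so each group can be passed to CytoTRACE 2 independently.
meta["batch"] = meta[["organoid", "timepoint", "replicate"]].agg("_".join, axis=1)
print(meta["batch"].tolist())  # → ['O1_T1_R1', 'O1_T2_R1', 'O2_T1_R2', 'O2_T2_R2']
```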

As the samples are cancer organoids, I was also wondering whether CytoTRACE 2 is an appropriate method for them, given the context the model was trained on?

Thanks again for the help and looking forward to discussing more.

savagyan00 commented 4 months ago

Hi, it’s great to hear back about your progress! Based on your description, your strategy for splitting the dataset seems appropriate: it manages the computational load effectively while retaining sufficient data granularity for robust analysis.

Regarding the use of CytoTRACE 2 on cancer organoids: given the versatility of the underlying algorithm, you can expect reasonable performance, but we would advise independent confirmation and functional validation before drawing firm inferences. Organoids can exhibit unique cellular behaviors and microenvironments that might influence the predictions differently than primary tumors, on which our tool has been demonstrated to work effectively.

Please do not hesitate to reach out if you have more questions!

ttszen commented 4 months ago

Hi! Thanks so much for getting back to me. Glad that there is agreement on the strategy for splitting the dataset.

Thanks as well for your comments on applying CytoTRACE 2 to cancer organoids. I will keep this in mind as I go through the analysis and will use orthogonal methods to validate findings, as suggested. I'll close this issue for now but will reach out if I have further questions.

Thank you so much again for the useful tool!