10XGenomics / cellranger

10x Genomics Single Cell Analysis
https://www.10xgenomics.com/support/software/cell-ranger

CellRanger Stage Memory Issue #190

Closed. 1onic closed this issue 11 months ago.

1onic commented 1 year ago

Hello,

I am currently trying to configure CellRanger on a large cluster (10k+ cores), and I am finding it necessary to use the overrides option flag to provide a .json containing an override targeted at the following CellRanger stage memory issue:

2022-10-06 00:25:19 [runtime] (ready)           ID.AGG1.SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR.NORMALIZE_DEPTH
2022-10-06 00:25:19 [runtime] (run:slurm)       ID.AGG1.SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR.NORMALIZE_DEPTH.fork0.split
2022-10-06 00:25:19 [runtime] (ready)           ID.AGG1.SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR.CRISPR_AGGR_INPUT_PREP
2022-10-06 00:25:19 [runtime] (run:slurm)       ID.AGG1.SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR.CRISPR_AGGR_INPUT_PREP.fork0.chnk0.main
2022-10-06 00:25:50 [runtime] (failed)          ID.AGG1.SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR.NORMALIZE_DEPTH

Leading to:

[error] Pipestance failed. Error log at:
AGG1/SC_RNA_AGGREGATOR_CS/COUNT_AGGR/SC_RNA_AGGREGATOR/NORMALIZE_DEPTH/fork0/split-u36763e66df/_errors
Log message:
Job failed in stage code
signal: killed

Inspecting SC_RNA_AGGREGATOR_CS/COUNT_AGGR/SC_RNA_AGGREGATOR/NORMALIZE_DEPTH and looking at "/fork0/split-u36763e66df/_errors", I see the following:

Job failed in stage code
signal: killed
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=4634534.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

"The latter is more serious, as clusters may impose strict memory limits, and kill a job if those limits are exceeded."

Based on the above, I thought that increasing the memory of SC_RNA_AGGREGATOR_CS/COUNT_AGGR/SC_RNA_AGGREGATOR/NORMALIZE_DEPTH through https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/cluster-mode would mean that my overrides file should look like this:

{
    "SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR.NORMALIZE_DEPTH": {
       "chunk.mem_gb": 24,
       "chunk.threads": 2
    }
}

But this does not work and leads to the same issue shown above (oom-kill at the fork0/split stage). After applying the override above, I also saw that the jobscript related to fork0/split still requested the same 4G. How can I increase the memory of the fork0/split stage in this case? Do I need to specify additional information in the json?

Best and thanks,

evolvedmicrobe commented 1 year ago

Hi @1onic,

This is probably worth a ping to support@10xgenomics.com; they'll be able to work through this with you and also provide instructions on how to upload the pipestance data, which will allow us to debug more easily.

But as a quick response: you hit the issue spot on. Your change adjusts the memory for the chunk phase, but it was the split step of the split/chunk/join process that ran out of memory. To change the memory request for the split step, convert what you wrote to split.mem_gb; see here for more details and an example. I wouldn't bother changing the default threads.
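Concretely, keeping your stage path and your 24 GB figure (treat that value as a starting point, not a recommendation), the corrected overrides file would look something like this:

{
    "SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR.NORMALIZE_DEPTH": {
        "split.mem_gb": 24
    }
}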

I am curious how you encountered a memory issue here, though. When we developed the recently released Chromium X instrument, the stages in our aggregation pipeline were heavily stress tested to ensure they could handle the incredibly large datasets that instrument enables, and this stage shouldn't consume more than the default 4 GB of memory. Some clusters limit virtual memory instead of actual in-use memory, which might partially explain this, but if you upload the tarball to our support team we can likely diagnose it better and make sure no one else encounters this problem.
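If it does turn out to be a virtual memory limit, one possible workaround is to raise the virtual memory request for the split step as well. This is a sketch that assumes your Martian/Cell Ranger version accepts vmem_gb override keys, which is worth verifying against the cluster-mode documentation before relying on it:

{
    "SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR.NORMALIZE_DEPTH": {
        "split.mem_gb": 24,
        "split.vmem_gb": 48
    }
}

The 48 here is an arbitrary illustrative value; size it to whatever limit your cluster actually enforces.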

Warm wishes, Nigel

1onic commented 1 year ago

Hi @evolvedmicrobe,

Thanks for the overrides. This issue was encountered when aggregating the KD8 dataset from Replogle et al. 2022: https://pubmed.ncbi.nlm.nih.gov/35688146/