UCSF-DSCOLAB / data_processing_pipelines

A repository to store the existing pipelines to process the various CoLabs datasets

Misc Run Setup: Add freecycle, adjust default errorStrategy, and fix resume/cleanup #60

Open erflynn opened 9 months ago

erflynn commented 9 months ago

This PR addresses several feature requests and modifications to the run script and configuration for the single cell pipeline. Specifically, it adds the freecycle partition, adjusts the default errorStrategy, and fixes resume/cleanup behavior.

erflynn commented 9 months ago

A couple of specific questions for discussion: @AlaaALatif - do you want me to update the partition and working directory for the bulk pipeline as well? Happy to do so; it's an easy change and I can push a commit.

@dtm2451 @amadeovezz @AlaaALatif - what are your thoughts on having the default partition be freecycle,krummellab,common, the default maximum number of jobs be 60, and errorStrategy be 'finish'? (Note: errorStrategy will be changed when we switch to dynamic retries as for bulk; this is temporary.)
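For concreteness, the proposed defaults could look something like the sketch below in the pipeline's Nextflow config. The file path and exact option names here are illustrative, not taken from the repository:

```groovy
// Hypothetical sketch of the proposed defaults (file path illustrative)
process {
    queue         = 'freecycle,krummellab,common' // SLURM partitions, tried in listed order
    errorStrategy = 'finish'                      // on failure, let already-running tasks finish
}

executor {
    name      = 'slurm'
    queueSize = 60                                // cap on concurrently submitted SLURM jobs
}
```

With `queueSize = 60`, Nextflow will never have more than 60 tasks queued or running in SLURM at once, regardless of how many are ready.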

dtm2451 commented 9 months ago

> what are your thoughts on having the default partition be: freecycle,krummellab,common, default number of max jobs to be 60, and errorStrategy to 'finish' (note: errorStrategy will be changed when we switch to dynamic retries like for bulk, this is temporary)

erflynn commented 9 months ago

@amadeovezz when you're reviewing, can you also double check the structure described in the updated example-inputs/README.md matches your updated input config structure?

amadeovezz commented 9 months ago

> @amadeovezz when you're reviewing, can you also double check the structure described in the updated example-inputs/README.md matches your updated input config structure?

Yep they match!

Minor note: fmx_assign_to_gt and ref_vcf_dir aren't in there yet, but your PR is close to merging so they should be there soon

erflynn commented 9 months ago

Helpful feedback from the Nextflow Slack on how to set max.cpus and max.memory with SLURM:

> If you're looking to cap / threshold all requests from a pipeline, that's more difficult. You can do it in nf-core pipelines using --max_cpus and --max_memory etc (docs) but that's specific to nf-core and not a general Nextflow feature (yet, there's an issue to add it as core functionality)

https://github.com/nextflow-io/nextflow/issues/640 https://github.com/nf-core/tools/blob/99961bedab1518f592668727a4d692c4ddf3c336/nf_core/pipeline-template/nextflow.config#L206-L237
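The nf-core approach linked above works by wrapping every resource request in a capping helper. A simplified sketch of that pattern, modeled loosely on the nf-core template (the helper name and params follow nf-core conventions, but this is not a core Nextflow feature):

```groovy
// Pipeline-wide caps, overridable on the command line (nf-core convention)
params.max_cpus   = 16
params.max_memory = '64.GB'

// Clamp a per-process request to the pipeline-wide maximum
def check_max(obj, type) {
    if (type == 'memory') {
        def max = params.max_memory as nextflow.util.MemoryUnit
        return obj.compareTo(max) == 1 ? max : obj
    } else if (type == 'cpus') {
        return Math.min(obj as int, params.max_cpus as int)
    }
    return obj
}

process {
    // Requests grow with retries but never exceed the caps
    cpus   = { check_max(4 * task.attempt, 'cpus') }
    memory = { check_max(16.GB * task.attempt, 'memory') }
}
```

Because the cap is applied inside each closure, it composes with the dynamic-retry scaling mentioned earlier: a retried task asks for more resources, but `check_max` keeps the request within what the cluster partition can actually grant.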

erflynn commented 9 months ago

Lingering to-dos:

erflynn commented 7 months ago

Weird error that I have now patched -- @dtm2451 and I observed that cellranger runs much more slowly (days instead of hours) from the nextflow pipeline. It appears to be the difference between writing to /scratch/<user>/<cr_job_id>/ (what we typically use) and /c4/scratch/<user>/<parent_nf_job>/ (the nextflow working directory, shared across nodes); /scratch/ is much faster.

I have now changed this, but tagging @amadeovezz @AlaaALatif -- if you have slow steps, they should write to /scratch/ instead of /c4/scratch/ and then move the data to /c4/scratch/ afterwards.

We probably want to switch to doing this for all steps that take more than a few minutes. I think this nextflow directive should do it, but I need to look into it more: https://www.nextflow.io/docs/latest/process.html#scratch
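If the linked `scratch` directive works as documented, the write-locally-then-stage-back pattern could be applied pipeline-wide from config rather than hand-coded in each step. A minimal sketch, assuming /scratch is the fast node-local filesystem described above:

```groovy
process {
    // Run each task inside a temporary directory under node-local /scratch;
    // Nextflow stages inputs in and copies outputs back to the shared
    // work directory (e.g. under /c4/scratch) when the task completes.
    scratch = '/scratch'

    // Alternatively, apply it only to slow, I/O-heavy steps:
    withName: 'CELLRANGER_COUNT' {   // hypothetical process name
        scratch = '/scratch'
    }
}
```

Setting `scratch = true` instead of a path would use the node's $TMPDIR, which may or may not point at the fast disk on C4, so an explicit path is probably safer here.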