Resilience-Biomarkers-for-Aquaculture / Resilience-Biomarkers-for-Aquaculture.github.io

https://resilience-biomarkers-for-aquaculture.github.io/

determine workaround for UW HPC ckpt time limitations #4

Status: Open. shellywanamaker opened this issue 8 hours ago

shellywanamaker commented 8 hours ago

GPU options?

sr320 commented 5 hours ago

We can also split read files, then merge alignments.
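A rough sketch of that split/align/merge idea with standard tools (the filenames, chunk sizes, and the `samtools merge` step are illustrative, not taken from our pipeline; the demo below just shows the split and the round trip on a tiny FASTQ):

```shell
# Demo: split a FASTQ into fixed-size chunks, then merge results back.
# In practice you'd split the real reads (e.g. reads_R1.fastq.gz), align each
# chunk separately under the ckpt time limit, and merge the per-chunk BAMs
# with something like `samtools merge merged.bam chunk_*.bam`.

# Make a tiny demo FASTQ (4 reads; one read = 4 lines).
printf '@r%s\nACGT\n+\nIIII\n' 1 2 3 4 > reads.fastq

# Split into chunks of 2 reads each (-l 8 lines). Real chunks would be ~4M reads.
split -l 8 -d reads.fastq chunk_

# After per-chunk alignment the outputs get merged; for plain text a cat is
# enough to show the records survive the round trip.
cat chunk_* > roundtrip.fastq
cmp reads.fastq roundtrip.fastq && echo "round trip OK"
```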

shellywanamaker commented 5 hours ago

My config file, which I got from Carson Miller at UWpeds, has a parameter that specifies using ckpt for the first attempt at running a task. If that fails, it retries using our node.

process {
    executor = 'slurm'
    queue = { task.attempt == 1 ? 'ckpt' : 'cpu-g2-mem2x' }
    maxRetries = 1
    clusterOptions = { "-A srlab" }
    scratch = '/gscratch/scrubbed/srlab/'
}
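One thing worth double-checking (my assumption, not something from Carson's config): `task.attempt` only ever gets past 1 if the process is actually allowed to retry, which in Nextflow means setting `errorStrategy` to `'retry'`. A sketch with that made explicit:

```groovy
process {
    executor = 'slurm'
    queue = { task.attempt == 1 ? 'ckpt' : 'cpu-g2-mem2x' }
    errorStrategy = 'retry'   // without this, maxRetries never comes into play
    maxRetries = 1
    clusterOptions = { "-A srlab" }
    scratch = '/gscratch/scrubbed/srlab/'
}
```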
shellywanamaker commented 5 hours ago

The methylseq pipeline I'm currently running shows that, for sample EF07-EM01-Zygote_2, the first attempt at bismark align went to ckpt:

[Screenshot 2024-12-04 124433]

And when this failed because of the time limit, it re-attempted on our node:

[Screenshot 2024-12-04 124458]

But this is currently stalled because the time specified conflicts with the scheduled maintenance.

shellywanamaker commented 4 hours ago

I restarted the pipeline and updated my config file to cap individual tasks at 3 days.

process {
    executor = 'slurm'
    queue = { task.attempt == 1 ? 'ckpt' : 'cpu-g2-mem2x' }
    maxRetries = 1
    clusterOptions = { "-A srlab" }
    scratch = '/gscratch/scrubbed/srlab/'
    resourceLimits = [
        cpus: 16,
        memory: '150.GB',
        time: '72.h'
    ]
}
shellywanamaker commented 3 hours ago

Adding this thread from the nf-core Slack direct message I have going with Carson Miller:

Hi Carson, I have a question about the config file and using the ckpt resource on Hyak. I read here https://hyak.uw.edu/docs/compute/checkpoint/ that jobs are stopped and requeued every 4-5 hours. If I have a sample that takes longer than 5 hours to process, will it be able to requeue from where it left off in the pipeline (for instance, if it was halfway through aligning reads)? Or does it restart the alignment from the beginning, in which case would it end up in a loop and never finish aligning, given the 4-5 hour time constraint of ckpt?

Carson Miller Hi Shelly, unfortunately the way that Nextflow caches jobs means that the job will have to be completely restarted. The way I've handled this is with dynamic resource requests. I'm not sure if my config has this, but the idea is that on attempt 1 I submit to the ckpt queue; then, if it fails, I resubmit to another queue (i.e., compute) with an increased time/memory request. I can send you an example if this doesn't make sense.
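Carson's dynamic-request idea can be sketched like this (the queue names and multipliers are illustrative; the closures are re-evaluated on each attempt, so requests grow with `task.attempt`):

```groovy
process {
    executor = 'slurm'
    queue  = { task.attempt == 1 ? 'ckpt' : 'compute' }   // fall back off ckpt on retry
    time   = { 4.h  * task.attempt }                      // escalate time per attempt
    memory = { 16.GB * task.attempt }                     // escalate memory per attempt
    errorStrategy = 'retry'
    maxRetries = 1
}
```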

Shelly Wanamaker Ok, I get that, and I think your config file does do that. I modified it to use the resources I have access to, and I think it's this part:

process {
    executor = 'slurm'
    queue = { task.attempt == 1 ? 'ckpt' : 'cpu-g2-mem2x' }
    maxRetries = 1
    clusterOptions = { "-A srlab" }
    scratch = '/gscratch/scrubbed/srlab/'
}

Carson Miller That looks great to me!

Shelly Wanamaker Looking at the .command.run file for a task that failed, I can see it tried the ckpt resource and the second attempt tried the cpu-g2-mem2x resource, but these won't run because the time specified conflicts with the scheduled maintenance. Can I modify that parameter in the config file?

Carson Miller Yes, you should be able to modify the resources requested by a specific module in the conf/modules.config file using a withName selector:

withName: 'ASSEMBLYANNOTATE' {
    array  = 100
    cpus   = { 2 }
    memory = { 7.GB * task.attempt }
    time   = { 4.h * task.attempt }
}
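Adapting that pattern to the step that's actually timing out here would look something like the following. I'm assuming the selector name `BISMARK_ALIGN` from nf-core/methylseq's module naming, and the numbers are placeholders; check the pipeline's conf/modules.config for the exact process name:

```groovy
process {
    withName: 'BISMARK_ALIGN' {
        time   = { 24.h * task.attempt }   // keep each attempt short of the maintenance window
        memory = { 64.GB * task.attempt }
    }
}
```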

And you can set a cap like this in your nextflow.config or modules.config to make sure the job request doesn't conflict with the scheduled maintenance:

params {
    resourceLimits = [
        cpus: 16,
        memory: '200.GB',
        time: '72.h'
    ]
}

Shelly Wanamaker I do have this in my nextflow.config file (copied from yours):

params {
    config_profile_description = 'UW Hyak Roberts labs cluster profile provided by nf-core/configs.'
    config_profile_contact = 'Shelly A. Wanamaker @shellywanamaker'
    config_profile_url = 'https://faculty.washington.edu/sr320/'
    max_memory = 742.GB
    max_cpus = 40
    max_time = 72.h
}

But it seems like I need the resourceLimits parameter.

Carson Miller Yeah, there has been a recent shift away from max_memory and those other parameters in Nextflow/nf-core pipelines.

Shelly Wanamaker I just added the following modification to my nextflow.config file

params {
    config_profile_description = 'UW Hyak Roberts labs cluster profile provided by nf-core/configs.'
    config_profile_contact = 'Shelly A. Wanamaker @shellywanamaker'
    config_profile_url = 'https://faculty.washington.edu/sr320/'
    resourceLimits = [
        cpus: 16,
        memory: '150.GB',
        time: '72.h'
    ]
}

and tried resuming my pipeline, but got an "invalid input values" warning

Carson Miller Try running nextflow self-update

Shelly Wanamaker Interesting, it updated and is now running Nextflow 24.10.2, but it's still throwing the same warning.

Carson Miller My mistake, this should be in the process section and not the params section. Sorry for the confusion! https://www.nextflow.io/docs/latest/reference/process.html#resourcelimits
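Per the linked docs, the same limits just move from the params scope into the process scope, e.g.:

```groovy
process {
    resourceLimits = [
        cpus: 16,
        memory: '150.GB',
        time: '72.h'
    ]
}
```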

Shelly Wanamaker Oh, that makes sense! Thank you so much for your help with this!

Carson Miller Not a problem! Hopefully this will allow the pipeline to work correctly for you!

Shelly Wanamaker yes! no more warning

kubu4 commented 2 hours ago

Cool! Thanks for all of this!!!