Nextomics / NextPolish

Fast and accurate polishing of genomes generated by long reads.
GNU General Public License v3.0

NextPolish slurmstepd: error: JOB CANCELLED DUE TO TIME LIMIT #103

Closed: martinmau1 closed this issue 1 year ago

martinmau1 commented 1 year ago

Describe the bug

Hello,

I ran into a timeout problem with the SLURM routine within NextPolish. How can I allow more computing time? Is there a config file for SLURM within NextPolish, or do I alter the paralleltask cluster.cfg for that, and which setting would that be? Many thanks for your help.
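As a rough pointer for anyone hitting the same question: a minimal sketch of how the paralleltask cluster.cfg can be located, assuming paralleltask was installed via pip and ships the file inside the package; the python one-liner, the module name and the file name are assumptions, so adjust them to your installation (the later comments in this thread show which keys inside that file ended up being changed).

# print the directory of the installed paralleltask package (assumes a pip install)
python3 -c "import paralleltask, os; print(os.path.dirname(paralleltask.__file__))"
# the SLURM resource/submission keys live in cluster.cfg inside that directory (name may differ)
ls <that_directory>/cluster.cfg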

Best wishes Martin

Error message

cd /globalhome/mam880/HPC/nextpolish_out/ES612/run1/00.lgs_polish/04.polish.ref.sh.work/polish_genome01

Operating system: HPC; Gentoo Base System release 2.6

GCC: gcc/11.3.0

Python: python/3.8.10

NextPolish: v1.4.1

To Reproduce (Optional)

The run.cfg I used:

[General]
job_type = SLURM
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 12
multithread_jobs = 5
genome = /datastore/SEED1/Martin/Apomixis_Breeding/Analyses_output/ONT/nextdenovo_out/ES612/assembly_3/03.ctg_graph/01.ctg_graph.sh.work/nd.asm.fasta
genome_size = auto
workdir = ${HOME}/nextpolish_out/ES612/run1
polish_options = -p 35

[sgs_options]
sgs_fofn = ./sgs1.fofn
sgs_options = -max_depth 100 -bwa

[lgs_options]
lgs_fofn = ./lgs1.fofn
lgs_options = -min_read_len 1k -max_depth 100
lgs_minimap2_options = -x map-ont


moold commented 1 year ago

Try using your own alignment pipeline and then only use NextPolish to polish the genome, see here

PS: I don't know how to allow more computing time; maybe you need to ask your system administrator for help.
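For readers unsure what "use your own alignment pipeline" means in practice, here is a condensed sketch of the standalone short-read polishing route. The script path NextPolish/lib/nextpolish1.py and its -g/-t/-p/-s options follow my reading of the NextPolish documentation, and only a single pass of task 1 is shown; check the documentation linked above for the full two-task, multi-round recipe and the matching long-read workflow before relying on this.

# 1. Map short reads to the assembly yourself (any aligner works; bwa shown here)
bwa index genome.fa
bwa mem -t 20 genome.fa reads_R1.fq.gz reads_R2.fq.gz | samtools sort -@ 4 -o sgs.sort.bam -
samtools index sgs.sort.bam
samtools faidx genome.fa
# 2. Run only the NextPolish polishing step on the sorted BAM
#    (script path and options as described in the NextPolish docs; adjust to your install)
python NextPolish/lib/nextpolish1.py -g genome.fa -t 1 -p 20 -s sgs.sort.bam > genome.polish.fa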

martinmau1 commented 1 year ago

Hello,

thank you for your advice. I was able to fix at least the timeout issue in the paralleltask cluster.cfg file, and NextPolish with the standard run.cfg for short and long reads now continues to run. But the underlying problem seems to be that, especially in the "polish_genome" step, the program does not run properly, or at least runs very slowly: almost no CPU is used and memory usage is also pretty low (please see the attachment). Do you have any idea how I can fix this in NextPolish or paralleltask?
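When a step appears to idle like this, it can help to confirm whether the SLURM jobs themselves are actually running or merely pending. A few generic SLURM commands (standard squeue/sstat/scontrol tooling, not NextPolish-specific; <jobid> is a placeholder for one of the submitted polish jobs):

squeue -u $USER                                    # are the polish jobs RUNNING (R) or pending (PD)?
sstat -j <jobid> --format=AveCPU,MaxRSS            # CPU and memory actually consumed by a running job
scontrol show job <jobid> | grep -E 'Reason|TimeLimit|NumCPUs|MinMemory'   # why a pending job is waiting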

Again many thanks for your help

Best wishes Martin

moold commented 1 year ago

First, try killing all tasks and then rerunning; it will only continue to run the unfinished tasks.
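A minimal sketch of what "kill all tasks and rerun" could look like on SLURM, assuming the job_prefix from the run.cfg above and that NextPolish is invoked as nextPolish run.cfg; adjust the names to your setup:

squeue -u $USER -o '%i %j %T'    # list job IDs, names and states still in the queue
scancel -u $USER -n nextPolish   # cancel the submitted polishing jobs by name (job_prefix from run.cfg)
# also stop the controlling nextPolish process on the login node if it is still alive, then:
nextPolish run.cfg               # rerun; finished tasks should be skipped and unfinished ones resumed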

martinmau1 commented 1 year ago

Hello,

I have killed the jobs each time before trying a new run, and upon restart I run into the same issue: there is now plenty of time, but when I check the node usage it seems to be basically idling during the polishing step, although new files are created in each task (n=12). It seems that the step to go into the next round of polishing is missing (I use 5,5,1,2,1,2). Can it be that a program uses almost no CPU because one worker in a parallel computation is waiting for another? This could be the case here: perhaps your controller program (running on the login node) is not communicating with the submitted jobs properly.

Is there a verbose/debug log that I can check, and where can I find it / what is its name?
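While waiting for an answer on log locations, one place worth checking is the per-task work directory shown in the error message above. The file names below are assumptions (paralleltask normally leaves the generated job script plus its stdout/stderr in each task directory, but naming can vary on your install):

cd /globalhome/mam880/HPC/nextpolish_out/ES612/run1/00.lgs_polish/04.polish.ref.sh.work/
ls -lt polish_genome01/                                             # job script plus any stdout/stderr files
tail -n 50 polish_genome01/*.o* polish_genome01/*.e* 2>/dev/null    # last lines of the task logs, if such files exist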

Many thanks Martin

moold commented 1 year ago

Hi, if you are sure it is not caused by the system/computer, then you have hit a bug where some sub-processes have crashed, which blocks the main process. So, can you extract the unfinished scaffold/contig sequences and the corresponding bam and send them to me? I need to reproduce this bug to fix it.
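For anyone needing to do the same, a short sketch of how one contig and its alignments could be extracted with samtools; the contig name ctg000010 and the BAM file name are placeholders, so substitute the sequences that are actually stuck:

samtools faidx nd.asm.fasta                                # index the assembly if not already done
samtools faidx nd.asm.fasta ctg000010 > ctg000010.fa       # pull one unfinished contig (placeholder name)
samtools index lgs.sort.bam                                # region extraction needs an indexed BAM
samtools view -b lgs.sort.bam ctg000010 > ctg000010.bam    # only the reads mapped to that contig
samtools index ctg000010.bam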

martinmau1 commented 1 year ago

Hello,

it seems I have solved the issue, and the program now runs through, producing all expected output files. For that to happen I again had to alter the cluster.cfg file from paralleltask and define fixed values for 'memory-per-cpu' and 'cpus-per-task', overwriting the demand requested by the nextdenovo subroutines when this cluster.cfg file is left at default (e.g. memory-per-cpu = 8G instead of memory-per-cpu = {mem}). I also shortened the requested computing time to 1 hr instead of 48 hrs in the paralleltask cluster.cfg, as the initial time request was probably too much, so some jobs were stuck in 'PD'.
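For completeness, a sketch of what the described cluster.cfg changes might look like. The keys memory-per-cpu and cpus-per-task are quoted from the comment above; the value 5 for cpus-per-task and the name and format of the time line are assumptions, so check the keys your paralleltask version actually uses:

# fixed resources instead of the default placeholders (e.g. {mem}), as described above
memory-per-cpu = 8G
cpus-per-task = 5
# assumed key for the walltime request; the comment above reduced 48 hrs to 1 hr
time = 1:00:00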

Thanks for your help anyways! Martin