Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

NextDenovo stops running but doesn't exit when a sub-job fails. #78

Closed TypicalSEE closed 3 years ago

TypicalSEE commented 3 years ago

Describe the bug Hi, I'm using NextDenovo to assemble a simple haploid plant genome. I submitted the NextDenovo script to a 96-thread server (local mode in run.cfg). After running for 22 hours, NextDenovo appears to have stalled: all of its Python processes sit at 0% CPU and have stayed that way for a long time. The task is still in the SGE task list, but the log contains no error messages. Its last lines say that 8 jobs were thrown in the local cycle, so I checked the files in the working directory by hand. In the last stage (03.ctg_graph/03.ctg_cns.sh.work/ctg_cns*), ctg_cns1 failed, and I can see core.43747 and core.43762 there. However, NextDenovo neither re-ran this sub-job nor exited.
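The manual check described above (looking for core dumps and completion markers in each sub-job directory) can be sketched in Python. This is only an illustration, not part of NextDenovo: the function names are hypothetical, and it assumes that a core dump is any file whose name starts with "core." and that a finished sub-job leaves a file whose name ends with ".done" in its own directory.

```python
from pathlib import Path


def scan_subjobs(work_dir):
    """Return a status dict for every sub-job directory under work_dir.

    Assumption: core dumps are named "core.<pid>" and a completed
    sub-job has a file ending in ".done" inside its directory.
    """
    report = {}
    for sub in sorted(Path(work_dir).iterdir()):
        if not sub.is_dir():
            continue
        names = [p.name for p in sub.iterdir() if p.is_file()]
        report[sub.name] = {
            "core_dumps": sorted(n for n in names if n.startswith("core.")),
            "done": any(n.endswith(".done") for n in names),
        }
    return report


def suspicious(report):
    """List sub-jobs that left a core dump or have no .done marker."""
    return [
        name
        for name, status in report.items()
        if status["core_dumps"] or not status["done"]
    ]
```

Calling `suspicious(scan_subjobs("03.ctg_graph/03.ctg_cns.sh.work"))` from the working directory would then list the sub-jobs worth inspecting. Note that sub-jobs that are still running also lack a .done marker, so the result is a list of candidates rather than confirmed failures.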

Error message pid4820.log.info.txt (main log) bryophytes.sh.e.txt (log of the failed sub-job)

The last few lines of the main log:

... ...
[INFO] 2020-07-21 06:42:06,183 ctg_align done
[INFO] 2020-07-21 06:42:06,792 analysis tasks done
[INFO] 2020-07-21 06:42:07,316 total jobs: 8
[INFO] 2020-07-21 06:42:07,318 Throw jobID:[43620] jobCmd:[/vol3/liuyang_group/yujin/projects/ont/893/01.nextdenovo.retry/workdir/03.ctg_graph/03.ctg_cns.sh.work/ctg_cns0/bryophytes.sh] in the local_cycle.
[INFO] 2020-07-21 06:42:07,822 Throw jobID:[43625] jobCmd:[/vol3/liuyang_group/yujin/projects/ont/893/01.nextdenovo.retry/workdir/03.ctg_graph/03.ctg_cns.sh.work/ctg_cns1/bryophytes.sh] in the local_cycle.
[INFO] 2020-07-21 06:42:08,327 Throw jobID:[43638] jobCmd:[/vol3/liuyang_group/yujin/projects/ont/893/01.nextdenovo.retry/workdir/03.ctg_graph/03.ctg_cns.sh.work/ctg_cns2/bryophytes.sh] in the local_cycle.
[INFO] 2020-07-21 06:42:08,831 Throw jobID:[43643] jobCmd:[/vol3/liuyang_group/yujin/projects/ont/893/01.nextdenovo.retry/workdir/03.ctg_graph/03.ctg_cns.sh.work/ctg_cns3/bryophytes.sh] in the local_cycle.
[INFO] 2020-07-21 06:42:09,335 Throw jobID:[43648] jobCmd:[/vol3/liuyang_group/yujin/projects/ont/893/01.nextdenovo.retry/workdir/03.ctg_graph/03.ctg_cns.sh.work/ctg_cns4/bryophytes.sh] in the local_cycle.
[INFO] 2020-07-21 06:42:09,839 Throw jobID:[43653] jobCmd:[/vol3/liuyang_group/yujin/projects/ont/893/01.nextdenovo.retry/workdir/03.ctg_graph/03.ctg_cns.sh.work/ctg_cns5/bryophytes.sh] in the local_cycle.
[INFO] 2020-07-21 06:42:10,344 Throw jobID:[43658] jobCmd:[/vol3/liuyang_group/yujin/projects/ont/893/01.nextdenovo.retry/workdir/03.ctg_graph/03.ctg_cns.sh.work/ctg_cns6/bryophytes.sh] in the local_cycle.
[INFO] 2020-07-21 06:42:10,846 Throw jobID:[43663] jobCmd:[/vol3/liuyang_group/yujin/projects/ont/893/01.nextdenovo.retry/workdir/03.ctg_graph/03.ctg_cns.sh.work/ctg_cns7/bryophytes.sh] in the local_cycle.
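To map the job IDs in a log like the one above back to their sub-job scripts, the "Throw jobID" entries can be pulled out with a regular expression. The following is a rough sketch: the pattern is inferred only from the lines quoted in this issue, so it may not match other NextDenovo versions, and `thrown_jobs` is a hypothetical helper name.

```python
import re

# Pattern inferred from the log lines quoted in this issue (an assumption,
# not a documented NextDenovo log format).
THROW_RE = re.compile(r"Throw jobID:\[(?P<job_id>\d+)\] jobCmd:\[(?P<cmd>[^\]]+)\]")


def thrown_jobs(log_text):
    """Return (job_id, command) pairs for every 'Throw jobID' entry."""
    return [
        (int(m.group("job_id")), m.group("cmd"))
        for m in THROW_RE.finditer(log_text)
    ]
```

Because `finditer` scans the whole string, this works whether the log entries are on separate lines or run together on a single line. The extracted IDs (for example 43625 for ctg_cns1) can then be compared with the processes that are still alive on the server.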

Python processes of NextDenovo ctg_cns.py (screenshot attachment, untitled)

Genome characteristics A haploid plant genome. Estimated genome size is ~800Mb.

Input data Total ONT reads is 82Gb. Sequencing depth is ~100X. Reads N50 is 22715.

Operating system CentOS Linux release 7.5.1804 (Core)

GCC 7.4.0 (manually installed in the user directory; main executables are in PATH and LD_LIBRARY_PATH is set). 4.8.5 (system default, not used).

Python 2.7.15 and 3.6.5

NextDenovo 2.3.0

Additional context (Optional) I also tried running NextDenovo in SGE mode. Every time, it reports that a sub-job failed and exits, but the corresponding sub-job log contains no error message and the sub-job's .done file exists. So I have to run it again and let NextDenovo resume from where it exited. It usually takes 3 to 5 runs to finish the whole task.

moold commented 3 years ago

Did you rerun the failed jobs? It seems that some child processes failed, which blocked the parent process. I am not sure whether this is caused by insufficient RAM or by a bug. If you rerun a failed job and it gets blocked again, you can extract the unfinished contigs and the corresponding BAM file and send them to me; I am happy to debug it.
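As a general illustration of how a parent process can notice a crashed or stalled child instead of blocking on it, here is a minimal Python sketch. It is not NextDenovo's scheduler code; `run_checked` is a hypothetical helper, and reading a negative return code as "killed by a signal" assumes a POSIX system such as the CentOS server above.

```python
import signal
import subprocess


def run_checked(cmd, timeout=None):
    """Run cmd and return (ok, description) rather than waiting forever.

    timeout is in seconds; None means wait without a limit.
    """
    try:
        proc = subprocess.run(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return False, "timed out after %s s" % timeout
    if proc.returncode == 0:
        return True, "exited normally"
    if proc.returncode < 0:
        # On POSIX, a negative return code is the number of the signal
        # that terminated the child (for example SIGSEGV for a crash
        # that leaves a core dump).
        name = signal.Signals(-proc.returncode).name
        return False, "killed by signal %s" % name
    return False, "exit status %d" % proc.returncode
```

A wrapper like this makes a crash (core dump) or a stall visible to the caller, which can then decide whether to retry the sub-job or stop the whole run.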

moold commented 3 years ago

Hi, if you still have this problem, you can try to download this file and replace lib/ctg_cns.so with it.

TypicalSEE commented 3 years ago

Thanks. I re-ran the main script and NextDenovo finished successfully. But another problem is that running in SGE mode fails every time. The log says 01.raw_align/03.sort_align.sh.work/sort_align1 failed, while the output files seem OK and the .done file exists. I think it might be a bug.

moold commented 3 years ago

Yes, this is a known bug; I will fix it in the next release. Thank you for your feedback.