Nextomics / NextPolish

Fast and accurate polishing of genomes assembled from long reads.
GNU General Public License v3.0

NextPolish stuck on/after step 6 merging bams when polishing with long and short reads. #58

Closed Rob-murphys closed 3 years ago

Rob-murphys commented 4 years ago

Describe the bug

NextPolish stuck on/after step 6 merging bams when polishing with long and short reads.

Error message

No specific error message, just a failure to move past this stage and complete the job, having previously done so with this exact same command and genome. Excerpt from the log:


[INFO] 2020-11-06 08:31:14,578 index_genome done
[INFO] 2020-11-06 08:31:14,580 analysis tasks done
[INFO] 2020-11-06 08:31:14,600 total jobs: 6
[INFO] 2020-11-06 08:31:14,600 Throw jobID:[27884] jobCmd:[/home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/03.map.ref.sh.work/map_genome0/RB110-1.nextPolish.sh] in the local_cycle.
[INFO] 2020-11-06 08:31:15,102 Throw jobID:[27918] jobCmd:[/home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/03.map.ref.sh.work/map_genome1/RB110-1.nextPolish.sh] in the local_cycle.
[INFO] 2020-11-06 08:31:15,604 Throw jobID:[27952] jobCmd:[/home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/03.map.ref.sh.work/map_genome2/RB110-1.nextPolish.sh] in the local_cycle.
[INFO] 2020-11-06 08:31:16,106 Throw jobID:[27986] jobCmd:[/home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/03.map.ref.sh.work/map_genome3/RB110-1.nextPolish.sh] in the local_cycle.
[INFO] 2020-11-06 08:31:16,609 Throw jobID:[28021] jobCmd:[/home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/03.map.ref.sh.work/map_genome4/RB110-1.nextPolish.sh] in the local_cycle.
[INFO] 2020-11-06 08:31:17,111 Throw jobID:[28055] jobCmd:[/home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/03.map.ref.sh.work/map_genome5/RB110-1.nextPolish.sh] in the local_cycle.
[INFO] 2020-11-06 08:31:20,123 align_genome done
[INFO] 2020-11-06 08:31:20,128 analysis tasks done
[INFO] 2020-11-06 08:31:25,140 total jobs: 1
[INFO] 2020-11-06 08:31:25,141 Throw jobID:[28091] jobCmd:[/home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/04.merge.bam.sh.work/merge_bam0/RB110-1.nextPolish.sh] in the local_cycle.
[INFO] 2020-11-06 08:31:26,147 merge_bam done
[INFO] 2020-11-06 08:31:26,150 analysis tasks done
[INFO] 2020-11-06 08:31:26,169 total jobs: 6
[INFO] 2020-11-06 08:31:26,170 Throw jobID:[28126] jobCmd:[/home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/05.polish.ref.sh.work/polish_genome0/RB110-1.nextPolish.sh] in the local_cycle.
[INFO] 2020-11-06 08:31:26,672 Throw jobID:[28150] jobCmd:[/home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/05.polish.ref.sh.work/polish_genome1/RB110-1.nextPolish.sh] in the local_cycle.
[INFO] 2020-11-06 08:31:27,175 Throw jobID:[28191] jobCmd:[/home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/05.polish.ref.sh.work/polish_genome2/RB110-1.nextPolish.sh] in the local_cycle.
[INFO] 2020-11-06 08:31:27,676 Throw jobID:[28212] jobCmd:[/home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/05.polish.ref.sh.work/polish_genome3/RB110-1.nextPolish.sh] in the local_cycle.
[INFO] 2020-11-06 08:31:28,178 Throw jobID:[28231] jobCmd:[/home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/05.polish.ref.sh.work/polish_genome4/RB110-1.nextPolish.sh] in the local_cycle.
[INFO] 2020-11-06 08:31:28,681 Throw jobID:[28250] jobCmd:[/home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/05.polish.ref.sh.work/polish_genome5/RB110-1.nextPolish.sh] in the local_cycle.

Operating system

LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.7.1908 (Core)
Release:        7.7.1908
Codename:       Core

GCC

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)

Python

Python 3.7.3

NextPolish

nextPolish v1.3.0

To Reproduce (Optional)

python ~/path/to/NextPolish/nextPolish run_RB110-1.cfg

.cfg file:

[General]
  job_type = local  #job type, local means that a single node is running. Since NextPolish uses DRMAA for task delivery, it also supports SGE, PBS and SLURM
  job_prefix = RB110-1.nextPolish
  task = best
  rewrite = yes
  rerun = 3
  parallel_jobs = 6
  multithread_jobs = 10
  genome = /home/lamma/faststorage/RB110/canu-outputs/RB110-1_pacbio/RB110-1_pacbio.contigs.fasta
  genome_size = auto #indicates the location of the assembled genome#
  workdir = /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1 #The directory where the output file is located#
  polish_options = -p {multithread_jobs}

  [sgs_option]
  sgs_fofn = ./nextPolish_sgs/sgs-RB110-1.txt
  sgs_options = -max_depth 100 -bwa

  lgs_fofn = ./nextPolish_lgs/lgs-RB110-1.txt
  lgs_options = -min_read_len 1k -max_depth 100
  lgs_minimap2_options = -x map-ont

Additional context

This job has previously worked on this genome with no issues, and as far as I am aware nothing has changed between then and this rerun.

moold commented 4 years ago

Your RAM is not enough, try parallel_jobs = 1.
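A minimal sketch of that change in the `[General]` section of the `.cfg` (values are illustrative; peak RAM scales roughly with the number of concurrently running subtasks):

```ini
[General]
  job_type = local
  parallel_jobs = 1        ; run one subtask at a time to cap peak RAM
  multithread_jobs = 10    ; threads per subtask, unchanged
```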

Rob-murphys commented 4 years ago

I am using 80G of RAM here, that should definitely be sufficient, no? Plus this polishing has previously worked with the exact same amount of RAM on the exact same genome in the past. I can try reducing parallel_jobs though.

moold commented 4 years ago

As I said in #55, the required RAM depends on each contig length and mapping depth, and parallel_jobs, multithread_jobs and reads length.

Rob-murphys commented 4 years ago

Okay, but why would the polishing have worked previously with no changes from the current setup?

And the issue seems to be that NextPolish is not doing anything more than stalling on memory; it gets to this step 6 almost instantly but then just stays there for however long I let the job run. Or are the Throw jobID lines current jobs rather than previous ones?

moold commented 4 years ago
  1. Not all processes require maximum memory at the same time.
  2. Some sub-processes have crashed (usually caused by insufficient memory), which blocks the main process. This is a bug in Python's multiprocessing module, so you should kill the main task and just rerun it.

Rob-murphys commented 4 years ago

I have tried running with parallel_jobs = 1 and a low multithread_jobs; both variations of the NextPolish run are experiencing the same issue, unfortunately. Again, both jobs got stuck at the exact same point, just after step 6.

moold commented 4 years ago

First, check your input data and each subtask log. If there is no problem, just delete the working directory and restart from the beginning. If there are still problems, maybe I need you to share the input data with me so I can debug it. BTW, the probability of an error in the NextPolish code is very low; most errors are caused by incorrect input or configuration.

Rob-murphys commented 4 years ago

I understand, and I have tried deleting the working directory to restart. I am quite confident there are no issues with the input because, as I previously said, this exact same input has worked with these exact same commands before. I will however take a look :) Is there anything in particular I should be looking out for?

Rob-murphys commented 4 years ago

The working directory always consists of these exact same files and subdirectories, if that helps debug:

00.lgs_polish  02.score_chain  input.lgspart.000.fasta.gz  input.lgspart.002.fasta.gz  input.lgspart.004.fasta.gz  input.sgspart.000.fasta.gz  input.sgspart.002.fasta.gz  input.sgspart.004.fasta.gz
01.lgs_polish  03.kmer_count   input.lgspart.001.fasta.gz  input.lgspart.003.fasta.gz  input.lgspart.005.fasta.gz  input.sgspart.001.fasta.gz  input.sgspart.003.fasta.gz  input.sgspart.005.fasta.gz

moold commented 4 years ago

You should check each subtask log file. For example, for the subtask /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/04.merge.bam.sh.work/merge_bam0/RB110-1.nextPolish.sh, its log file is /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/04.merge.bam.sh.work/merge_bam0/RB110-1.nextPolish.sh.e.
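Those `.sh.e` files can be enumerated in one pass rather than opened one by one. A sketch of doing so, using a mock directory layout in place of the real workdir from the `.cfg`:

```shell
# Mock of the real NextPolish working directory (substitute your own workdir).
WORKDIR=$(mktemp -d)
mkdir -p "$WORKDIR/03.kmer_count/04.merge.bam.sh.work/merge_bam0"
echo "demo stderr" > "$WORKDIR/03.kmer_count/04.merge.bam.sh.work/merge_bam0/RB110-1.nextPolish.sh.e"

# Print the tail of every subtask stderr log under the working directory;
# the last lines usually carry the error, if any.
find "$WORKDIR" -name '*.sh.e' | while read -r log; do
    printf '== %s ==\n' "$log"
    tail -n 5 "$log"
done
```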

Rob-murphys commented 4 years ago

Ah, sorry, I did not know to look there!

The output looks odd here:

hostname
+ hostname
cd /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/04.merge.bam.sh.work/merge_bam0
+ cd /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/04.merge.bam.sh.work/merge_bam0
time /faststorage/project/local-blast/termitoV2/scripts/assembly_annotation/NextPolish/bin/samtools merge -f -b /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/sgs.sort.bam.list --threads 5 -|/faststorage/project/local-blast/termitoV2/scripts/assembly_annotation/NextPolish/bin/samtools markdup --threads 5 -r -s - /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/sgs.sort.bam
+ /faststorage/project/local-blast/termitoV2/scripts/assembly_annotation/NextPolish/bin/samtools merge -f -b /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/sgs.sort.bam.list --threads 5 -
+ /faststorage/project/local-blast/termitoV2/scripts/assembly_annotation/NextPolish/bin/samtools markdup --threads 5 -r -s - /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/sgs.sort.bam
READ 0 WRITTEN 0
EXCLUDED 0 EXAMINED 0
PAIRED 0 SINGLE 0
DULPICATE PAIR 0 DUPLICATE SINGLE 0
DUPLICATE TOTAL 0

real    0m0.024s
user    0m0.009s
sys     0m0.011s
time /faststorage/project/local-blast/termitoV2/scripts/assembly_annotation/NextPolish/bin/samtools index -@ 10 /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/sgs.sort.bam
+ /faststorage/project/local-blast/termitoV2/scripts/assembly_annotation/NextPolish/bin/samtools index -@ 10 /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/sgs.sort.bam

real    0m0.105s
user    0m0.003s
sys     0m0.022s
touch /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/04.merge.bam.sh.work/merge_bam0/RB110-1.nextPolish.sh.done
+ touch /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/04.merge.bam.sh.work/merge_bam0/RB110-1.nextPolish.sh.done

From what I think I am looking at, no reads are being read?
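The `READ 0 WRITTEN 0` summary above does suggest the merged stream reaching `samtools markdup` was empty. A small sketch of spotting that condition by parsing the stats block in a `.sh.e` log (the here-doc mimics the output shown above):

```shell
# Detect an empty merge by reading the READ count from a markdup summary.
log=$(mktemp)
cat > "$log" <<'EOF'
READ 0 WRITTEN 0
EXCLUDED 0 EXAMINED 0
PAIRED 0 SINGLE 0
EOF
reads=$(awk '/^READ /{print $2; exit}' "$log")
if [ "$reads" -eq 0 ]; then
    echo "empty merge: the upstream mapping step produced no alignments"
fi
```

A zero READ count here points the investigation at the previous step (the `03.map.ref` mapping jobs) rather than at the merge itself.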

moold commented 4 years ago

You should check each subtask log file, such as /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/03.map.ref.sh.work/map_genome*/RB110-1.nextPolish.sh.e. All subtasks are listed in the main log, on the lines containing the keyword Throw jobID.

Rob-murphys commented 4 years ago

Is the log file shown above normal, though? I am not sure what I am meant to be seeing in each log file, which makes it kind of hard to debug.

moold commented 4 years ago

No, but the problem may be caused by its previous step. I think maybe you can run NextPolish with the test data and compare its normal log with yours.

Rob-murphys commented 4 years ago

I have found the issue I think :)

less /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1-OLD/03.kmer_count/03.map.ref.sh.work/map_genome5/RB110-1.nextPolish.sh.e

hostname
+ hostname
cd /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/03.map.ref.sh.work/map_genome5
+ cd /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/03.map.ref.sh.work/map_genome5
time /faststorage/project/local-blast/termitoV2/scripts/assembly_annotation/NextPolish/bin/bwa mem -p -t 10 /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/input.genome.fasta.sgs /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/input.sgspart.005.fastq.gz|/faststorage/project/local-blast/termitoV2/scripts/assembly_annotation/NextPolish/bin/samtools view --threads 5 -F 0x4 -b - |/faststorage/project/local-blast/termitoV2/scripts/assembly_annotation/NextPolish/bin/samtools fixmate -m --threads 5 - - |/faststorage/project/local-blast/termitoV2/scripts/assembly_annotation/NextPolish/bin/samtools sort - -m 2g --threads 5 -o sgs.part005.sort.bam
+ /faststorage/project/local-blast/termitoV2/scripts/assembly_annotation/NextPolish/bin/bwa mem -p -t 10 /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/input.genome.fasta.sgs /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/input.sgspart.005.fastq.gz
+ /faststorage/project/local-blast/termitoV2/scripts/assembly_annotation/NextPolish/bin/samtools view --threads 5 -F 0x4 -b -
+ /faststorage/project/local-blast/termitoV2/scripts/assembly_annotation/NextPolish/bin/samtools fixmate -m --threads 5 - -
+ /faststorage/project/local-blast/termitoV2/scripts/assembly_annotation/NextPolish/bin/samtools sort - -m 2g --threads 5 -o sgs.part005.sort.bam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[E::main_mem] fail to open file `/home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/input.sgspart.005.fastq.gz'.

real    0m0.065s
user    0m0.006s
sys     0m0.035s
touch /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/03.map.ref.sh.work/map_genome5/RB110-1.nextPolish.sh.done
+ touch /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/03.kmer_count/03.map.ref.sh.work/map_genome5/RB110-1.nextPolish.sh.done

Unsure why this is happening though, as this is not a file I am pointing NextPolish to. My sgs-RB110-1.txt contains: /home/lamma/faststorage/RB110/clean_short-reads_Q30/RB110-1-cleanQ30.fastq
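One detail visible in the thread itself: the workdir listing earlier shows input.sgspart.*.fasta.gz files, while bwa here fails to open input.sgspart.005.fastq.gz. A minimal sketch of checking for that kind of extension mismatch (paths are mocked; substitute the real workdir):

```shell
# Mock workdir reproducing the situation: a .fasta.gz exists where the
# failing bwa command expects a .fastq.gz.
workdir=$(mktemp -d)
touch "$workdir/input.sgspart.005.fasta.gz"   # file that actually exists
wanted="$workdir/input.sgspart.005.fastq.gz"  # path from the bwa error
if [ ! -e "$wanted" ]; then
    echo "missing: $wanted"
    ls "$workdir"/input.sgspart.005.*         # shows the .fasta.gz sibling
fi
```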

Rob-murphys commented 4 years ago

@moold I have run on the test data, and this mapping step works in each of the 00.lgs_polish, 01.score_chain and 02.kmer_count output directories there. What could be causing NextPolish to look in /home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1 here instead of the workdir?

moold commented 4 years ago

If your short reads file is single-end, you need to use sgs_options = -unpaired; otherwise, you should list both paired-end files in sgs-RB110-1.txt.
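For the paired-end case, the fofn listed by sgs_fofn would simply name both read files, one per line (the paths below are hypothetical placeholders):

```
/path/to/reads_R1.fastq.gz
/path/to/reads_R2.fastq.gz
```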

Rob-murphys commented 4 years ago

I have included this now in the .cfg file:

[sgs_option]
  sgs_fofn = ./nextPolish_sgs/sgs-$sample.txt
  sgs_options = -max_depth 100 -bwa -unpaired

The same issue is still occurring:

[E::main_mem] fail to open file '/home/lamma/faststorage/RB110/nextPolish-ouput/RB110-1/input.sgspart.000.fastq.gz'.

moold commented 4 years ago

Are you sure that the short reads are single-end? Short reads are usually paired-end. BTW, I found the -unpaired option has a bug, and I will fix it in the next release.

Rob-murphys commented 4 years ago

They are merged paired-end reads, so I believe you would just call that single-end?

moold commented 4 years ago

Hi, I have released a new version which fixes this bug; you can try it.