Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

Error using read_type = hifi and fastq.gz file #123

Closed davised closed 2 years ago

davised commented 2 years ago

Describe the bug

The nd.asm.fasta file is empty, and the run ends with the IndexError shown below.

Error message

The pid log did not provide any error message, but I got this on stderr because the files are empty:

Traceback (most recent call last):
  File "/local/cluster/bin/nextDenovo", line 856, in <module>
    main(args)
  File "/local/cluster/bin/nextDenovo", line 827, in main
    asm, stat = gather_ctg_cns_output(cfg, task.subtasks, seq_info)
  File "/local/cluster/bin/nextDenovo", line 291, in gather_ctg_cns_output
    out = cal_n50_info(stat, asm + '.stat')
  File "/local/cluster/NextDenovo-2.4.0/lib/kit.py", line 171, in cal_n50_info
    out += "%-5s %18d%20s\n" % ("Min.", stat[-1], '-')
IndexError: list index out of range
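
The traceback shows cal_n50_info indexing stat[-1] on an empty list, which is what happens when no contigs were assembled. A minimal sketch of a guard for that case follows. Note that n50_summary is a hypothetical helper written for illustration, not NextDenovo's actual code, and its return shape is an assumption.

```python
def n50_summary(lengths):
    """Return (n50, min, max, total) for contig lengths, or None if there are none."""
    if not lengths:
        # An empty assembly has no statistics; report that instead of indexing [-1].
        return None
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        # N50 is the length of the contig at which the running sum reaches half the total.
        if running * 2 >= total:
            n50 = length
            break
    return n50, ordered[-1], ordered[0], total
```

With a check like this, the caller could print a clear "assembly is empty" message rather than failing with an IndexError.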

Config file

$ cat run.cfg
[General]
job_type = local
job_prefix = nextDenovo
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = no
rerun = 3
parallel_jobs = 20
input_type = corrected # raw, corrected
read_type = hifi # clr, ont, hifi
input_fofn = ./input.fofn
workdir = ./01_rundir
# usetempdir = /data/davised/nextDenovo

[correct_option]
read_cutoff = 1k
genome_size = 100m # estimated genome size
pa_correction = 3
sort_options = -m 20g -t 20
minimap2_options_raw =  -t 8
correction_options = -p 15

[assemble_option]
minimap2_options_cns = -t 8 -k17 -w17
nextgraph_options = -a 1

Input fofn

$ cat input.fofn
m64047_210502_022250.ccs.fastq.gz

Operating system

$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.2.1511 (Core)
Release:    7.2.1511
Codename:   Core

GCC

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/local/cluster/centos/devtoolset-7/root/usr/bin/../libexec/gcc/x86_64-redhat-linux/7/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-7/root/usr --mandir=/opt/rh/devtoolset-7/root/usr/share/man --infodir=/opt/rh/devtoolset-7/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --enable-plugin --with-linker-hash-style=gnu --enable-initfini-array --with-default-libstdcxx-abi=gcc4-compatible --with-isl=/builddir/build/BUILD/gcc-7.2.1-20170829/obj-x86_64-redhat-linux/isl-install --enable-libmpx --enable-gnu-indirect-function --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 7.2.1 20170829 (Red Hat 7.2.1-1) (GCC)

Python

$ python3 --version
Python 3.7.2

NextDenovo

$ nextDenovo -v
nextDenovo v2.4.0

To Reproduce (Optional)

Use an input fastq.gz hifi file and you will receive the same IndexError.

I think two checks would help catch this type of issue in the future: one that the split fastas in the 01.split_seed.sh.work/split_seed0 folder are valid, and another that the output files in the 02.cns_align.sh.work folder each contain something after alignment (unless 0-size files are expected in some cases).
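
The two checks suggested above could look roughly like the sketch below. Both function names and the min_bytes threshold are assumptions made for illustration; nothing here is part of NextDenovo, and a sensible threshold would depend on which outputs may legitimately be empty.

```python
from pathlib import Path


def starts_like_fasta(path):
    """Return True if the first byte of an uncompressed file is '>'."""
    with open(path, "rb") as handle:
        return handle.read(1) == b">"


def find_small_files(workdir, pattern, min_bytes=1):
    """Return files under workdir matching pattern that are smaller than min_bytes."""
    return sorted(
        path
        for path in Path(workdir).rglob(pattern)
        if path.is_file() and path.stat().st_size < min_bytes
    )
```

For example, find_small_files("01_rundir/02.cns_align/02.cns_align.sh.work", "*.ovl", min_bytes=3) would flag the 2-byte cns.filt.dovt.ovl files in the listing below.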

Additional context (Optional)

output folder structure & sizes

01_rundir
├── [   6]  01.raw_align
│   ├── [ 234]  01.db_stat.sh
│   ├── [   0]  01.db_stat.sh.done
│   ├── [   3]  01.db_stat.sh.work
│   │   └── [   6]  db_stat0
│   │       ├── [ 504]  nextDenovo.sh
│   │       ├── [   0]  nextDenovo.sh.done
│   │       ├── [1013]  nextDenovo.sh.e
│   │       └── [  30]  nextDenovo.sh.o
│   └── [2.5K]  input.reads.stat
├── [   8]  02.cns_align
│   ├── [ 161]  01.split_seed.sh
│   ├── [   0]  01.split_seed.sh.done
│   ├── [   3]  01.split_seed.sh.work
│   │   └── [  18]  split_seed0
│   │       ├── [7.3G]  cns0.fasta
│   │       ├── [ 16K]  cns0.fasta.idx
│   │       ├── [7.0G]  cns1.fasta
│   │       ├── [ 16K]  cns1.fasta.idx
│   │       ├── [7.3G]  cns2.fasta
│   │       ├── [ 16K]  cns2.fasta.idx
│   │       ├── [7.3G]  cns3.fasta
│   │       ├── [ 16K]  cns3.fasta.idx
│   │       ├── [7.5G]  cns4.fasta
│   │       ├── [ 16K]  cns4.fasta.idx
│   │       ├── [7.1G]  cns5.fasta
│   │       ├── [ 16K]  cns5.fasta.idx
│   │       ├── [ 444]  nextDenovo.sh
│   │       ├── [   0]  nextDenovo.sh.done
│   │       ├── [1.2K]  nextDenovo.sh.e
│   │       └── [  30]  nextDenovo.sh.o
│   ├── [8.2K]  02.cns_align.sh
│   ├── [   0]  02.cns_align.sh.done
│   └── [  23]  02.cns_align.sh.work
│       ├── [   8]  cns_align00
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 675]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.2G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align01
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 686]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.2G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align02
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 686]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.2G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align03
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 686]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.2G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align04
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 686]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.2G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align05
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 686]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.2G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align06
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 675]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.2G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align07
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 686]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.2G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align08
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 686]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.2G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align09
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 686]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.2G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align10
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 686]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.2G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align11
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 675]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.3G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align12
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 686]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.3G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align13
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 686]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.3G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align14
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 686]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.3G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align15
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 675]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.0G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align16
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 686]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.0G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align17
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 686]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.0G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align18
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 675]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.5G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       ├── [   8]  cns_align19
│       │   ├── [   2]  cns.filt.dovt.ovl
│       │   ├── [   0]  cns.filt.dovt.ovl.bl
│       │   ├── [ 686]  nextDenovo.sh
│       │   ├── [   0]  nextDenovo.sh.done
│       │   ├── [7.5G]  nextDenovo.sh.e
│       │   └── [  30]  nextDenovo.sh.o
│       └── [   8]  cns_align20
│           ├── [   2]  cns.filt.dovt.ovl
│           ├── [   0]  cns.filt.dovt.ovl.bl
│           ├── [ 675]  nextDenovo.sh
│           ├── [   0]  nextDenovo.sh.done
│           ├── [7.0G]  nextDenovo.sh.e
│           └── [  30]  nextDenovo.sh.o
└── [  15]  03.ctg_graph
    ├── [2.6K]  01.ctg_graph.input.ovls
    ├── [ 732]  01.ctg_graph.input.seqs
    ├── [ 287]  01.ctg_graph.sh
    ├── [   0]  01.ctg_graph.sh.done
    ├── [   3]  01.ctg_graph.sh.work
    │   └── [   8]  ctg_graph0
    │       ├── [   0]  nd.asm.p.fasta
    │       ├── [   1]  nd.asm.p.fasta.blc
    │       ├── [ 568]  nextDenovo.sh
    │       ├── [   0]  nextDenovo.sh.done
    │       ├── [1.9K]  nextDenovo.sh.e
    │       └── [  30]  nextDenovo.sh.o
    ├── [2.3K]  02.ctg_align.sh
    ├── [   0]  02.ctg_align.sh.done
    ├── [   8]  02.ctg_align.sh.work
    │   ├── [   8]  ctg_align0
    │   │   ├── [  92]  cns2.fasta.sort.bam
    │   │   ├── [  16]  cns2.fasta.sort.bam.bai
    │   │   ├── [ 680]  nextDenovo.sh
    │   │   ├── [   0]  nextDenovo.sh.done
    │   │   ├── [1.4K]  nextDenovo.sh.e
    │   │   └── [  30]  nextDenovo.sh.o
    │   ├── [   8]  ctg_align1
    │   │   ├── [  92]  cns3.fasta.sort.bam
    │   │   ├── [  16]  cns3.fasta.sort.bam.bai
    │   │   ├── [ 680]  nextDenovo.sh
    │   │   ├── [   0]  nextDenovo.sh.done
    │   │   ├── [1.4K]  nextDenovo.sh.e
    │   │   └── [  30]  nextDenovo.sh.o
    │   ├── [   8]  ctg_align2
    │   │   ├── [  92]  cns0.fasta.sort.bam
    │   │   ├── [  16]  cns0.fasta.sort.bam.bai
    │   │   ├── [ 680]  nextDenovo.sh
    │   │   ├── [   0]  nextDenovo.sh.done
    │   │   ├── [1.4K]  nextDenovo.sh.e
    │   │   └── [  30]  nextDenovo.sh.o
    │   ├── [   8]  ctg_align3
    │   │   ├── [  92]  cns1.fasta.sort.bam
    │   │   ├── [  16]  cns1.fasta.sort.bam.bai
    │   │   ├── [ 680]  nextDenovo.sh
    │   │   ├── [   0]  nextDenovo.sh.done
    │   │   ├── [1.4K]  nextDenovo.sh.e
    │   │   └── [  30]  nextDenovo.sh.o
    │   ├── [   8]  ctg_align4
    │   │   ├── [  92]  cns4.fasta.sort.bam
    │   │   ├── [  16]  cns4.fasta.sort.bam.bai
    │   │   ├── [ 680]  nextDenovo.sh
    │   │   ├── [   0]  nextDenovo.sh.done
    │   │   ├── [1.4K]  nextDenovo.sh.e
    │   │   └── [  30]  nextDenovo.sh.o
    │   └── [   8]  ctg_align5
    │       ├── [  92]  cns5.fasta.sort.bam
    │       ├── [  16]  cns5.fasta.sort.bam.bai
    │       ├── [ 680]  nextDenovo.sh
    │       ├── [   0]  nextDenovo.sh.done
    │       ├── [1.4K]  nextDenovo.sh.e
    │       └── [  30]  nextDenovo.sh.o
    ├── [ 774]  03.ctg_cns.input.bams
    ├── [1.4K]  03.ctg_cns.sh
    ├── [   0]  03.ctg_cns.sh.done
    ├── [   5]  03.ctg_cns.sh.work
    │   ├── [   7]  ctg_cns0
    │   │   ├── [   0]  nd.asm.f.part000.fasta
    │   │   ├── [ 759]  nextDenovo.sh
    │   │   ├── [   0]  nextDenovo.sh.done
    │   │   ├── [3.3K]  nextDenovo.sh.e
    │   │   └── [  30]  nextDenovo.sh.o
    │   ├── [   7]  ctg_cns1
    │   │   ├── [   0]  nd.asm.f.part001.fasta
    │   │   ├── [ 759]  nextDenovo.sh
    │   │   ├── [   0]  nextDenovo.sh.done
    │   │   ├── [3.3K]  nextDenovo.sh.e
    │   │   └── [  30]  nextDenovo.sh.o
    │   └── [   7]  ctg_cns2
    │       ├── [   0]  nd.asm.f.part002.fasta
    │       ├── [ 759]  nextDenovo.sh
    │       ├── [   0]  nextDenovo.sh.done
    │       ├── [3.3K]  nextDenovo.sh.e
    │       └── [  30]  nextDenovo.sh.o
    └── [   0]  nd.asm.fasta

Note that the cns_align subdirs in 02.cns_align.sh.work have large nextDenovo.sh.e files, and each one gives a warning about an alignment length of 0.

I traced the error back to the 01.split_seed.sh command. split_cns.py expects a fasta file, so when a fastq file is provided, the output is a mess of invalid files. I can resolve the error by converting my fastq to fasta.

$ cat 01.split_seed.sh
/local/cluster/bin/python3 /local/cluster/NextDenovo-2.4.0/lib/split_cns.py  -f /nfs1/MICRO/Bartholomew_Lab/davised/pacbio/nextdenovo/./input.fofn -l 18407 -c 6

I'm not sure whether it makes more sense to add fastq support to split_cns.py or to disallow fastq as input to the hifi workflow.
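
For anyone needing the workaround in the meantime, here is a minimal sketch of the fastq-to-fasta conversion in plain Python. It assumes strict 4-line FASTQ records (no wrapped sequence lines), and the function names are hypothetical; an existing tool such as seqtk would do the same job.

```python
import gzip


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed text file based on its extension."""
    return gzip.open(path, mode) if str(path).endswith(".gz") else open(path, mode)


def fastq_to_fasta(fastq_path, fasta_path):
    """Convert 4-line FASTQ records to FASTA and return the number of records written."""
    count = 0
    with open_text(fastq_path) as src, open_text(fasta_path, "wt") as dst:
        while True:
            header = src.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate blank lines between or after records
            sequence = src.readline()
            src.readline()  # '+' separator line, ignored
            src.readline()  # quality line, ignored
            if not header.startswith("@"):
                raise ValueError("expected '@' header, got: %r" % header)
            dst.write(">" + header[1:].rstrip("\n") + "\n")
            dst.write(sequence.rstrip("\n") + "\n")
            count += 1
    return count
```

The output path may end in .gz as well, in which case the FASTA is written compressed.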

Thanks for this software. I'm excited to compare the outputs to my other assemblies (based on what my colleagues have told me, this should compare favorably).

davised commented 2 years ago

While I was typing up this bug report the fasta version finished.

$ cat nd.asm.fasta.stat
Type           Length (bp)            Count (#)
N10             27062032                   1
N20             27062032                   1
N30             25992854                   2
N40             25992854                   2
N50             11566870                   3
N60             11193937                   4
N70              7931554                   5
N80              7209047                   6
N90              2551250                   8

Min.               24333                   -
Max.            27062032                   -
Ave.             1747497                   -
Total          108344837                  62

Very happy with these stats! I still need to make sure everything looks OK since I have imperfect inputs (not a monoculture), but I'm optimistic currently.

Cheers

moold commented 2 years ago

This is a known bug; see #119. You can convert the fastq file to a fasta file to avoid this error. I will fix it in the next release.

davised commented 2 years ago

Thanks for taking the time to respond. I look forward to your next release.

Neato-Nick commented 2 years ago

Can I use a fasta.gz or does it need to be uncompressed fasta?

davised commented 2 years ago

fasta.gz should work fine. You can always test it and then gunzip if it fails.

Neato-Nick commented 2 years ago

Confirmed that gzipped fasta does work fine.