Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

segmentation fault after ctg_graph was done #153

Closed HippoYI closed 1 year ago

HippoYI commented 1 year ago

Describe the bug I am assembling a genome of about 300 Mb (0.6% heterozygosity rate) on a machine with 512 GB of memory. The ultra-long read coverage is about 27X.

Error message The program ran well and produced nd.asm.p.fasta after running ctg_graph, but then it stopped and reported "Segmentation fault (core dumped)". As a result, the "02.ctg_align" and "03.ctg_cns" steps never ran. I have tried many parameter settings in run.cfg and even switched to a machine with 2 TB of memory, but the error still occurred at the same point.

Input data Total base count=8358015912bp, sequencing depth=27X, average/N50 read length=100709

Config file

[General]
job_type = local
job_prefix = nextDenovo
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 2
input_type = raw
read_type = ont
input_fofn = input.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 1k
genome_size = 300m
sort_options = -m 40g -t 5
minimap2_options_raw = -t 5
pa_correction = 5
correction_options = -p 4

[assemble_option]
minimap2_options_cns = -t 5
nextgraph_options = -a 1 -q 10

Operating system CentOS Linux release 7.9.2009

GCC

Python Python 2.7.5 and Python 3.6.2

NextDenovo 2.5.0

The FAQ mentions that nd.asm.p.fasta contains more structural and base errors than nd.asm.fasta, so I really want to solve this. Any ideas or suggestions on how to fix this problem?

Thank you!

moold commented 1 year ago

Could you share the failed subtask log here?

HippoYI commented 1 year ago

I have attached the run log and the *.e file from the "ctg_graph1" directory, which belongs to the last (failed) subtask. I am not sure whether that is what you need; if not, please let me know. nextDenovo.sh.e.txt pid6864.log.txt

moold commented 1 year ago

See the instructions below: Error message Paste the complete log message, including the main task log and the failed subtask log. The main task log is usually located in your working directory and is named pidXXX.log.info. Its last few lines tell you where the failed subtask log is, for example:

[ERROR] 2020-07-01 11:06:57,184 cns_align failed: please check the following logs:
[ERROR] 2020-07-01 11:06:57,185 ~/NextDenovo/test_data/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align0/nextDenovo.sh.e
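
Editorial aside: the lookup described above (scan the main task log for [ERROR] lines and pull out the subtask log path) can be sketched in a few lines of Python. This is only an illustration and not part of NextDenovo. The function name failed_subtask_logs is hypothetical, and the sketch assumes that error lines follow the format of the two example lines above (optionally with a process ID inside the brackets, as in other NextDenovo logs) and that subtask stderr logs end in ".sh.e".

```python
import re

# Matches "[ERROR] <date> <time> <message>" and "[<pid> ERROR] <date> <time> <message>".
# The line layout is an assumption based on the log excerpts in this thread.
ERROR_LINE = re.compile(r"^\[(?:\d+\s+)?ERROR\]\s+\S+\s+\S+\s+(.*)$")


def failed_subtask_logs(log_text):
    """Return the subtask log paths reported on [ERROR] lines of a main task log."""
    paths = []
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if not match:
            continue
        message = match.group(1).strip()
        # Assumption: a subtask stderr log path ends in ".sh.e".
        if message.endswith(".sh.e"):
            paths.append(message)
    return paths
```

Reading the main log into a string (for example with open("pidXXX.log.info").read(), using the real file name) and passing it to failed_subtask_logs would list the files worth attaching to a bug report.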

HippoYI commented 1 year ago

Since I didn't save the screen output last time, I reran the program over the last 2 days. As you can see in "snapshot.jpg", the subtask did not give any error message, just "Segmentation fault (core dumped)" after ctg_graph was done.

snapshot

moold commented 1 year ago

Hi, actually, you don't have to rerun the whole process; just see here to continue running unfinished tasks.

As for the segmentation fault, I guess it is caused by the calgs function in the file lib/kit.py, so you can replace this function with the following Python code:

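def calgs(infile):
    """Return the total sequence length (in bases) of a FASTA file."""
    # Requires Biopython to be installed.
    from Bio import SeqIO
    gs = 0
    for seq_record in SeqIO.parse(infile, "fasta"):
        gs += len(seq_record.seq)
    return gs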
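
Editorial aside: the replacement above depends on Biopython. Where Biopython is not available, the same total can be computed with the standard library alone. The sketch below is an illustration, not the maintainer's suggestion. The name calgs_plain is hypothetical, and the code assumes an uncompressed, plain-text FASTA file.

```python
def calgs_plain(infile):
    """Sum the sequence lengths in a plain-text FASTA file (no third-party imports)."""
    total = 0
    with open(infile) as handle:
        for line in handle:
            # Header lines start with ">"; every other line holds sequence characters.
            if not line.startswith(">"):
                total += len(line.strip())
    return total
```

Because it reads one line at a time, memory use stays small even for large assemblies.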

HippoYI commented 1 year ago

Hi, I replaced the calgs function in kit.py and got this output:

[56473 INFO] 2022-09-07 15:27:58 skip step: db_split
[56473 INFO] 2022-09-07 15:27:58 skip step: raw_align
[56473 INFO] 2022-09-07 15:27:58 skip step: sort_align
[56473 INFO] 2022-09-07 15:27:58 skip step: seed_cns
[56473 INFO] 2022-09-07 15:27:58 seed_cns finished, and final corrected reads file:
[56473 INFO] 2022-09-07 15:27:58 /data/yixin/projects/JH_genome_analysis/New_genome_assembly_related/NextD-assembly/./01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns*/cns.fasta
[56473 INFO] 2022-09-07 15:27:58 skip step: cns_align
[56473 INFO] 2022-09-07 15:27:58 skip step: ctg_graph
Segmentation fault (core dumped)

moold commented 1 year ago

OK. Next, try changing two lines in the file nextDenovo. Change this line:

total_seed_len = cal_total_seed_len(get_seed_files(idx=True))

to:

total_seed_len = 1000

and change this line:

minlen = cal_minlen_from_idx(part_idx_files, len(part_idx_files), gs * mindepth - total_seed_len)

to:

minlen = 2000

HippoYI commented 1 year ago

Wow, great! It worked after changing those two lines, and now I can finally get "nd.asm.fasta". I am just curious about the changes: will fixing the total seed length at 1000 affect the final contig correction?

moold commented 1 year ago

For your data, it should not.

HippoYI commented 1 year ago

Thanks so much. I really appreciate your help in resolving this!