gbouras13 / hybracter

Automated long-read first bacterial genome assembly tool implemented in Snakemake using Snaketool.
MIT License

Error: the assembly_1 and assembly_2 assemblies need to contain the same number of sequences #78

Open rukjis opened 4 months ago

rukjis commented 4 months ago

Hi,

Thank you for this great tool.

I am trying to assemble a bacterial genome from Nanopore long reads and Illumina paired-end short reads using hybrid-single:

singularity exec hybracter_0.7.3.sif hybracter hybrid-single -l hybridassem/output/chopper/Iso16_chop.fastq -1 hybridassem/output/fastp/Iso16_fp_R1.fastq -2 hybridassem/output/fastp/Iso16_fp_R2.fastq -s Iso16 -o hybridassem/output/hybracter/Iso16_hybracter -t 32 --flyeModel --nano-raw

But the assembly fails with the following error:

Error: the assembly_1 and assembly_2 assemblies need to contain the same number
of sequences

The error log shows the following after this message:

Waiting at most 5 seconds for missing files.
MissingOutputException in rule compare_assemblies_medaka_round_1 in file /opt/miniforge3/lib/python3.10/site-packages/hybracter/workflow/rules/polishing/long_read_polish.smk, line 68:
Job 59  completed successfully, but some output files are missing. Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
hybridassem/output/hybracter/Iso16_hybracter/supplementary_results/comparisons/Iso16/medaka_round_1_vs_pre_polish.txt

Where should I provide --latency-wait? Am I supposed to give it a value in seconds? Or does this issue have to be fixed in another manner?

I have included the error and output log files here for your reference. Any help would be appreciated.

hybracter_err_Iso16.txt

hybracter_out_Iso16.txt

gbouras13 commented 4 months ago

Hi @rukjis,

My guess is that this is a bug in Hybracter, so there is no need to pass --latency-wait.

From the output:

Loading assemblies (2024-05-11 21:03:51)
hybridassem/output/hybracter/Iso16_hybracter/processing/complete/dnaapler/Iso16_pre_chrom/Iso16_reoriented.fasta
pre_polish  contig_5: 1,208,607 bp
pre_polish  contig_3: 1,411,941 bp
pre_polish  contig_2: 3,104,156 bp

hybridassem/output/hybracter/Iso16_hybracter/processing/complete/dnaapler/Iso16/Iso16_reoriented.fasta
medaka_round_1  contig_5: 1,208,540 bp
medaka_round_1  1: 1,292,263 bp
medaka_round_1  contig_3: 1,411,927 bp
medaka_round_1  contig_2: 3,104,021 bp

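There is some bug in the output for your data, and I'm not exactly sure why: the pre_polish assembly has three contigs, but the medaka_round_1 assembly has four (an extra 1,292,263 bp sequence named 1). I'll try to replicate the error. It would be fantastic if you could send me the reads somehow (george.bouras@adelaide.edu.au).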
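To confirm the mismatch independently of the pipeline, one option is to count the FASTA headers in the two files listed above. This is only a sketch: the helper names count_seqs and same_seq_count are made up for illustration and are not part of Hybracter.

```shell
# Hypothetical helpers (not part of Hybracter).

# Count FASTA records by counting header lines (lines starting with ">").
count_seqs() {
  grep -c '^>' "$1"
}

# Succeed only if both FASTA files contain the same number of records.
same_seq_count() {
  [ "$(count_seqs "$1")" -eq "$(count_seqs "$2")" ]
}
```

Running count_seqs on the Iso16_pre_chrom/Iso16_reoriented.fasta and Iso16/Iso16_reoriented.fasta paths from the log should report 3 and 4 sequences respectively if the files match the output shown above.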

But your output also suggests to me that your isolate's chromosome hasn't properly assembled with Flye (you can check the Flye assembly info in hybridassem/output/hybracter/Iso16_hybracter/supplementary_results/flye_individual_summaries/Iso16_assembly_info.txt ).
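A quick way to see which contigs Flye did not circularise is to filter that file on its circularity column. The sketch below assumes the standard Flye assembly_info.txt layout (tab-separated, a header line starting with "#", and a Y/N "circ." flag in the fourth column); the function name non_circular_contigs is hypothetical.

```shell
# Hypothetical helper: print name and length of every contig that Flye
# did not mark as circular.
# Assumes Flye's assembly_info.txt layout: tab-separated columns with
# seq_name in column 1, length in column 2 and circ. (Y/N) in column 4.
non_circular_contigs() {
  awk -F'\t' '!/^#/ && $4 != "Y" { print $1 "\t" $2 }' "$1"
}
```

Calling non_circular_contigs on the Iso16_assembly_info.txt path above lists the contigs worth a closer look. A large contig in that list would be consistent with an incomplete chromosome.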

I would assume the likely chromosome did not circularise and is not complete. If I had to guess, your isolate probably has a 5-6 Mbp chromosome (based on one 3 Mbp contig and two 1 Mbp contigs).

Therefore, the default -c 1000000 is too low. I'd pass -c 4000000 (under 5 Mbp) and/or --subsample_depth 200.
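As a sketch, the original command with the larger chromosome cutoff would look like the following. All paths and options are copied from the command in the first post; only -c is new. The variable name MIN_CHROM_LENGTH is just for illustration, and the leading echo makes this a dry run that prints the command rather than executing it.

```shell
# Chromosome length cutoff suggested above (the default is 1000000).
MIN_CHROM_LENGTH=4000000

# Dry run: prints the command. Remove the leading "echo" to actually run it.
echo singularity exec hybracter_0.7.3.sif hybracter hybrid-single \
  -l hybridassem/output/chopper/Iso16_chop.fastq \
  -1 hybridassem/output/fastp/Iso16_fp_R1.fastq \
  -2 hybridassem/output/fastp/Iso16_fp_R2.fastq \
  -s Iso16 \
  -o hybridassem/output/hybracter/Iso16_hybracter \
  -t 32 --flyeModel --nano-raw \
  -c "$MIN_CHROM_LENGTH"
```

Adding --subsample_depth 200 to the end of the command is the other option mentioned above.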

This should result in a complete assembly and probably your problem disappearing, but I do want to fix this bug nonetheless!

George