Contig vs. consensus, and inconsistencies in assembly size

desireetillo commented 6 months ago

Ask away!

Apologies in advance if I'm posting this in the wrong place.

I ran release v.1.0.0 twice on the same input reads using the default parameters and with a sample sheet with the approx_size set to 4500 (my expected plasmid size is 4.5kb), and I obtained two assemblies that differed in size and quality:

run #	size	mean quality
1	13422	20.19
2	4531	45.87

Here are the read stats for the input reads:

[Images: read statistics for all reads, run 1 downsampled, and run 2 downsampled]

The output for my other sample (similar read counts) seemed to be more consistent. What is the recommended fix for this (i.e. increasing the stability of runs)? Should I increase the coverage parameter to use more reads for assembly?

I also note that the .fastq files for each assembly have different headers:

@C_contig_1 for run 1 (incorrect size) and @cluster_001_consensus for run 2 (correct size). Why are these different? Does the header indicate how the final assembly was chosen? Is it safe to assume that the assembly with the contig header is less accurate / of lower quality?

Thanks!

sarahjeeeze commented 5 months ago

Hi, interesting, thanks for letting us know. It's hard to say what is causing the inconsistent assemblies without looking at the reads. Would you mind sharing the full command you are using or, if possible, the input fastq file? Is there any chance your sample is contaminated? Are there any repetitive regions in your plasmid? The dot plot should show any that exist, and they may be causing the duplicated assembly. Ideally the workflow should handle these cases via the deconcat step (which takes the approx size into account), but for some reason it is missing this one. Increasing coverage may help improve stability.
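As a quick sanity check on the numbers reported above: 13422 is very close to three times the expected 4500 bp, which fits the "duplicated assembly" explanation. The sketch below only illustrates that arithmetic. It is not the workflow's deconcat step, and the function names and the `tolerance` value are hypothetical choices made for illustration.

```python
def copies_of_expected(assembly_len, approx_size):
    """Return the assembly length expressed as a multiple of the expected size."""
    if approx_size <= 0:
        raise ValueError("approx_size must be positive")
    return assembly_len / approx_size


def looks_concatenated(assembly_len, approx_size, tolerance=0.15):
    """Heuristic: True if the assembly is close to 2x, 3x, ... the expected size.

    `tolerance` is an arbitrary illustrative threshold (in units of copies),
    not a value taken from wf-clone-validation.
    """
    ratio = copies_of_expected(assembly_len, approx_size)
    nearest = round(ratio)
    return nearest >= 2 and abs(ratio - nearest) <= tolerance


if __name__ == "__main__":
    # Sizes from the two runs in this thread, approx_size from the sample sheet.
    for run, size in ((1, 13422), (2, 4531)):
        ratio = copies_of_expected(size, 4500)
        print(f"run {run}: {size} bp = {ratio:.2f}x expected, "
              f"concatenated? {looks_concatenated(size, 4500)}")
```

A ratio near a whole number of 2 or more suggests the plasmid was assembled as several copies joined end to end, rather than contamination adding unrelated sequence.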

As for the naming in the fastq header, you are correct: the header indicates how the final assembly was chosen, and the one with consensus means the medaka consensus step ran successfully. You can go into the medaka_consensus process folder and check the logs, which may indicate why medaka consensus was not run.
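To compare outputs from several runs without opening each file, something like the following can list the header, length, and mean quality of every record in a final-assembly fastq. This is a minimal sketch with several assumptions: records are plain 4-line FASTQ with Phred+33 qualities, the mean is a simple arithmetic mean of the scores (which may not match how the workflow report computes "mean quality"), and the `_consensus` suffix check is based only on the two headers quoted in this thread.

```python
import io


def summarize_fastq(handle):
    """Yield (name, length, mean_phred) for each 4-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        name = header[1:].split()[0]
        phred = [ord(c) - 33 for c in qual]  # assumes Phred+33 encoding
        mean_q = sum(phred) / len(phred) if phred else 0.0
        yield name, len(seq), mean_q


def has_consensus_header(name):
    """True for names like 'cluster_001_consensus' (pattern seen in this thread)."""
    return name.endswith("_consensus")


if __name__ == "__main__":
    # Tiny in-memory example; replace with open("path/to/final.fastq") for real data.
    demo = io.StringIO(
        "@cluster_001_consensus\nACGT\n+\nIIII\n"
        "@C_contig_1\nAC\n+\n!!\n"
    )
    for name, length, mean_q in summarize_fastq(demo):
        print(name, length, f"{mean_q:.2f}", has_consensus_header(name))
```

A record whose name lacks the consensus suffix is a cue to look at the medaka_consensus logs for that sample.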

desireetillo commented 5 months ago

Thanks for your reply. Here is the fastq file for the sample in question; it should be this plasmid: https://www.addgene.org/165148/. The command I ran was as follows:

nextflow -log nextflow-job-Gjzg5fQ00ybybvv33gqKPkfp.log run wf-clone-validation --fastq /home/dnanexus/data/fastq --out_dir FANCY_5_11_2024 --db_directory wf-clone-validation-db --threads 4 -c my.config --sample_sheet /home/dnanexus/sample_sheet/Sample_sheet_fixed.csv

The contents of my.config were just to set resources for the process:

executor {
    $local {
        cpus = 8
        memory = "32 GB"
    }
}

The dotplots for the two assemblies:

Run 1

Run 2 (correct size)

sarahjeeeze commented 5 months ago

Hey, I got more robust results with your data set using the Canu assembly tool (--assembly_tool canu), so I would recommend that.

Looking at the reads, I am not really sure why, and I will look into it more when I get time, but it is often the case that one of the two assemblers gives better results for a given set of reads.

desireetillo commented 5 months ago

Thanks for looking into this, I will give Canu a try.

sarahjeeeze commented 4 months ago

Hi, any update on this - did you find Canu to be better?

desireetillo commented 4 months ago

Not yet (haven't had time) but I will test in the coming days.

desireetillo commented 4 months ago

Yes, it does seem more consistent using the Canu assembler. Thanks!

epi2me-labs / wf-clone-validation