Closed: desireetillo closed this issue 4 months ago.
Hi, interesting, thanks for letting us know. It's hard to say what is causing the inconsistent assemblies without looking at the reads. Would you mind sharing the full command you are using or, if possible, the input fastq file? Is there any chance your sample is contaminated? Are there any repetitive regions in your plasmid? The dot plot should indicate if there are any, and they may be causing the duplicated assembly. Ideally the workflow should handle these cases via the deconcatenation step (which takes the approximate size into account), but for some reason it is evidently missing them here. Increasing the coverage may help improve stability.
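For example, a minimal sketch of raising the subsampling depth, assuming your release exposes the coverage option as --assm_coverage (the default is 60x in recent releases; confirm the exact option name with --help):

nextflow run wf-clone-validation \
    --fastq <your fastq dir> \
    --sample_sheet <your sample sheet> \
    --assm_coverage 150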
For the naming in the fastq header: you are correct, the header indicates how the final assembly was chosen, and the one with "consensus" means the medaka consensus step ran successfully. You can go into the medaka_consensus process folder and check the logs, which may indicate why medaka consensus was not run.
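If you are unsure where that process folder is, one way to find it (standard Nextflow tooling, not specific to this workflow) is to list each task's work directory from the last run and filter for the medaka step:

nextflow log
nextflow log last -f name,status,workdir | grep -i medaka

You can then inspect the .command.log and .command.err files inside the reported work directory.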
Thanks for your reply. Here is the fastq file for the sample in question; it should be this plasmid: https://www.addgene.org/165148/. The command line run was as follows:
nextflow -log nextflow-job-Gjzg5fQ00ybybvv33gqKPkfp.log run wf-clone-validation --fastq /home/dnanexus/data/fastq --out_dir FANCY_5_11_2024 --db_directory wf-clone-validation-db --threads 4 -c my.config --sample_sheet /home/dnanexus/sample_sheet/Sample_sheet_fixed.csv
The contents of my.config were just to set resources for the process:
executor {
    $local {
        cpus = 8
        memory = "32 GB"
    }
}
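(As an aside, Nextflow can also set resources per task in the process scope rather than capping the executor; a minimal sketch of that alternative:)

// sets per-task resource requests rather than an executor-wide cap
process {
    cpus = 4
    memory = '16 GB'
}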
The dotplots for the two assemblies: [dotplot images attached]
Hey, I got more robust results with your data set using the Canu assembly tool (--assembly_tool canu), so I would recommend that. Looking at the reads, I am not really sure why, and I will look into it more when I get time, but it is often the case that one of the two assemblers gives better results for a given set of reads.
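For reference, that would be your earlier command with the assembler switched, along these lines (the log file and output directory names here are just placeholders):

nextflow -log nextflow-canu.log run wf-clone-validation \
    --fastq /home/dnanexus/data/fastq \
    --out_dir FANCY_canu \
    --db_directory wf-clone-validation-db \
    --threads 4 \
    -c my.config \
    --sample_sheet /home/dnanexus/sample_sheet/Sample_sheet_fixed.csv \
    --assembly_tool canu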
Thanks for looking into this, I will give Canu a try.
Hi, any update on this - did you find Canu to be better?
Not yet (haven't had time) but I will test in the coming days.
Yes, it does seem more consistent using the Canu assembler. Thanks!
Apologies in advance if I'm posting this in the wrong place.
I ran release v1.0.0 twice on the same input reads, using the default parameters and a sample sheet with approx_size set to 4500 (my expected plasmid size is 4.5 kb), and I obtained two assemblies that differed in size and quality.
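For reference, the relevant sample sheet row looked something like this (a sketch: the barcode and alias values are placeholders, and the approx_size column name is assumed from the workflow's documented sample-sheet format):

barcode,alias,approx_size
barcode01,my_plasmid,4500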
Here are the read stats for the input reads: [read stats attached for all reads, run 1 downsampled, and run 2 downsampled]
The output for my other sample (similar read counts) seemed to be more consistent. What is the recommended fix for improving run-to-run stability? Should I increase the coverage parameter to use more reads for assembly?
I also note that the .fastq files for each assembly have different headers: @C_contig_1 for run 1 (incorrect size) and @cluster_001_consensus for run 2 (correct size). Why are these different? Does the header indicate how the final assembly was chosen? Is it safe to assume that the assembly with the contig header is less accurate / of lower quality? Thanks!