Summary
During validation of the ONT nanopore pipeline, a few issues were identified. This PR resolves those issues and modifies some of the default parameters so that the quality of genomes generated by the ONT nanopore pipeline matches the quality of existing genomes submitted to public data repositories.
Issues Addressed
[x] Raw input data for the ONT pipeline must end in .fastq or .fastq.gz to be recognized by the artic guppyplex command in ApplyLengthFilter (i.e. samplename.fq and samplename.fq.gz are not accepted)
[x] VADR sometimes fails in the RunVadr step (for both Illumina and ONT technologies) with the following error:
ERROR in cmalign_run(), cmalign failed in a bad way, see vadr-output/vadr-output.vadr.NC_045512.align.r1.s0.stdout for error output
This failure occurs after the necessary outputs have been generated, so handling for this error has been added.
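The two issues above could be handled along the following lines. This is only a minimal sketch, not the code in this PR: the helper names `guppyplex_compatible_name` and `vadr_failure_is_tolerable` are hypothetical, and the list of expected VADR output files is left to the caller because this description does not name them.

```python
from pathlib import Path
from typing import Iterable

# Suffixes that artic guppyplex recognises, per the issue described above.
ACCEPTED_SUFFIXES = (".fastq", ".fastq.gz")

# Marker text copied from the VADR error quoted above.
CMALIGN_ERROR = "ERROR in cmalign_run(), cmalign failed in a bad way"


def guppyplex_compatible_name(filename: str) -> str:
    """Map .fq / .fq.gz names to .fastq / .fastq.gz (hypothetical helper)."""
    if filename.endswith(ACCEPTED_SUFFIXES):
        return filename
    if filename.endswith(".fq.gz"):
        return filename[: -len(".fq.gz")] + ".fastq.gz"
    if filename.endswith(".fq"):
        return filename[: -len(".fq")] + ".fastq"
    raise ValueError(f"not a FASTQ file name: {filename}")


def vadr_failure_is_tolerable(log_text: str, expected_outputs: Iterable[Path]) -> bool:
    """Tolerate only the known cmalign error, and only if outputs were written.

    Any other error, or a missing or empty output file, is still a failure.
    """
    if CMALIGN_ERROR not in log_text:
        return False
    outputs = list(expected_outputs)
    # Require at least one expected file so an empty list cannot pass by default.
    return bool(outputs) and all(
        p.is_file() and p.stat().st_size > 0 for p in outputs
    )
```

Checking that the outputs exist before ignoring the error keeps an unrelated VADR failure from being silently swallowed.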
Parameters Updated
[x] normalise parameter: genomes are higher quality with normalise = 1000 than with the initially specified default of normalise = 200
[x] medaka_model parameter: medaka model r941_min_high_g360 has produced better results (matching the published version of the associated genomes) than the previously specified medaka model. Additionally, this value matches the default model used by the medaka library itself.
[x] min_length parameter: the previous min_length value in the ApplyLengthFilter function resulted in most of the tested clearlabs data being filtered out. The min_length value has been reduced to 350 to match the clearlabs value. The output genomes remain high quality.
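To illustrate why the min_length change matters, here is a small sketch of a length filter using the updated defaults. It assumes an inclusive lower bound (reads of exactly min_length are kept). The read lengths and the stricter comparison threshold of 400 are invented for illustration and are not taken from the tested data or from the previous default.

```python
from typing import Sequence

# Default values as described in this PR.
UPDATED_DEFAULTS = {
    "normalise": 1000,
    "medaka_model": "r941_min_high_g360",
    "min_length": 350,
}


def fraction_retained(read_lengths: Sequence[int], min_length: int) -> float:
    """Fraction of reads whose length is at least min_length."""
    if not read_lengths:
        return 0.0
    kept = sum(1 for length in read_lengths if length >= min_length)
    return kept / len(read_lengths)


# Hypothetical read lengths, for illustration only.
example_lengths = [340, 360, 380, 420]

# With the updated default, three of the four example reads are kept.
kept_new = fraction_retained(example_lengths, UPDATED_DEFAULTS["min_length"])

# With a hypothetical stricter threshold of 400, only one read is kept.
kept_strict = fraction_retained(example_lengths, 400)
```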
Integration Testing
After these updates were implemented, seven ONT Artic v3 datasets obtained from SRA were run through the ONT pipeline and benchmarked against their final GISAID genomes. The results obtained by this pipeline were identical to those uploaded to GISAID, with the exception of a few samples where the uploaded genomes contained 2 Ns that are now called as reference. All samples passed VADR checks and looked good when QC-checked in NextClade.
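A comparison of this kind can be sketched as follows. This is not the script used for the benchmarking above: the function name `compare_consensus` is hypothetical, and it assumes the two sequences are already the same length and in the same coordinates, so no alignment is performed.

```python
from typing import List, Tuple


def compare_consensus(published: str, pipeline: str) -> Tuple[int, List[Tuple[int, str, str]]]:
    """Compare two equal-length consensus sequences position by position.

    Returns the number of positions where the published genome has an N but
    the pipeline genome has a called base, plus a list of all other
    differences as (1-based position, published base, pipeline base).
    """
    if len(published) != len(pipeline):
        raise ValueError("sequences must be the same length")
    n_resolved = 0
    other_differences = []
    pairs = zip(published.upper(), pipeline.upper())
    for position, (pub_base, pipe_base) in enumerate(pairs, start=1):
        if pub_base == pipe_base:
            continue
        if pub_base == "N":
            n_resolved += 1
        else:
            other_differences.append((position, pub_base, pipe_base))
    return n_resolved, other_differences
```

Under this scheme, a sample whose only differences are Ns in the published genome reports a nonzero first value and an empty list of other differences.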