assemble_refbased aligns 0 reads

marnieya commented 8 months ago

Hello, I am having an issue where I convert paired fastq's to ubam, then run the assemble_refbased pipeline, but 0 reads align despite no errors coming up in either of the steps. These are fastq's simulated from SARS-CoV-2, but I have validated with an external tool showing no clear errors and they also align with bwa-mem (screenshots below of the fastq's, the validation, and the alignment stats)

Any idea why this is happening or where to look for more clues? Parameters I should adjust? Thanks!

tomkinsc commented 8 months ago

Thanks for posting screenshots. If you can share a copy of the assemble_refbased stdout and stderr output logs and either the input fastq and bam files, or the exact invocations used to produce them, I can try to reproduce the issue on my end and trace the cause.

That said, if the fastq reads shown in the first screenshot are representative, the simulated basecall quality scores may be part of the story, depending on the aligner specified for the workflow and options passed to it; the Q scores shown are all quite low (ASCII-to-Q-score table). What's the average basecall score reported for the uBam if you grep for average quality in the samtools stats output?

Tangentially related, but what tool was used to create the simulated reads? (mason2 is what Heng Li used to simulate short reads in the minimap2 paper, and what the minimap2 docs suggest for evaluating mapping quality).

marnieya commented 8 months ago

Hello, apologies for the late response, and thank you for the pointers. I intentionally simulated some low quality reads, but I am still having the issue with more normal qualities too. Here is another case with better quality reads:

These also fail to align using assemble_refbased but align with bwa mem; the samtools stats file has average quality 28.1 for these.

These are the stderr and stdout for assemble_refbased:

stderr.txt stdout.txt

I also attached two fastq's which are a minimal example -- they are just the first reads from the files in the screenshots (they still map with bwa mem) if it would help reproduce the problem.

short_5x_cov.simulated_215_read1.txt short_5x_cov.simulated_215_read2.txt

My fastq's are generated using a custom perl script, and run through the pipeline with default commands, i.e.:

miniwdl run "fastq_to_ubam.wdl" FastqToUBAM.fastq_1=${fq1} FastqToUBAM.fastq_2=${fq2} FastqToUBAM.library_name="5x_cov.simulated_215" FastqToUBAM.platform_name="ILLUMINA" FastqToUBAM.sample_name="5x_cov.simulated_215"

miniwdl run "assemble_refbased.wdl" reads_unmapped_bams=${ubam_path} reference_fasta="SARS_CoV_2_Wuhan_ref_genome.fasta"

The reference I am using is the Wuhan-Hu-1 genome for SARS-CoV-2: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta

Thanks in advance for continuing to help with this issue!

marnieya commented 8 months ago

Hi again, I think I figured out the issue. Looking at thefastq_to_ubam step: I now see that the bam generated by fastq_to_ubam has a lot of "!" which is the lowest quality -- definitely not the same as you can see in the fastq screenshots above (mostly "<" and other higher qualities). I ran Picard outside of the pipeline step and indeed, the platform is autodetected as Solexa:

I ran Picard again but this time manually specified -V=Standard (for phred+33 scaling) and after that, the output looks correct. However, I don't see that as a parameter for the fastq_to_ubam pipeline. How do I specify the encoding so the qualities aren't misread?

tomkinsc commented 8 months ago

Since we shell out to picard, the quality encoding detection of is ultimately determined by htsjdk, though as you've noted 1) it can be overridden, and 2) the FastqToUBAM task does not allow for overrides. In the new PR #518, I've exposed a new input for the task, additional_picard_options, allowing QUALITY_FORMAT or other options to be passed as a string to picard FastqToSam (usual picard arg format, space separated).

As for where these reads are getting lost, I converted the fastqs you posted and the one read pair within to a bam file and ran it through assemble_refbased. Inspecting the intermediate oututs, the data lack reads going into the ivar trimming stage of the pipeline.

This means they were lost upstream in the initial align_to_ref call.

The align_to_ref task outputs two bam files: one with all mapped reads (aligned_bam), and one that we filter to only include the highest quality read mappings (aligned_only_reads_bam).

In this case, the one with all reads does contain mapped reads, however the filtered bam does not.

The filtered bam is what is passed downstream to tasks in the rest of the workflow.

The filtering function ultimately filters based on a few criteria, depending on the arguments specified and whether the reads are single-end or paired-end.

After the initial mapping of the two read mates you posted above, the mapped reads have SAM flag values of 65 and 129 for read1 and read2, respectively, indicating they're mapped and paired, but notably, bit 0x2 is not set ("read mapped in proper pair"=false).

Our filtering excludes all reads not marked as being mapped in a proper pair to avoid including the split mappings, chimeric reads, etc. we've occasionally encountered during past projects working with data from various library prep protocols. Normally this filtering is not a problem for us.

As for why the example reads do not have the "proper pair" bit set after initial alignment, and what the requirements are for having a "read mapped in [a] proper pair", as samtools says, that bit indicates "each segment [is] properly aligned according to aligner," so it's up to each aligner to decide.

The default aligner used in assemble_refbased is minimap2. I couldn't find documentation describing minimap2's criteria for deciding whether a read pair is "properly paired", leaving the matter a question for Heng Li or a dive into the source code; it should be in part a function of the mapping score, taking account insert size/distance between mapped mates, etc.

I did find a comment where Heng Li mentions "Minimap2 won't work well at 15% error rate on ~150bp reads. Probably stampy or bwa-mem will work better."

Most users of these pipelines generally do not want to keep low-quality reads or marginal mappings, so I'd suggest that you consider modifying the workflow and task WDLs on your end to suit your needs.

Two potential changes would be to modify assembly_refbased.wdl so aligned_bam is used downstream rather than aligned_only_reads_bam, and to set min_coverage=1 (or other value of your choosing) in the workflow call. Then pass --keep_all_reads to the assembly.py refine_assembly call in the refine_assembly_with_aligned_reads task. in That should at least get reads to GATK in the call_consensus task for local (Smith-Waterman) realignment of reads for indel calling, though the realignment step enforces a minimum basecall quality threshold of 15 so lower-quality reads may still be lost.

In the assemble_refbased workflow, you can also change the aligner used to bwa mem by setting the aligner input to "bwa". Perhaps also try different settings in the align_to_ref_options and align_to_self_options input maps of the WDL to make the alignment more permissive of low-quality reads—possibly setting --no-pairing for minimap2 with a low value for -w (though performance may take a hit), or for bwa mem, perhaps -S.

marnieya commented 8 months ago

Thank you so much for the detailed response! I was able to get it to work by changing the Picard options when calling fastq_to_ubam and lowering min_coverage / setting aligner="bwa" when calling assemble_refbased. If I have further issues I will try the other fixes-- much appreciated!

broadinstitute / viral-pipelines

assemble_refbased aligns 0 reads #509