Illumina / DRAGMAP

DRAGEN open-source mapper

mapper produces alignments with impossible coordinates #12

Open biork opened 3 years ago

biork commented 3 years ago

I am mapping some FASTQs using a built-from-source dragen-os on:

wget "ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/p13/hg38.p13.fa.gz"

with MD5:

9930e3071308dc5e9b546934b43a2323 hg38.p13.fa.gz

I have discovered that an apparently problem-free run of the mapper generated 296253 alignments with start coordinates far beyond the target sequence's length. Notably, this error only arises on the last sequence in hg38.p13.fa (chrX_ML143384v1_fix), suggesting some kind of search termination bug in dragen(?)

Griffan commented 2 years ago

Hi, @biork. I have tried to reproduce this problem using our internal dataset but failed. Could you please share with us your: version number, cmdline, and maybe a few fastq reads that are aligned to this contig? Thanks!

biork commented 2 years ago

Hi Griffan,

It was easily reproduced once I regenerated the input FASTQs (extracted from earlier bwa alignments).

For what it's worth there is clearly a "theme" of repeat sequences with CCATT motifs in the incorrect output, though it's not present in all bad alignments.

dragen-os --version: 1.2.1-2-ge4050868, using boost 1.69.0

LD_LIBRARY_PATH=../lib/boost/lib ../bin/dragen-os \
  -r ../resources/ucsc/hashtables \
  -1 sample.1.fq.gz \
  -2 sample.2.fq.gz \
  --num-threads 48 2> dragmap.err \
  | python3 ../bin/verify-mapping-coords.py 2> badalign.sam \
  | samtools view -b -o sample-nmord.bam &
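The contents of verify-mapping-coords.py are not shown in this thread. For readers who want to run a similar check, the following is a minimal sketch of what such a filter might look like, not the script actually used. It assumes the mapper's SAM stream carries @SQ header lines (with SN and LN tags) ahead of the alignment records, and it treats an alignment as "impossible" when its POS exceeds the LN of its reference sequence. The names `parse_sq_line`, `has_impossible_start`, and `filter_sam` are hypothetical.

```python
def parse_sq_line(line):
    """Return (name, length) from an @SQ header line, or None if SN/LN is missing."""
    tags = {}
    for field in line.rstrip("\n").split("\t")[1:]:
        if ":" in field:
            key, value = field.split(":", 1)
            tags[key] = value
    if "SN" in tags and "LN" in tags:
        return tags["SN"], int(tags["LN"])
    return None


def has_impossible_start(record, lengths):
    """True if the record's 1-based POS lies beyond the end of its reference."""
    fields = record.rstrip("\n").split("\t")
    if len(fields) < 4:
        return False  # not a well-formed alignment line; leave it alone
    rname, pos = fields[2], int(fields[3])
    length = lengths.get(rname)
    if length is None:
        return False  # unplaced read ("*") or reference not in the header
    return pos > length


def filter_sam(in_stream, good_out, bad_out):
    """Copy header and plausible records to good_out, out-of-range ones to bad_out.

    Returns the number of records written to bad_out.
    """
    lengths = {}
    n_bad = 0
    for line in in_stream:
        if line.startswith("@"):
            if line.startswith("@SQ"):
                parsed = parse_sq_line(line)
                if parsed is not None:
                    lengths[parsed[0]] = parsed[1]
            good_out.write(line)
        elif has_impossible_start(line, lengths):
            bad_out.write(line)
            n_bad += 1
        else:
            good_out.write(line)
    return n_bad
```

Calling `filter_sam(sys.stdin, sys.stdout, sys.stderr)` would match the stream layout of the command above: good records continue down the pipe to samtools, and rejected records land in whatever file stderr is redirected to.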

Supplying reads is tricky since it's real data. I can probably safely provide a few, but I need to check first, and I'm not going to post those to GitHub.

I'm good with gdb if there is debugging you'd like me to try on my end.

biork commented 2 years ago

More information: I ran dragen with a new reference (different from the one above) and more real data, and it is again emitting impossible coordinates for the last contig in the reference FASTA, and only for the last contig. In this case the last contig is mitochondrial, but I don't think this is relevant.
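One way to confirm the "last contig only" pattern is to tally the rejected alignments by reference name and compare against the final sequence in the reference FASTA. The sketch below is illustrative only. It assumes the rejected records are plain SAM lines (such as those collected in badalign.sam) and that the FASTA is readable as uncompressed text. The function names are hypothetical.

```python
from collections import Counter


def tally_by_contig(sam_lines):
    """Count alignment records per RNAME, skipping header and blank lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) > 2:
            counts[fields[2]] += 1
    return counts


def last_contig_name(fasta_lines):
    """Return the name of the final sequence in a FASTA, or None if there is none.

    The name is taken as the text after '>' up to the first whitespace.
    """
    last = None
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                last = parts[0]
    return last


def only_last_contig_affected(sam_lines, fasta_lines):
    """True if every rejected record maps to the last contig in the FASTA."""
    counts = tally_by_contig(sam_lines)
    return bool(counts) and set(counts) == {last_contig_name(fasta_lines)}
```

Both functions accept any iterable of lines, so open file handles can be passed in directly, for example `only_last_contig_affected(open("badalign.sam"), open("hg38.p13.fa"))` once the FASTA has been decompressed.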