I ran Picard's ValidateSamFile on my aligned human exome sequencing data, and encountered the following:
ERROR::INVALID_TAG_NM:Record 1, Read name HWI-D00328:58:H7EAEADXX:2:1108:10580:81442, NM tag (nucleotide differences) in file [2] does not match reality [4]
ERROR::INVALID_TAG_NM:Record 170972, Read name HWI-D00328:58:H7EAEADXX:2:2211:14392:16693, NM tag (nucleotide differences) in file [0] does not match reality [19]
ERROR::INVALID_TAG_NM:Record 1437790, Read name HWI-D00328:58:H7EAEADXX:1:2104:5326:19260, NM tag (nucleotide differences) in file [1] does not match reality [3]
... And so on for some 1500 records in total for this particular BAM file.
Looking at the first offending read in the BAM file (i.e. record 1):
I then ran samtools calmd -bAr sorted.bam $ref > sorted_calmd.bam.
E.g. First offending record:
[bam_fillmd1] different NM for read 'HWI-D00328:58:H7EAEADXX:2:1108:10580:81442': 2 -> 4
[bam_fillmd1] different MD for read 'HWI-D00328:58:H7EAEADXX:2:1108:10580:81442': '1G6C92' -> '0N0N0N5C92'
... and so forth for the remaining problem reads.
While samtools calmd seems to resolve the problem, I'd like to understand why this happens in the first place and whether this may point to a bug.
To be sure this NM/MD tag discrepancy wasn't the result of the merging process vs. the alignment, I ran ValidateSamFile on one of the BAM files prior to merging, and still encountered these errors.
Commands resulting in the offending BAM:
## Align
ref="/home/shared/hg_align_db/GRCh38_gencode_primary/GRCh38.primary_assembly.genome.fa"
bwa="/home/shared/programs/bwa-mem2/./bwa-mem2"
for sample in Sample*
do cd /mnt/bdata/shared/SF9495/exome/raw_data/$sample
for lane in "L001" "L002"
do
for ea in $(ls *R1* | grep "$lane");
do
R1=$ea
R2=$(echo $R1 | sed "s/R1/R2/")
$bwa mem -t 10 -T 0 -R "@RG\tID:1\tLB:$sample\tPL:illumina\tSM:$sample\tPU:$lane" $ref $R1 $R2 |
samtools view --threads 10 -hb -o ../../bam/$(echo $ea | sed "s/fastq.gz/bam/")
done;
done;
done;
## Merge/sort
cd /mnt/bdata/shared/SF9495/exome/bam
picard="/home/shared/programs/Picard/picard.jar"
for sample in $(ls ../raw_data | grep Sample.* | sed "s/Sample_//")
do sem -j 3
bamlist=$(for f in "$sample"*; do echo -n "I=$f "; done)
java -jar $picard MergeSamFiles $bamlist O=merged_sorted/"$sample"_sorted.bam
done;
Have you checked if this NM count is already present in the original bwa mem v0.7.17 output, or is this a difference between the original mem and mem2?
I ran Picard's ValidateSamFile on my aligned human exome sequencing data, and encountered the following:
... And so on for some 1500 records in total for this particular BAM file.
Looking at the first offending read in the BAM file (i.e. record 1):
I then ran
samtools calmd -bAr sorted.bam $ref > sorted_calmd.bam
. E.g. First offending record:... and so forth for the remaining problem reads.
While samtools calmd seems to resolve the problem, I'd like to understand why this happens in the first place and whether this may point to a bug.
To be sure this NM/MD tag discrepancy wasn't the result of the merging process vs. the alignment, I ran ValidateSamFile on one of the BAM files prior to merging, and still encountered these errors.
Commands resulting in the offending BAM:
Using bwa mem v. 2.2.1 Reference file is Gencode primary assembly (GRCh38)
cpuinfo:
gcc -v