genome-in-a-bottle / giab_data_indexes

This repository contains data indexes from NIST's Genome in a Bottle project.

HG002 2x250 BAMs are double-covered by identical reads #5

ctsa closed 5 years ago

ctsa commented 5 years ago

The following BAM file for HG002:

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams/HG002.hs37d5.2x250.bam

...seems to erroneously contain 2 copies of every read pair. For instance, a simple view of the BAM shows:

D00360:97:H2YVMBCXX:2:1107:18923:87587  163     1       10114   6       42M2S   =       10407   337     TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCTT    DDDDDIIIIIIIIIIIIIHIIIIIHHIHIIIIIIII=<G?HI11    PG:Z:novoalign  AS:i:21 UQ:i:21 NM:i:0  MD:Z:42 PQ:i:22 SM:i:0  AM:i:0
D00360:97:H2YVMBCXX:2:1107:18923:87587  163     1       10114   6       42M2S   =       10407   337     TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCTT    DDDDDIIIIIIIIIIIIIHIIIIIHHIHIIIIIIII=<G?HI11    PG:Z:novoalign  AS:i:21 UQ:i:21 NM:i:0  MD:Z:42 PQ:i:22 SM:i:0  AM:i:0
D00

...and so on for every read.
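
For anyone who wants to check a file for the same symptom, here is a minimal sketch rather than a definitive tool. It assumes the input is tab-separated SAM text (for example the output of `samtools view`), and it treats two records as duplicates only when all 11 mandatory SAM fields match exactly. The function name `count_duplicate_records` is made up for this illustration, and the demo record is the one shown above with its optional tags shortened.

```python
from collections import Counter


def count_duplicate_records(sam_lines):
    """Return (total_records, extra_copies) for an iterable of SAM text lines.

    Two records count as duplicates only if all 11 mandatory SAM fields
    (QNAME through QUAL) are identical. Optional tags are ignored.
    """
    seen = Counter()
    total = 0
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with "@").
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        seen[tuple(fields[:11])] += 1
        total += 1
    # Every copy beyond the first occurrence of a record is an extra copy.
    extra = sum(n - 1 for n in seen.values() if n > 1)
    return total, extra


# Demo using the record from this issue, repeated twice as in the BAM.
record = "\t".join([
    "D00360:97:H2YVMBCXX:2:1107:18923:87587", "163", "1", "10114", "6",
    "42M2S", "=", "10407", "337",
    "TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCTT",
    "DDDDDIIIIIIIIIIIIIHIIIIIHHIHIIIIIIII=<G?HI11",
    "PG:Z:novoalign", "AS:i:21",
])
print(count_duplicate_records([record, record]))  # (2, 1)
```

On real data, the lines could come from `sys.stdin` with `samtools view` piped in. Every distinct record is held in memory, so this is only practical on a small region of an indexed BAM, not the whole file.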

cmdcolin commented 5 years ago

The README suggests this was fixed: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams/README_update_feb2019

Update: February 6, 2019

Because of an error in merging BAM files, the files previously in this directory had read duplicates. The reads have been realigned/re-merged with the same version of and options for novoalign as described below, and the current BAM files in the directory are now accurate.

jzook commented 5 years ago

Yes, this is now fixed - thanks for the reminder to close!