corresponding BAM file for NA12877.vcf.gz

ksw9 commented 6 years ago

Hi, we are exploring the Platinum truth-set variant calls for NA12877 (aws s3 cp s3://platinum-genomes/2017-1.0 pg2017 --recursive). Which BAM file were these variant calls made from, i.e. which BAM matches this VCF?

The ENA BAM file for NA12877 (ftp://ftp.sra.ebi.ac.uk/vol1/ERA172/ERA172924/bam/NA12877_S1.bam) has a header that does not match the reference genome Homo_sapiens/NCBI/GRCh38Decoy/Sequence/WholeGenomeFasta/genome.fa reported in NA12877.vcf.gz.

For example, the BAM file header contains:

@SQ SN:chrM LN:16571
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10    LN:135534747
@SQ SN:chr11    LN:135006516
@SQ SN:chr12    LN:133851895
@SQ SN:chr13    LN:115169878
@SQ SN:chr14    LN:107349540
@SQ SN:chr15    LN:102531392
@SQ SN:chr16    LN:90354753
@SQ SN:chr17    LN:81195210
@SQ SN:chr18    LN:78077248
@SQ SN:chr19    LN:59128983
@SQ SN:chr20    LN:63025520
@SQ SN:chr21    LN:48129895
@SQ SN:chr22    LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
@RG ID:NA12877  SM:NA12877

The reference FASTA index (.fai), however, lists different contig lengths, so it appears to describe a different reference sequence:

less  /ifs/labs/andrews/walter/varcal/data/platinum/Homo_sapiens/NCBI/GRCh38Decoy/Sequence/WholeGenomeFasta/genome.fa.fai 

chr1    248956422       112     80      81
chr2    242193529       252068602       80      81
chr3    198295559       497289663       80      81
chr4    190214555       698064029       80      81
chr5    181538259       890656400       80      81
chr6    170805979       1074464000      80      81
chr7    159345973       1247405166      80      81
chr8    145138636       1408743076      80      81
chr9    138394717       1555696057      80      81
chr10   133797422       1695820821      80      81
chr11   135086622       1831290824      80      81
chr12   133275309       1968066142      80      81
chr13   114364328       2103007506      80      81
chr14   107043718       2218801515      80      81
chr15   101991189       2327183393      80      81
chr16   90338345        2430449584      80      81
chr17   83257441        2521917271      80      81
chr18   80373285        2606215543      80      81
chr19   58617616        2687593620      80      81
chr20   64444167        2746944069      80      81
chr21   46709983        2812193914      80      81
chr22   50818468        2859487897      80      81
chrX    156040895       2910941708      80      81
chrY    57227415        3068933262      80      81
chrM    16569   3126876142      80      81
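The comparison above can also be done programmatically. The following is only a minimal sketch, assuming the BAM header has been dumped to text (for example with `samtools view -H`) and the `.fai` file has been read into a string; the function names are hypothetical and not part of any Platinum Genomes tooling. It splits fields on any whitespace, which is good enough for the `SN` and `LN` tags shown here.

```python
def parse_sq_lengths(header_text):
    """Map contig name -> length from the @SQ lines of a SAM/BAM header."""
    lengths = {}
    for line in header_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "@SQ":
            continue
        # Each remaining field looks like TAG:VALUE, e.g. SN:chr1 or LN:249250621.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if "SN" in tags and "LN" in tags:
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def parse_fai_lengths(fai_text):
    """Map contig name -> length from a .fai index (columns 1 and 2)."""
    lengths = {}
    for line in fai_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def mismatched_contigs(bam_lengths, ref_lengths):
    """Return {name: (bam_len, ref_len)} for shared contigs whose lengths differ."""
    return {
        name: (bam_len, ref_lengths[name])
        for name, bam_len in bam_lengths.items()
        if name in ref_lengths and bam_len != ref_lengths[name]
    }


# Two contigs copied from the listings in this comment.
header = "@SQ SN:chrM LN:16571\n@SQ SN:chr1 LN:249250621\n@RG ID:NA12877 SM:NA12877\n"
fai = "chr1 248956422 112 80 81\nchrM 16569 3126876142 80 81\n"
for name, (bam_len, ref_len) in mismatched_contigs(
    parse_sq_lengths(header), parse_fai_lengths(fai)
).items():
    print(f"{name}: BAM header {bam_len} vs reference {ref_len}")
```

Any contig reported by `mismatched_contigs` indicates that the BAM was aligned against a different assembly than the one the index describes.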

eberle commented 6 years ago

Hi,

I'm glad that you are using this data. Regarding your question, the BAM file does not "correspond" to a particular VCF file. The "platinum" calls were made by merging the output of a variety of pipelines, so there were multiple alignments (primarily bwa and Isaac). The data submitted to ENA came from BAM files aligned with Isaac on hg19, so it will not match the chromosome information if you are using the hg38 calls. Because these calls are based on a variety of informatics pipelines, we also don't expect a single pipeline to replicate everything, but re-aligning this data against hg38 using bwa should provide a very close approximation.
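As an illustration of the re-alignment suggested above, here is a rough sketch that only assembles a `bwa mem | samtools sort` command line and does not run anything. The helper name, file names, thread count, and read-group values are all assumptions chosen for the example, not the settings used to produce the Platinum calls, and the reference must already have been indexed with `bwa index`.

```python
import shlex


def build_realign_command(reference, fastq1, fastq2, out_bam,
                          sample="NA12877", threads=8):
    """Build a shell pipeline: paired-end bwa mem piped into samtools sort.

    The command is returned as a string and never executed here.
    """
    # bwa mem expects the read group with literal "\t" separators.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}"
    bwa = ["bwa", "mem", "-t", str(threads), "-R", read_group,
           reference, fastq1, fastq2]
    # "-" makes samtools sort read the alignments from standard input.
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return f"{shlex.join(bwa)} | {shlex.join(sort)}"


# Hypothetical file names, for illustration only.
print(build_realign_command("genome.fa", "reads_1.fastq.gz",
                            "reads_2.fastq.gz", "NA12877.hg38.bam"))
```

Building the command as argument lists and quoting it with `shlex.join` keeps the read-group string intact when the pipeline is pasted into a shell.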

ksw9 commented 6 years ago

Thanks for this - it's really helpful to understand how the different resources can be related to each other! Best,

ksw9 commented 6 years ago

Hi, are the original BAM files available? I have tried remapping ERR194146_1.fastq.gz and ERR194146_2.fastq.gz with bwa mem and am running into a strange issue: the @SQ header lines get duplicated within the BAM. I am using the recommended reference, Homo_sapiens/NCBI/GRCh38Decoy/Sequence/WholeGenomeFasta/genome.fa. Do you have any suggestions as to what might be causing this error? Thank you!
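The thread does not establish what causes the duplication, but a small check can at least show which contigs are repeated and how often. This is a minimal sketch, assuming the header has been dumped to text (for example with `samtools view -H`); the function name is hypothetical.

```python
from collections import Counter


def duplicated_sq_names(header_text):
    """Return {contig_name: count} for SN values that occur in more than one @SQ line."""
    names = []
    for line in header_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "@SQ":
            continue
        for field in fields[1:]:
            if field.startswith("SN:"):
                names.append(field[len("SN:"):])
    return {name: count for name, count in Counter(names).items() if count > 1}


# Toy header in which chr1 appears twice.
example = "@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:chr2\tLN:242193529\n@SQ\tSN:chr1\tLN:248956422\n"
print(duplicated_sq_names(example))
```

If every contig shows the same count, the whole header block was probably written more than once; if only some contigs repeat, the reference itself may be worth inspecting. Both readings are guesses rather than a diagnosis.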

Illumina / PlatinumGenomes

corresponding BAM file for NA12877.vcf.gz #6