Closed ksw9 closed 6 years ago
Hi,
I'm glad that you are using this data. Regarding your question: the BAM file does not "correspond" to a particular VCF file. The "platinum" calls were made by merging the output of a variety of pipelines, so there were multiple alignments (primarily bwa and Isaac). The data submitted to ENA came from BAM files aligned with Isaac against hg19, so it will not match the chromosome information if you are using the hg38 calls. Because these calls are based on a variety of different informatics pipelines, we also don't expect a single pipeline to replicate everything, but re-aligning this data against hg38 using bwa should provide a very close approximation.
Thanks for this - it's really helpful to understand how the different resources can be related to each other! Best,
Hi, are the original BAM files available? I have tried remapping ERR194146_1.fastq.gz and ERR194146_2.fastq.gz with bwa mem and am running into a strange issue: the @SQ header lines get duplicated within the BAM. I am using the recommended reference: Homo_sapiens/NCBI/GRCh38Decoy/Sequence/WholeGenomeFasta/genome.fa. Do you have any suggestions as to what might be causing this error? Thank you!
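To confirm which @SQ lines are duplicated, the header can be dumped as text (for example with `samtools view -H`) and scanned for repeated sequence names. The following is only a minimal sketch: the function name `duplicated_sq_names` and the example header are hypothetical and are not taken from the actual BAM in question.

```python
from collections import Counter


def duplicated_sq_names(header_text):
    """Return the sorted sequence names that occur in more than one @SQ line."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # SAM header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = line.split("\t")[1:]
        tags = dict(f.split(":", 1) for f in fields if ":" in f)
        if "SN" in tags:
            names.append(tags["SN"])
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up header for illustration only: chr1 is listed twice.
example_header = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@SQ\tSN:chr2\tLN:242193529",
    "@SQ\tSN:chr1\tLN:248956422",
])

print(duplicated_sq_names(example_header))
```

If every name comes back duplicated, the whole sequence dictionary was probably written twice rather than a few contigs being repeated, which may help narrow down the step that introduced it.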
Hi, We are exploring the platinum truthset variant calls for NA12877 (aws s3 cp s3://platinum-genomes/2017-1.0 pg2017 --recursive). Which is the correct corresponding BAM file from which these variant calls were made (i.e. the BAM to match this VCF)?
The ENA BAM file for NA12877 (ftp://ftp.sra.ebi.ac.uk/vol1/ERA172/ERA172924/bam/NA12877_S1.bam) has a header which does not match the ref genome Homo_sapiens/NCBI/GRCh38Decoy/Sequence/WholeGenomeFasta/genome.fa reported in the NA12877.vcf.gz.
For example, the BAM file header:
The reference FASTA dictionary, by contrast, appears to describe a different reference sequence.
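A mismatch like this can be checked by comparing the @SQ lines of the BAM header with those of the reference's sequence dictionary (a `.dict` file uses the same @SQ line format as a SAM header). The sketch below is an illustration under assumptions: the function names are hypothetical, and the two inputs are small made-up examples (an hg19-length chr1 versus a GRCh38-length chr1 plus an invented `extra_contig`), not the real headers from this thread.

```python
def parse_sq(header_text):
    """Map sequence name (SN) to length (LN) for every @SQ line."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = line.split("\t")[1:]
        tags = dict(f.split(":", 1) for f in fields if ":" in f)
        if "SN" in tags and "LN" in tags:
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def compare_sq(bam_header_text, ref_dict_text):
    """Report contigs unique to either side and contigs whose lengths differ."""
    bam = parse_sq(bam_header_text)
    ref = parse_sq(ref_dict_text)
    return {
        "only_in_bam": sorted(set(bam) - set(ref)),
        "only_in_ref": sorted(set(ref) - set(bam)),
        "length_mismatch": {
            name: (bam[name], ref[name])
            for name in sorted(set(bam) & set(ref))
            if bam[name] != ref[name]
        },
    }


# Illustrative inputs only: chr1 has different lengths in hg19 and GRCh38.
bam_header = "@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:249250621"
ref_dict = "@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:extra_contig\tLN:1000"

report = compare_sq(bam_header, ref_dict)
print(report)
```

Any entry under `length_mismatch` indicates that the BAM was aligned against a different assembly than the one the dictionary describes.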